Why Data Science Needs a Change in Perspective

Sahaj Somani
Published in The Startup · 6 min read · Nov 22, 2020


Photo by Paul Calescu on Unsplash

Data is the new oil. Data scientist is the sexiest job of the 21st century. And fancy-sounding concepts within the field, such as machine learning and artificial intelligence, are buzzwords that get a ton of attention. We have been making decisions (and even inferences) using our past experiences (aka data) since the beginning of time, so why has data science suddenly become the talk of the decade? It's like that girl with braces in high school who no one gave much attention to, but who suddenly became hot shit in college as soon as her braces came off. And what have been the implications of this sudden popularity and demand?

So why does data science need a change in perspective? And what is data science in its simplest form?

TL;DR: machine learning and models are overrated. If you truly want to solve a problem, find the right kind of data (from anywhere in this digital world), the kind that speaks the truth and will be able to answer the question you are looking to answer. Everything else is secondary. PS: finding this right kind of data will require some original thinking and a lot of boring work such as scraping, cleaning, and, obviously, trial and error!

Data science has two main branches: analysis (or hypothesis testing) and predictive modelling. The former refers to discovering and testing theories, such as the idea that more advertising expenditure leads to more revenue, and the latter refers to building a decision function from data that helps us predict or classify, such as handwriting recognition or auto-tagging people in a photograph. The professionals who deal with these branches are called data analysts and data scientists respectively.
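To make the first branch concrete, here is a minimal sketch in Python, testing a made-up version of the advertising-vs-revenue theory with an off-the-shelf correlation test. The numbers are purely illustrative, not real data.

```python
# A toy "analysis" example: is higher ad spend associated with higher revenue?
# The figures below are invented for illustration only.
import numpy as np
from scipy import stats

ad_spend = np.array([10, 12, 15, 18, 20, 24, 30, 35])         # monthly ad spend ($k)
revenue = np.array([101, 109, 118, 135, 139, 158, 186, 208])  # monthly revenue ($k)

r, p_value = stats.pearsonr(ad_spend, revenue)
print(f"correlation r = {r:.2f}, p-value = {p_value:.4f}")
# A small p-value supports the theory that the two move together,
# though it says nothing about causation.
```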

Data science is not a new field; researchers and companies have been collecting data for a long time, primarily through surveys, then analysing it to draw conclusions and modelling it with linear models (linear regression) to map relationships. So why the sudden hype? It is because of the amount and type of data we can now collect. In this digitalised economy, most things happen on the internet, and every click from every user is data. Having the infrastructure and tools to store and process all this data has given rise to the field of big data and made us all the more excited.
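That older workflow looks roughly like the sketch below: a small, survey-sized table fit with an ordinary linear model. It reuses the same illustrative numbers as above and makes no claim to be a real study.

```python
# Mapping a relationship with a plain linear model (the pre-big-data workflow).
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [12], [15], [18], [20], [24], [30], [35]])  # $k
revenue = np.array([101, 109, 118, 135, 139, 158, 186, 208])           # $k

model = LinearRegression().fit(ad_spend, revenue)
print(f"revenue ≈ {model.intercept_:.1f} + {model.coef_[0]:.2f} * ad_spend")
```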

Due to this popularity, every company wants to hire a data scientist, and due to the wage premium, most people want to become data scientists. This wave has given rise to a lot of new data science college programs, an awesome community, and a ton of online resources and courses, but in *my opinion*, the courses and material out there focus a little too much on model building, mostly because it sounds fancy. Given the catchy names, who wouldn't want to learn how to fit a recurrent neural net to their data, huh?! All the courses I took in college had to do with fitting models, whether it was data mining or deep learning. All the courses online teach you how to fit every kind of model with the ease of built-in libraries such as sklearn, but I want to take a step back for a minute and think about what we are doing and what it is that we are trying to achieve.
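To be fair to that criticism, here is roughly how little friction "fitting every kind of model" involves: a few lines looping over off-the-shelf sklearn estimators on a bundled toy dataset. None of it tells you whether the data was worth fitting in the first place.

```python
# Fitting several off-the-shelf models in a handful of lines.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(), SVC()):
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(f"{type(model).__name__}: {accuracy:.3f}")
```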

I am NOT trying to point out the issue of people these days taking an online course and calling themselves data scientists without knowing the first assumption behind the linear regression model. That is a topic for another post altogether. Here I want to address where I think we might be going wrong in our approach to solving problems using data. Even if you know what you are doing and are fitting the right model, you will get nowhere if your data isn't worthy, meaning it isn't the truth, or it has run its course (other researchers have already found everything that data could tell), or one of the other 160 reasons why you find nothing after doing your analysis. I intend to move the focus from the science to the data.

Surveys have barely proved useful in finding the truth and drawing inference, largely because people don't answer honestly even when the surveys are anonymous. Since people are more concerned with how they will be judged, most data collected through surveys is not the truth and hence will not lead you to the right outcome. Similarly, picking up clean datasets from the internet and fitting models on them, as we all have done in college, will also get one nowhere, because researchers have been working with that data for years and have most likely exhausted any original insight that could be drawn from it. So if machine learning models are secondary and the data readily available to us is not of much use, how are we supposed to solve problems? Pausing for effect…

By looking for untapped, original sources of data that are very likely to represent the truth. You won't need to go to extraordinary lengths to find these datasets, nor is the size of your dataset of much importance. All that matters is that you find the right data to help you solve the problem at hand or test your hypothesis.

Now, the only reason I am presenting this opinion is that I have myself walked the path I just described as inefficient and have learned along the way. Yay, look at me being a data scientist :) But let me present a real-life example, one that tells the story of every CS college student with a Robinhood account, to give more clarity.

Daily price data for stocks is available on the internet and easily accessible, so a common approach is to take that data, build technical indicators from it, and then use those indicators as features to predict the future price of the underlying stock. An immediate improvement to the model is to interact those indicator variables to capture price-volume action, or to define an easier-to-predict target such as the direction of the price rather than the future price itself, which requires predicting both direction and magnitude. Even after all these improvements and after testing every possible hyper-parameter, the model's accuracy will rarely exceed that of the baseline (always predicting the majority class).

So how should one go about solving this problem? Well, in the early 2000s, the answer was to look at other data, such as option open interest and foreign institutional buying and selling, to see what the smart money was doing. After the financial crash, the trick became getting data from dark pools for a better idea of the order flow. In 2020, some say that looking at where the money from all the Robinhood accounts is going might carry a signal, because most retail traders act on tips and are very responsive to news, yet the wave is strong enough to create one-sided (usually buying) pressure that moves the price of the stock. This is why today's hedge funds spend a good portion of their budget on procuring quality data. I am not sure whether the Robinhood approach has merit, but I think I've made my point: every time, solving the problem required thinking of an original dataset that would represent a new kind of information.
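As a rough sketch of that college-student workflow (simulated prices, arbitrary indicator choices, no claim of realism): build a couple of indicators, predict next-day direction, and compare against the majority-class baseline.

```python
# Sketch of the indicator-based approach: simulated prices, two simple
# indicators, next-day direction as the target, majority-class baseline.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))

df = pd.DataFrame({"close": close})
df["ret_1d"] = df["close"].pct_change()                              # daily return
df["sma_gap"] = df["close"] / df["close"].rolling(10).mean() - 1     # gap to 10-day average
df["vol_10"] = df["ret_1d"].rolling(10).std()                        # 10-day volatility
df["direction"] = (df["close"].shift(-1) > df["close"]).astype(int)  # next day up?
df = df.iloc[:-1].dropna()                                           # drop unlabeled last row and warm-up rows

X = df[["ret_1d", "sma_gap", "vol_10"]]
y = df["direction"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = max(y_test.mean(), 1 - y_test.mean())   # always guess the majority class
print(f"model accuracy:    {model.score(X_test, y_test):.3f}")
print(f"baseline accuracy: {baseline:.3f}")
# On price data alone, the model rarely beats this baseline by a meaningful margin.
```

Each of the "fixes" listed above (interaction terms, a different target, more tuning) slots into this same skeleton; the data underneath is what never changes.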

Why is this the road less taken? Because it requires hours of looking for data and then some more to scrape, clean, and prepare it before you can perform any kind of analysis. Not to mention that these tasks are boring and might have to be repeated a few times if we're unhappy with the data. Hence, we normally take the easier road and try to fit every possible model, with hyperparameter tuning as a substitute for better data.
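For completeness, the boring part looks something like the sketch below. The URL and the cleaning steps are placeholders for whatever the data in question actually needs, not a real endpoint or a fixed recipe.

```python
# The unglamorous step: pull a raw table off a page and clean it up.
# The URL below is a placeholder, not a real data source.
from io import StringIO

import pandas as pd
import requests

url = "https://example.com/some-data-table"
html = requests.get(url, timeout=30).text
raw = pd.read_html(StringIO(html))[0]            # first HTML table on the page

clean = raw.rename(columns=str.lower).drop_duplicates()
clean = clean.dropna(subset=[clean.columns[0]])  # drop rows missing the key column
# Typical follow-ups: coerce numeric columns, parse dates, reconcile units,
# and then do it all again when the data turns out not to be what you hoped.
```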

PS: The goal of this post was not to belittle model building or deep learning. They are making great strides in their areas of application, such as handwriting recognition, image classification, and speech recognition. My concern is more with solving business-related problems, which is the job of the majority of data scientists.
