Much has been said about the dire shortage of Data Scientists looming on the horizon. With the spectre of Big Data casting shadows over every domain, it would seem we need nothing short of a caped wonder to help us see the light. Heralded as superheroes, Data Scientists will swoop into an organization and free the Lois Lane of latent knowledge from the cold clutches of Big Data. In the end, the enterprise bystanders will marvel at the amazing powers these superhumans possess. Everyone will be happy, and the Data Scientist will get the girl.
It’s a great story and a great time to be a nerd. As much as I want to believe this story, I just don’t buy it. True, there is more data being produced now than ever before. The rate of data production is growing exponentially, and people need to be able to analyze this data. Yet this dire need feels manufactured. The promoters of Data Science point to the McKinsey study that cites a “shortage of 140,000–190,000 people with deep analytical skills” by 2018. That’s a lot of Data Scientists! Some people claim that every organization will eventually need at least one Data Scientist and perhaps even have its own department. This all sounds fantastic (who wouldn’t want a legion of super-nerds to be a force in culture?), except there are three significant problems with the hyperbole surrounding Data Science: selection bias, assimilation blindness, and automation blindness. What we’ll see is that the need for Data Scientists is likely smaller than advertised, with a startlingly short half-life.
Selection Bias
The first problem is that people assume all 150k “people with deep analytical skills” are Data Scientists. First let’s look at the math. Suppose every organization does need at least one Data Scientist. We start with the number of public companies listed on major exchanges in the US as a proxy for “every organization”, which is about 5,000. Why is this a reasonable proxy? Because smaller companies probably don’t have the budget to support full-time data scientists. Adding businesses listed in OTC markets, we can roughly double that number. Fine, so let’s say 10,000 companies. Then on average each organization would need a team of 15 Data Scientists. Wow, I see a lot of dollar signs piling up alongside the map-reduce queries.
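For what it’s worth, the back-of-the-envelope division behind that figure is simply:

$$\frac{150{,}000 \ \text{people with deep analytical skills}}{10{,}000 \ \text{companies}} = 15 \ \text{Data Scientists per company}$$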
Clearly there must be other professions that require deep analytical skills but aren’t Data Scientists. Look at the cross-section of people who use R and you’ll see people in Psychology, Economics, Biology, Finance, etc. The biggest population by far is the traditional group you think of when you think analysis: engineering. McKinsey hints at this when they list the Internet of Things as one of the sources of the exponential growth in data. This version of the future, popularized by GE, points to Computational Engineers as filling most of this population. When GE alone is hiring 400 people to fill one development center, it’s plausible that such engineers make up hundreds of thousands of the projected shortage.
Assimilation Blindness
The next problem is what I call assimilation blindness. Even if a shortage of this scale did exist for Data Scientists, it wouldn’t be sustained. As understanding of Big Data and analytical methods becomes more widespread, the need for specialists will diminish. A good example is how web developers used to be a prized resource but are now commoditized, since even high school students can build web sites (or iPhone apps, for that matter). Data Scientists will find their role assimilated quickly, since it differs from traditional roles only in its big data component. What is the role of a Data Scientist? It is still up for debate, but here are some of the most popular themes I’ve seen:
- Telling stories with data (including visualization) – This is what marketers do. As tools become easier to use and analytical methods more pervasive, presumably many people in the marketing department will know how to take advantage of these tools directly rather than relying on a Data Scientist.
- Finding insights in data – This is what business analysts do. They’ve been trained to use analytical tools for years and know how to spot interesting phenomena in data. The tool set is different, as is the scale of the data, but given that most business analysts know a little SQL and basic statistics, it isn’t a stretch to conclude that they would assimilate many of the functions a Data Scientist fills.
- Creating products from data – This is what product managers do. In finance there are plenty of data products, and they aren’t managed or invented by Data Scientists. As data products become more mainstream, more people in the product management arena will know how to ask questions of data directly because they will have learned these skills themselves.
Hence, while there may be a shortage in the short term, over time the Data Scientist will lose his cape and disappear into the crowd.
Automation Blindness
The functions of a Data Scientist that aren’t assimilated will likely be automated away. Not recognizing this phenomenon is what I call automation blindness. Numerous startups and big players such as IBM are developing tools to simplify big data analysis. Currently, a big portion of a Data Scientist’s role is bringing together data from disparate sources to make an analysis possible. Once this is automated, the need for specialists will again decline.
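As a rough sketch of the kind of glue work those tools are targeting (the file names, columns, and keys below are hypothetical), stitching together two sources is already only a few lines of pandas:

```python
import pandas as pd

# Hypothetical sources: web analytics events and a CRM export.
events = pd.read_csv("web_events.csv", parse_dates=["timestamp"])
customers = pd.read_json("crm_export.json")

# Normalize the join key, then stitch the two sources together.
events["customer_id"] = events["customer_id"].str.strip().str.lower()
customers["customer_id"] = customers["customer_id"].str.strip().str.lower()
combined = events.merge(customers, on="customer_id", how="left")

# Roll up to a per-segment view that an analyst (or a tool) can work with.
summary = combined.groupby("segment")["revenue"].agg(["count", "sum", "mean"])
print(summary)
```

Tools that automate the messy parts of this, matching keys and reconciling formats across systems, remove exactly the work that currently eats a specialist’s time.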
In short, the shortage of Data Scientists is shrouded in the myths of storytellers. There is definitely a need for people with analytical skills, and over time that need will separate into general skills that are broadly assimilated and advanced skills used by engineers to design tools and systems that rely on data for their proper function.
Hi Brian –
I agree with your points and am interested in the marketing use of big data, “telling stories.”
As marketers, we need visibility, through business analysts or directly, to “spot interesting phenomena in data.” The trends in our customer base should inform our storytelling.
As a marketer in the trenches with customer-facing campaigns, I find it inspiring to get insights about how to improve the customer experience, which will inherently make our campaigns more effective.
I’d be curious to hear what others are doing to gain this transparency within their organizations.
Best Regards,
Stacy Ries
Stacy, thanks for your response. With software becoming so pervasive, I think A/B testing and other forms of real-time hypothesis testing are really going to change the way people look at the product development process. I’m excited to see how marketing and product development evolve in response to the maturation of these tools and methods. Cheers, Brian
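For anyone curious, the statistical core of such a test is tiny. Here’s a minimal sketch of a two-proportion z-test; the variant counts are made up and not tied to any particular product:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions out of visitors for each variant.
conversions = [310, 355]   # variant A, variant B
visitors = [5000, 5000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# If p falls below a pre-chosen threshold (say 0.05), variant B's lift is
# unlikely to be explained by chance alone.
```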
The principal issue with Big Data is its obviousness. The only reason to incur Big Data (other than the fact that the raw stream is Big) is to find a needle in a haystack. Where the stream is Big, just filter. The analysis follows standard methods. When you’re looking for inference from the needle, then either the needle per se must be awfully valuable, or the needle has some other attribute that makes it valuable.
This meme is warmed-over Long Tail, which Amazon has demonstrated, not intentionally, to be a colossal waste of money. One makes money on the majority, not the infinitesimal (pharma being a growing exception). There really is no point in spending $1 million on data science in order to shift one more widget. Unless, of course, it’s a $2 million widget. But if you set out to design and build a $2 million widget without first knowing how to sell it, well …
— As tools become easier to use and analytical methods more pervasive, presumably many people in the marketing department will know how to take advantage of these tools directly rather than relying on a Data Scientist
This happened years ago with 1-2-3 (Excel), and all manner of lousy accounting/analysis ensued, culminating in The Great Recession. The Clueless wielding a chain saw is never good policy. Relational database folks have been battling with file-level coders for decades; file coders think they know enough to define a schema as a flatfile image, and that such is sufficient. Operations research and applied stats have similar issues.
What’s interesting about the Tail, too, is that almost by definition, the data available on extremes is scarce. That means, at least, that regressions based upon it are poorly balanced. It means that if these are to be interpreted, the individuals must be classed with populations and their behaviors which are much larger. What’s that classification based upon? Upon what they all have in common.
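A quick simulation makes the scarcity point concrete. The distribution and cutoff below are arbitrary, but the instability of anything estimated from a handful of extreme observations is general:

```python
import numpy as np

rng = np.random.default_rng(42)

# Repeat a "study" many times: estimate average spend for the bulk of
# customers versus the extreme Tail (here, the top 0.1%).
head_means, tail_means = [], []
for _ in range(1000):
    spend = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
    cutoff = np.quantile(spend, 0.999)
    head_means.append(spend[spend <= cutoff].mean())
    tail_means.append(spend[spend > cutoff].mean())

# With only ~10 observations per sample, the Tail estimate swings wildly
# from study to study compared with the head estimate.
print("head: mean %.1f, spread %.1f" % (np.mean(head_means), np.std(head_means)))
print("tail: mean %.1f, spread %.1f" % (np.mean(tail_means), np.std(tail_means)))
```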
Some companies assess how well they are doing with these kinds of inferences by how much additional money they make. Compared to what, I say? Is it sustainable? Do they understand why it worked this time? If not, how do they know it is repeatable?
Finally, even if one had the Magic Machine which figured out how to sell to each individual on the Tail a Quirky Unique Thing, the price needs to be low or the cost per transaction needs to be low. What’s the marginal return of that process compared to, say, simply driving more traffic to the site? Does it justify the Hadoop-ish special gizmos needed to process these things? Is it better than traditional data warehousing technology (the “relational database folks” of Robert Young above), which is more or less needed anyway?
I don’t think I’d invest in a company betting on this kind of stuff.
While I agree that visualization and basic stats will become commoditized in the years to come, I don’t know about the spotting-interesting-phenomena part. Your analogy of a web programmer is only partially apt, since web programming basically just requires a knowledge of HTML, which is not that hard to acquire. By contrast, not everyone has the background or ability to understand the nuances of data mining algorithms. How do you know which kernel function to choose for a support vector machine? How, for that matter, do you know when a support vector machine is a better choice as a classifier than a random forest or a logistic regression model? How do you know which data transformations make the most sense for utilizing a given algorithm?
These questions can be automated to a certain extent (SAS automatically transforming your data for a linear model, for example), but ultimately the differences between various techniques, and knowing where they are strong and where they are weak, depend a great deal on understanding the underlying mathematics and comp sci, which a computer can’t do for you. The tradeoffs between various analytical approaches and algorithmic choices are often business decisions, since the results will have real implications for how your models perform in the real world, and tying strategic decision making to analytical approach is not something a computer or an untrained (in data science) human can do effectively.
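To be clear, the mechanical part of comparing candidate models is easy to script; it is the judgment about why one wins, and what its errors cost, that resists automation. A minimal sketch with scikit-learn on synthetic data, default settings throughout:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification problem standing in for real business data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# Score each candidate with 5-fold cross-validation.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

# The scores say which model fit this dataset; they don't say why, whether the
# features were transformed sensibly, or what the misclassifications cost.
```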
I think your math on demand for data scientists is a little presumptive as well. You’re assuming a uniform distribution of data scientists among large firms, which is patently false. There are almost certainly many firms which need one, two, or perhaps no data scientists. But there are also firms that need hundreds (consultancies, very large consumer products firms like P&G or Unilever, web 2.0 firms like Google, etc.) who more than take up the slack in demand. I do, however, think that the term ‘data scientist’ will get diluted as more people strive to learn elements of data science across fields, and in a few years the typical industrial data scientist will not have a PhD, if they even do now. More university programs specific to industrial data science will also spring up (most likely at the graduate but not PhD level), which is already happening and will further serve to increase the supply side of the data science market.
Tom, I agree there will always be a need for experts. My point is that much of the current demand will be reduced in the future as assimilation and automation have their respective effects. I think the uptake will be faster than people realize, hence the point of the title. Cheers, Brian
Tom, I find your post self-illustrating, especially “… ultimately the differences between various techniques and knowing where they are strong and where they are weak depend a great deal on understanding the underlying mathematics and comp sci, which a computer can’t do for you. The tradeoffs between various analytical approaches and algorithmic choices are often business decisions …”. That split between algorithm-oriented and business-oriented concerns reveals the problem: where’s an assessment of the QUALITY and nature of the DATA? I find this common in much of “data mining” work: accepting that the data is what it is and that it inherently represents “truth”. That “big data” cannot be criticized by comparison and control is a major problem with it.
No matter how good an algorithm is, if it’s mismatched to the data it won’t be good, and if the data don’t represent what you want them to represent, there’s no way of pretending or magically converting them to do so.
That’s certainly true, and like most people who work with data for a living I spend the majority of my time (probably 80%) trying to get the right data, verify its source and true meaning, and understand how it relates to other data points within my organization. Data quality and meaning are a huge issue, and another one that I consider too difficult to automate. I didn’t mention it simply because it wasn’t really addressed in the initial post, but it’s a core problem I deal with every day at the intersection of IT, operations, and modeling.
— Your analogy of a web programmer is only partially apt, since web programming basically just requires a knowledge of HTML, which is not that hard to acquire.
While I have little regard for web programming, from dismal experience, real web programming involves knowing a good deal about server-side coding and JavaScript, very ugh. HTML hasn’t been the main skill relevant to web sites since static page tossing went out of fashion a decade ago.
Brian, I definitely hear you about data science becoming spread out across different areas. But isn’t this the case for almost any widely used method? In the end the generalist data scientists kind of go extinct, only to be replaced by either super-specialized data scientists (focusing on a very narrow niche), or people who combine general data science knowledge with narrow content expertise, or, a less common alternative, those who can talk across content and method areas.
But doesn’t that then support the idea of an even further increase in data scientists? Sure, they will be different from what we know today as data scientists, but I’m not sure their being different makes the predictions necessarily false. For example, if at the end of the 19th century somebody had predicted a steep rise in the need for transportation-related industry, would that have meant it had to be horses?
Hello, I agree with the first part of what you said, which is the assimilation portion of my argument. Your conclusion is different from mine, though. If you look at the field of statistics, it has also been heavily assimilated by other fields. Medical Doctors, Physicists, Economists, and Engineers all heavily use statistics in their work, but we wouldn’t call them Statisticians. Though they have adopted tools from the field of statistics, their profession is unchanged. I think this assimilation will be even more widespread with so-called data science (which is kind of a repackaging of statistics plus some tools for big data), meaning that the demand for specialized Data Scientists will peak and fade quickly. I also don’t think the situation is as dire as Robert Young puts it. Sure, there will be ones that don’t really know what they are doing, just like there are a lot of MBAs that don’t really understand CAPM or Black-Scholes, but that doesn’t mean people who aren’t experts are dilettantes. Again, look at statistics as a counterexample.
@Brian, I can’t speak to Physicists or Economists, but in the case of Medical Doctors and Engineers, the uptake from Statistics is quite uneven, some might even say poor. For Medical Doctors, check out Prof Frank Harrell’s “Information Allergy” lecture at http://mediasite.vanderbilt.edu/mediasite/SilverlightPlayer/Default.aspx?peid=44fae65814f2435fbe2e75bad3ec8e9d and one of his lectures at http://biostat.mc.vanderbilt.edu/wiki/pub/Main/ClinStat/lecture1.html. Medical researchers may have a better take. As far as engineers go (I am one), most seem to be frozen in the state they were in when they last took statistics, and that’s often an introductory or theory-of-statistics presentation. Some, of course, encountered more in reliability or implicitly in signal processing, but those areas are stuck in early-to-mid-20th-century ideas. Check out, for instance, the IEEE’s STATISTICS FOR ENGINEERING PROBLEM SOLVING, an otherwise excellent text by Vardeman, or even Elsayed’s RELIABILITY ENGINEERING. Neither mentions a thing about Bayesian approaches, nor even suggests resampling methods (the bootstrap). Still, the IEEE’s Transactions on, say, Signal Processing has many papers which are at the forefront of modern statistical methods. It just hasn’t percolated down to the ranks.
This is unfortunate, as these fields apparently don’t make use of the wonderful advances in simulation-based inference we’ve had, especially the Bayesian revolution with Gibbs samplers and MCMC. Professor Harrell vigorously argues for the use of resampling as a means of validation and other purposes, and tries to introduce Bayesian ideas in his lectures.
In contrast, biologists seem to get it. Many geophysicists get and use it. I say that, not because I’m either, but because I learn many things from their papers.
So, sure, “assimilation” will happen, but it may occur on the order of generations, not years, as people graduate from school with new ideas.
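For readers who haven’t met the resampling methods mentioned above, the core of the bootstrap is only a few lines. This is a generic sketch with simulated data, not anything taken from the cited texts or lectures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 40 measured failure times (hours) of some component.
sample = rng.weibull(a=1.5, size=40) * 1000.0

# Bootstrap: resample with replacement and recompute the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.0f} h, "
      f"95% bootstrap CI = ({lo:.0f}, {hi:.0f}) h")
```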