One way to get a look at the emerging definition of data science is to search for "data science" jobs, and see what they have in common: What are the key skills for data scientists?
[Figure: word cloud. Caption: This wordle sounds like Romney: "jobs jobs jobs..."]
I took a few minutes today to run that search -- automated, of course. Nothing rigorous or scientific, but the results are still plenty interesting.
Methods: I searched for "data scientist" on simplyhired.com, then scraped the ~250 resulting links. All of the live links (there weren't many dead ones) returned a job posting, usually on a company page, occasionally on another aggregator. I grabbed the HTML of each of these pages and cleaned it to get rid of scripts, styles, etc. I didn't do anything fancy to strip out page chrome or ads, so take the results with a grain of salt.
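For anyone who wants to try something similar, here is a minimal sketch of what the cleaning step could look like. It assumes Python and only the standard-library `html.parser` module; the names `TextExtractor` and `clean_html` are made up for illustration, and fetching the pages is left out. It drops the contents of script and style tags and keeps the remaining text, which means navigation and ads stay in.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw_html):
    """Return the text of an HTML page without scripts or styles."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


# Made-up sample page, only to show the call.
sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Data Scientist</h1><p>Hadoop &amp; Python</p></body></html>"
)
print(clean_html(sample))
```

A dedicated HTML library would cope better with badly broken markup; the standard-library parser just keeps the sketch dependency-free.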
First, I generated the obligatory word cloud. Thank you, wordle.
Then I skimmed a dozen of the pages, looking for keywords that seemed to pop up a lot: java, hadoop, python. For the most part, I focused on specific skills that companies are explicitly hiring for. I also tossed in a few other terms, just to see what would happen.
Here are counts of jobs mentioning each keyword (not keyword counts, but the number of separate job postings that include at least one reference to a given keyword):
Evidently, data science postings put equal value on "fun" and "jquery."
Also, at a glance, Java beats Python, which beats C++, in terms of employability. It kind of makes me wish I'd been nicer to Java all these years.
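The per-posting counts described above (how many postings mention a keyword at least once, not how often the keyword appears) can be sketched roughly as follows. This is an illustration under assumptions, not the exact matching used for the numbers in this post: the function names and the sample postings are hypothetical, matching is case-insensitive, and a keyword only counts when it is not embedded in a longer word, so "java" does not match "javascript".

```python
import re


def count_postings_mentioning(postings, keywords):
    """Return {keyword: number of postings that contain it at least once}."""
    counts = {}
    for keyword in keywords:
        # Lookarounds instead of \b so that terms such as "c++" still match.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        counts[keyword] = sum(1 for text in postings if pattern.search(text))
    return counts


def percent(count, total):
    """Share of postings as a whole-number percentage."""
    return round(100 * count / total)


# Made-up postings, only to show the call.
example_postings = [
    "We use Hadoop and Java. Java, Java, Java!",
    "Python and C++ wanted; hadoop a plus. JavaScript helps too.",
    "A fun team that loves jQuery",
]
example_keywords = ["hadoop", "java", "python", "c++", "fun", "jquery"]
example_counts = count_postings_mentioning(example_postings, example_keywords)
print(example_counts)
```

Note that the first sample posting says "Java" four times but still adds only one to the Java count, which is the distinction between posting counts and keyword counts.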
One clear finding is that Hadoop and MapReduce skills are in high demand. That's not news to anyone working in this area, but I was surprised at just how many jobs were looking for these skills. Almost half (107 of 238, 45%) of the job postings explicitly mention Hadoop.
That percentage seems slightly out of whack to me, because there are plenty of valuable ways to mine data without using a MapReduce algorithm. Maybe Hadoop is a pinch point in the job market because there just aren't enough MapReduce-literate data miners out there? If that's the case, I would expect demand to come down (relative to supply) in the not-too-distant future -- MapReduce isn't that hard to learn.
Alternatively, there could be some bandwagoning going on:
"Data is the Next Big Thing. We need to hire a data person."
"What exactly is a data person?"
"I don't know, but I hear they know how to program in 'ha-doop,' so put that in the posting."
As a third explanation, it may be that the meaning of "data science" is narrowing. Instead of encompassing all the things that can be done with data, perhaps it's coming to mean "MapReduce." If that's the case, then "data science" jobs would naturally include Hadoop/MapReduce skills. IMO, that would be sad, because it would be a missed opportunity to change, in a more systemic way, how data flows into decisions.
I'd be interested in hearing other explanations for the dominance of Hadoop. Also, if you have other queries to run against this data set, I'm happy to try them out. What I've put up so far is just back-of-envelope coffee-break stuff.