Wednesday, February 8, 2012

Key skills for job-hunting data scientists

There's a lot of buzz around data science, but (as I've posted about previously) the term is still murky.

One way to get a look at the emerging definition of data science is to search for "data science" jobs, and see what they have in common: What are the key skills for data scientists?

This wordle sounds like Romney: "jobs jobs jobs..."

I took a few minutes today to run that search -- automated, of course.  Nothing rigorous or scientific, but the results are still plenty interesting.

Methods:  I searched a job aggregator for "data scientist", then scraped the ~250 resulting links.  All of the non-dead links (there weren't many dead ones) returned a job posting, usually on a company page, occasionally on another aggregator.  I grabbed the HTML of each of these pages, and cleaned it to get rid of scripts, styles, etc.  I didn't do anything fancy to strip out page chrome or ads, so take the results with a grain of salt.
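For the curious, here is a minimal sketch of the cleaning step in Python, using only the standard library.  This is an illustration, not the script I actually ran: the names `TextExtractor` and `clean_html` are made up for this example, and it only drops the contents of script and style tags, nothing smarter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html):
    """Return the text of an HTML page with scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page, for illustration only.
page = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 'java';</script></head>"
    "<body><p>We need <b>Hadoop</b> skills.</p></body></html>"
)
print(clean_html(page))
```

Note that the "java" inside the script tag is thrown away, which is the whole point: otherwise every page with JavaScript on it would look like a Java job.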

First, I generated the obligatory word cloud.  Thank you, wordle.

Then I skimmed a dozen of the pages, looking for keywords that seemed to pop up a lot: java, hadoop, python.  For the most part, I focused on specific skills that companies are explicitly hiring for.  I also tossed in a few other terms, just to see what would happen.

Here are the counts of job postings mentioning each keyword (not raw keyword counts, but the number of separate postings that include at least one reference to the keyword):

198 data
131 statist
130 java
107 hadoop
85 mining
68 python
49 visuali
39 cloud
37 mapreduce
35 c\+\+
24 amazon
22 ruby
18 bayes
15 ec2
13 jquery
13 fun
3 estimat
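The counting itself can be sketched in a few lines of Python.  Again, this is an illustration under assumptions rather than my actual script: `count_postings` is a hypothetical helper, the sample postings are invented, and I'm assuming each keyword is treated as a case-insensitive regular expression (which is why C++ has to be escaped).

```python
import re


def count_postings(postings, patterns):
    """Return {pattern: number of postings with at least one match}."""
    counts = {}
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)
        # Each posting counts once, no matter how often the keyword appears.
        counts[pattern] = sum(1 for text in postings if regex.search(text))
    return counts


# Made-up sample postings, for illustration only.
postings = [
    "Seeking a statistician with Java and Hadoop experience. Java a must.",
    "Python, C++, and data visualization skills required.",
    "Data mining role: Hadoop, MapReduce, statistics.",
]
patterns = ["data", "statist", "java", "hadoop", r"c\+\+", "visuali"]

counts = count_postings(postings, patterns)
for pattern, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(n, pattern)
```

The first sample posting mentions Java twice but still adds only one to the "java" count.  One caveat with plain substring matching like this: a pattern such as "java" would also match "javascript", so counts from a sketch like this one are best read as upper bounds.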

Evidently, data science postings put equal value on "fun" and "jquery."

Also, at a glance, Java beats Python, which beats C++, in terms of employability.  It kind of makes me wish I'd been nicer to Java all these years.

One clear finding is that Hadoop and MapReduce skills are in high demand.  That's not news to anyone working in this area, but I was surprised at just how many jobs were looking for these skills.  Almost half (107 of 238, 45%) of the job postings explicitly mention Hadoop.

That percentage seems slightly out of whack to me, because there are plenty of valuable ways to mine data without using a MapReduce algorithm.  Maybe Hadoop is a pinch point in the job market because there just aren't enough MapReduce-literate data miners out there?  If that's the case, I would expect demand to come down (relative to supply) in the not-too-distant future -- MapReduce isn't that hard to learn.

Alternatively, there could be some bandwagoning going on:
"Data is the Next Big Thing.  We need to hire a data person."
"What exactly is a data person?"
"I don't know, but I hear they know how to program in 'ha-doop,' so put that in the posting."

As a third explanation, it may be that the meaning of "data science" is narrowing.  Instead of encompassing all the things that can be done with data, perhaps it's coming to mean "MapReduce."  If that's the case, then "data science" jobs would naturally include Hadoop/MapReduce skills.  IMO, that would be sad, because it would be a missed opportunity to change, in a more systemic way, how data flows into decisions.

I'd be interested in hearing other explanations for the dominance of Hadoop.  Also, if you have other queries to run against this data set, I'm happy to try them out.  What I've put up so far is just back-of-envelope coffee-break stuff.

