Monday, May 27, 2013

Crocodile hunters versus zoo keepers: Aspiring data scientists should speak python

We're hiring data scientists at Jawbone, which means I've been spending a lot of time reviewing resumes and interviewing candidates.  Making the "bring him/her in" or "pass" decision dozens of times an hour has been a great opportunity to focus on the features that make great data science candidates stand out from the crowd.

What I've learned: Python is a great differentiator for job-hunting data scientists.  In the mix of resumes I review, it's the single biggest thing I look for.  Putting python high on your resume says, "Not only do I grok statistics/machine learning/R/SQL/Matlab/Octave/d3/etc, I can build the pipes and plumbing that make data systems work."

When I see resumes with a strong background in practical programming, I think "Crocodile Hunter."  When those things are missing, I think "zoo keeper."  Zoo keepers aren't bad, but I find they need hand-holding every time you go into the jungle to wrangle wild data.

For whatever reason*, I find python sends a stronger signal of these skills than Java, C++, or any other mainstream programming language.  Not having python on your resume is a big handicap in the review process -- not an absolute deal-breaker, but in practice we hardly ever end up interviewing candidates without at least basic python skills.

With that said, "python" is an awfully big field to survey.  What python libraries are the most useful, really?  I've been asked this question several times recently.  Here's what I said in a recent email:
For python, I'd highly recommend installing ipython*, and getting your feet wet with the pandas library.  The matplotlib, json, and requests libraries are also good places to know your way around.  numpy and scipy have good stuff in them, but they're so huge that it's hard to really "know" them.  boto is also worth knowing, but it's only useful if you have a subscription to Amazon Web Services (AWS).
For hadoop, I'd look specifically at mrJob.  It's a python-based wrapper that makes hadoop easier to use.  mrJob is too new to be covered in books yet, but the online documentation is pretty good.  You can use it in test mode even without installing hadoop.
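To give a feel for the "pipes and plumbing" involved, here is a rough sketch of the map/reduce pattern that hadoop wrappers are built around, written in plain python with only the standard library. To be clear, this is not mrJob's API, just the general idea: a mapper turns each raw line into key/value pairs, and a reducer sums the values per key. The sample records and the names mapper, reducer, and run are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample input: one JSON record per line, as a raw log might look.
RAW_LINES = [
    '{"user": "a", "text": "wild data in the jungle"}',
    '{"user": "b", "text": "the zoo"}',
    'not valid json',
    '{"user": "a", "text": "the jungle"}',
]


def mapper(line):
    """Yield (word, 1) for each word in one raw line; skip unusable lines."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return  # wild data: drop lines that do not parse
    if not isinstance(record, dict):
        return  # valid JSON, but not the record shape we expect
    for word in record.get("text", "").lower().split():
        yield word, 1


def reducer(pairs):
    """Sum the values for each key and return a plain dict."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def run(lines):
    """Feed every mapper output pair into the reducer."""
    pairs = (pair for line in lines for pair in mapper(line))
    return reducer(pairs)


word_counts = run(RAW_LINES)
print(word_counts)
```

Everything here runs in a single process, which is fine for getting a feel for the pattern on a small file. The point of a hadoop wrapper is to run the same two steps across many machines.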
Disclaimer: I know that talking up one language over another is one of the cardinal sins of code. I don't want to start a religious war here -- I'm not saying that python is better than C++, or that pandas is the only way to manipulate data in python (although I believe Wes McKinney deserves a platinum-plated "better than sliced bread" award) -- I'm just saying that in the rough-and-tumble of the hiring process, data scientists with real python skills have a big advantage.

Aspiring data scientists, you have been warned.

Practicing data scientists, what would you add to this list?  I've plugged in my favorite modules, but I'm sure there are others also worth a mention.

*I freely admit that the reason may just be in my head.  However, this preference for python seems to be in a lot of other people's heads as well.  Whether there's a real reason for it or it's just a cognitive bias, it's working as a gating factor for a lot of resumes.
