Monday, May 27, 2013

Crocodile hunters versus zoo keepers: Aspiring data scientists should speak python


We're hiring data scientists at Jawbone, which means I've been spending a lot of time reviewing resumes and interviewing candidates.  Making the "bring him/her in" or "pass" decision dozens of times an hour has been a great opportunity to focus on the features that make great data science candidates stand out from the crowd.




What I've learned: Python is a great differentiator for job-hunting data scientists.  In the mix of resumes I review, it's the single biggest thing I look for.  Putting python high on your resume says, "Not only do I grok statistics/machine learning/R/SQL/Matlab/Octave/d3/etc, I can build the pipes and plumbing that make data systems work."

When I see resumes with a strong background in practical programming, I think "Crocodile Hunter."  When that background is missing, I think "zoo keeper."  Zoo keepers aren't bad, but I find they need hand-holding every time you go into the jungle to wrangle wild data.

For whatever reason*, I find python sends a stronger signal of these skills than Java, C++, or any other mainstream programming language.  Not having python on your resume is a big handicap in the review process -- not an absolute deal-breaker, but in practice we hardly ever end up interviewing candidates without at least basic python skills.

With that said, "python" is an awfully big field to survey.  What python libraries are the most useful, really?  I've been asked this question several times recently.  Here's what I said in a recent email:
For python, I'd highly recommend installing ipython*, and getting your feet wet with the pandas library.  The matplotlib, json, and requests libraries are also good places to know your way around.  numpy and scipy have good stuff in them, but they're so huge that it's hard to really "know" them.  boto is also worth knowing, but it's only useful if you have a subscription to Amazon Web Services (AWS).
For hadoop, I'd look specifically at mrJob.  It's a python-based wrapper that makes hadoop easier to use.  mrJob is too new to be covered in books yet, but the online documentation is pretty good.  You can use it in test mode even without installing hadoop.
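To give a flavor of the "pipes and plumbing" work I mean, here's a minimal sketch that uses only the standard-library json module.  The records and field names ("user", "steps") are made up for illustration, and so is the rule of silently skipping bad lines; a real pipeline would probably log them instead.

```python
import json
from collections import defaultdict

# Hypothetical raw input: one JSON record per line, with some junk mixed in.
RAW_LINES = [
    '{"user": "a", "steps": 4000}',
    '{"user": "b", "steps": 7500}',
    'not json at all',
    '{"user": "a", "steps": 2500}',
    '{"user": "b"}',
]


def total_steps(lines):
    """Sum the 'steps' field per user, skipping lines that don't parse
    or that are missing a field."""
    totals = defaultdict(int)
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if not isinstance(record, dict):
            continue
        if "user" in record and "steps" in record:
            totals[record["user"]] += record["steps"]
    return dict(totals)


print(total_steps(RAW_LINES))
```

Nothing fancy, but most wild data needs this kind of defensive parsing before any statistics can happen.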
Disclaimer: I know that talking up one language over another is one of the cardinal sins of code. I don't want to start a religious war here -- I'm not saying that python is better than C++, or that pandas is the only way to manipulate data in python (although I believe Wes McKinney deserves a platinum-plated "better than sliced bread" award) -- I'm just saying that in the rough-and-tumble of the hiring process, data scientists with real python skills have a big advantage.

Aspiring data scientists, you have been warned.

Practicing data scientists, what would you add to this list?  I've plugged in my favorite modules, but I'm sure there are others also worth a mention.


*I freely admit that the reason may just be in my head.  However, this preference for python seems to be in a lot of other people's heads as well.  Whether there's a real reason for it or it's just a cognitive bias, it's working as a gating factor for a lot of resumes.

Monday, May 13, 2013

Simple ways to make prettier graphs

A question on graphs from my cousin. She's good with statistics, but not a programmer.  I've fielded similar questions many times, so figured it'd be worth putting the answer in the public domain.


Any graph that includes the caption "lunch" is a good graph.  Also "nap."
Quick question-- I'm writing up a paper right now and need to stick some simple graphs in. Do you have any suggestions for ways to make graphs that are prettier than Excel to Word (low bar...ha ha! Accidental pun!)?
My response:

Love the pun. :)  [Miscellaneous personal stuff...]
On graphs: How many graphs are we talking?  If it's just a handful, or if they're all different kinds, I'd recommend Photoshop or Illustrator.  Import the graph from excel, and then "trace over" it to give it the styling you'd like.  A lot of great data-centric presentations use this trick. 
Another option is tableau.  It's pricey, but gives you good tools for designing nice-looking graphs, as well as tools for automating them (i.e. generating 20 graphs with the same basic template).  You might be able to use a 30-day trial, and maybe their student licenses are cheaper than the corporate ones.
If you don't want to shell out for tableau and you're doing *lots* of graphs in the same style, then it might be worth climbing the learning curve for matplotlib, ggplot2, or the google charts API.  I doubt this is worth your time, because there'd be such a long learning curve: each of these is a graphing library on top of a programming language, and you'd need facility with both to make them work.
Getting graphs right is fiddly work.  All those axes and labels and spaces to play with, and that's before you even begin to think about borders and color.

I've toyed with the idea of a declarative language for graphs: a syntax for describing the story you want told, without including all the execution details.  For example, "A is more Y than B" should give you a nice bar chart with a tall column labeled "A," and a shorter column labeled "B."  The y-axis should be labeled "Y."

This strikes me as a difficult, but maybe not impossible, challenge...
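A toy version of the "A is more Y than B" idea might look something like the sketch below.  Everything here is an assumption made for illustration: the one-sentence grammar, the function name chart_spec, the shape of the returned dictionary, and the placeholder heights (2 and 1), which only encode "taller" versus "shorter."  Actually drawing the chart would be a separate step handed to a plotting library.

```python
import re

# Hypothetical grammar: "<A> is more <Y> than <B>", with an optional final period.
PATTERN = re.compile(
    r"^\s*(?P<a>.+?)\s+is\s+more\s+(?P<y>.+?)\s+than\s+(?P<b>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)


def chart_spec(sentence):
    """Turn a sentence like 'A is more Y than B' into a declarative
    bar-chart description, with no rendering details."""
    match = PATTERN.match(sentence)
    if match is None:
        raise ValueError("expected a sentence like 'A is more Y than B'")
    return {
        "type": "bar",
        "y_label": match.group("y"),
        "bars": [
            # Heights are placeholders: the story only says A is taller than B.
            {"label": match.group("a"), "height": 2},
            {"label": match.group("b"), "height": 1},
        ],
    }


spec = chart_spec("A is more Y than B")
print(spec["y_label"], [bar["label"] for bar in spec["bars"]])
```

The hard part, of course, is everything this sketch leaves out: real sentences, real numbers, and all the styling decisions a good default would have to make.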

Monday, May 6, 2013

Sadly, I'm not developing tengolo. Would you like to run with it?

A little over a year ago, I announced that I was working on tengolo, a python alternative to netlogo.  I'm not actively developing it -- didn't get much farther than a rough proof-of-concept, really -- but I still get questions about the package.

As a result, I find myself writing some variation on the email below at least a couple times a month:

Dear python/ABM enthusiast - 
Glad you're interested in python and ABMs.  I started on tengolo after a thorough search turned up no good ABM frameworks in python.  I worked on it for a short while, then moved on when my dissertation committee told me to focus on stuff that would actually help me graduate. :) 
I got far enough in to be confident that a python-based ABM framework like tengolo could work. All the code is in the github repository, and every month I get questions from people asking if it's being actively developed. There's clearly demand for the project, but I don't have time to support it at this point.  I'd love to see someone take this ball and run with it. 
Best,
Abe

Would you like to run with this project?  If you're good with python and want to run a potentially popular academic open-source project, tengolo would be a great fit.  Please get in touch, and I will happily direct potential users and collaborators in your direction.