Monday, September 23, 2013

Change of address: blog.abegong.com

I've moved!  Since finishing grad school, I've decided to fold my personal web page and blog together.  In the future, I'll be posting to http://blog.abegong.com.

Also, my posts will focus on "data science" instead of "computational social science."  They're pretty much the same thing, but "data science" seems to be the phrase that's catching on.

See you there!

Wednesday, July 10, 2013

Speed up Hadoop development with progressive testing

Debugging Hadoop jobs can be a huge pain.  The cycle time is slow, and error messages are often uninformative -- especially if you're using Hadoop streaming or working on EMR.

I once found myself trying to debug a job that took a full six hours to fail.  It took more than a week -- a whole week! -- to find and fix the problem.  Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity.  It was a Very Bad Week.



Painful experiences like this have taught me to follow a test-driven approach to Hadoop development.  Whenever I'm working on a new Hadoop-based data pipe, my goal is to isolate six distinct kinds of problems:

  1. Explore the data: The pipe must accept data in a given format, which might not be fully understood at the outset.
  2. Test basic logic: The pipe must execute the intended data transformation for "normal" data. 
  3. Test edge cases: The pipe must deal gracefully with edge cases, missing or misformatted fields, rare divide-by-zeroes, etc. 
  4. Test deployment parameters: The pipe must be deployable on hadoop, with all the right filenames, code dependencies, and permissions.
  5. Test cluster performance: For big enough jobs, the pipe must run efficiently.  If not, we need to tune or scale up the cluster.
  6. Test scheduling parameters: Once pipes are built, routine jobs must be scheduled and executed.

Each of these steps requires different test data and different methods for trapping and diagnosing errors.  The goal, therefore, is to (1) tackle problems one at a time, and (2) solve each kind of problem in the environment with the fastest cycle time.

Steps 1 through 3 should be solved locally, using progressively larger data sets.  Steps 4 and 5 must be run remotely, again using progressively larger data sets.
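
To keep those first three steps fast, I write the core transformation so it can be run against a small local sample before it ever touches a cluster.  Here's a minimal sketch of a Hadoop streaming mapper in that style -- the click-log field names and tab-delimited format are made up for illustration:

#!/usr/bin/env python
# mapper.py -- hypothetical streaming mapper for a click-log pipe.
# Step 2 (basic logic): compute a click-through rate per user.
# Step 3 (edge cases): skip malformed rows and divide-by-zeroes instead of crashing.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        # Hadoop streaming counter syntax: shows up in the job tracker when
        # deployed, and on stderr when run locally.
        sys.stderr.write('reporter:counter:data_quality,malformed_line,1\n')
        continue
    user_id, clicks, impressions = fields
    try:
        rate = float(clicks) / float(impressions)
    except (ValueError, ZeroDivisionError):
        sys.stderr.write('reporter:counter:data_quality,bad_numeric,1\n')
        continue
    print('%s\t%.4f' % (user_id, rate))

Because it's plain stdin/stdout, steps 1 through 3 can be exercised with something like head -1000 sample.tsv | python mapper.py | sort, and the same file ships to the cluster unchanged.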

Step 6 depends on your scheduling system and has a very slow cycle time (e.g. you must wait a day to test whether your daily jobs run on the proper schedule).  However, it's independent of Hadoop, so you can build, test, and deploy it separately.  (There may be some crossover with step 4, but you can test that with small data sets.)

Going through six different rounds of testing may seem like overkill, but in my experience it's absolutely worth it.  Very likely, you'll encounter at least one new bug/mistake/unanticipated case at each stage.  Progressive testing ensures that each bug is dealt with as quickly as possible, and prevents them from ganging up on you.

Other suggestions:
  • Definitely use an abstraction layer that lets you seamlessly deploy local code to your staging and production clusters.  Cascalog and mrjob are good examples (see the mrjob sketch after this list).  Otherwise, you'll find yourself solving steps 2 and 3 all over again in deployment.
  • Config files and object-oriented code can prevent a lot of headaches in step 4.  Most of your deployment hooks can be written once and saved in a config file.  If you have strong naming conventions, then most of your filenames can be constructed (and tested) programmatically -- see the path-construction sketch after this list.  It's amazing how many hours you can waste debugging a simple typo in Hadoop.  Good OOP will spare you many of those headaches.
  • Part of the beauty of Hive and HBase is that they abstract away most of the potential pitfalls on the deployment side, especially in step 4.  By the same token, tools like Azkaban and Oozie can take a lot of the pain out of step 6.  (Be careful, though -- each of these scheduling tools has its limitations.)
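
To make the first suggestion concrete, here's a minimal mrjob sketch (the class, field, and bucket names are hypothetical).  The mapper/reducer logic is written once, and where it runs is decided entirely by command-line flags:

# count_by_user.py -- minimal mrjob example; names are illustrative.
from mrjob.job import MRJob

class CountByUser(MRJob):

    def mapper(self, _, line):
        fields = line.split('\t')
        if len(fields) < 2:
            # Count malformed lines rather than failing the job.
            self.increment_counter('data_quality', 'malformed_line')
            return
        yield fields[0], 1

    def reducer(self, user_id, counts):
        yield user_id, sum(counts)

if __name__ == '__main__':
    CountByUser.run()

Steps 2 and 3 happen locally with python count_by_user.py sample.tsv; step 4 becomes python count_by_user.py -r emr s3://your-bucket/input/ (with cluster details kept in mrjob.conf), so the logic you already tested never gets rewritten for deployment.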
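
And here's the kind of thing I mean by constructing filenames programmatically -- the bucket name, job name, and date pattern below are assumptions, but the point is that every path is built (and unit-tested) in exactly one place:

# paths.py -- hypothetical config-driven path construction for step 4.
import datetime

CONFIG = {
    'bucket': 's3://your-bucket',   # placeholder bucket name
    'job_name': 'click_rate',
    'date_format': '%Y/%m/%d',
}

def input_path(day):
    # e.g. s3://your-bucket/raw/click_rate/2013/07/10/
    return '%s/raw/%s/%s/' % (CONFIG['bucket'], CONFIG['job_name'],
                              day.strftime(CONFIG['date_format']))

def output_path(day):
    return '%s/derived/%s/%s/' % (CONFIG['bucket'], CONFIG['job_name'],
                                  day.strftime(CONFIG['date_format']))

print(input_path(datetime.date(2013, 7, 10)))

A one-character typo in one of these paths gets caught by a fast local test instead of a failed six-hour cluster run.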

Monday, June 17, 2013

Scientists leaving the academy: Pushed, or pulled?

Several of my friends have shared and commented on this article in the Chronicle of Higher Education: "On Leaving Academe."  The author is Terran Lane, a former computer science professor at the University of New Mexico.

The article starts with the (shocking!) revelation that he is leaving his position as a professor to work for Google.  Lane then lists nine reasons for leaving:


  1. Making a difference
  2. Work-life imbalance
  3. Centralization of authority and decrease of autonomy
  4. Budget climate
  5. Hyperspecialization, insularity, and narrowness of vision
  6. Poor incentives
  7. Mass production of education
  8. Salaries
  9. Anti-intellectualism, anti-education, and attacks on science and academe



The tone of the article is very negative.  Lane frames most of his complaints as forces that are pushing him out of the University.  Honestly, it feels a little bit bitter.

As I've discussed this with friends, I've decided that I disagree with the tone, if not the reasons.  I've also made a similar decision to -- temporarily, at least -- leave the academy for the private sector.  But I see the whole experience in a much more positive light.

As I see it, there are growing incentives to find applications for science outside the academy.  Since getting into the startup world, I've met lots of psychologists, economists, and even the occasional political scientist building consumer-facing tools based on well-founded theories of social science.

To me, this feels like an emerging renaissance in applied social science. In other words, it's not just the case that smart, ambitious people are being pushed out of academia; they're being pulled out as well.

In the past, most career paths allowed you to seek the truth OR change the world, but not both.  I'm optimistic that the rising volume and value of data is going to give more scientifically minded people the chance to have their cake and analyze it too.  Eliminating artificial distinctions between "thinkers" and "doers" is good for society overall.