Thursday, May 3, 2012

Running MapReduce on Amazon's publicly available datasets with Python

On Monday, I had a preliminary interview at a really interesting tech startup.  In the course of the conversation, the interviewer mentioned that he'd used some of the technical notes from compSocSci in his own work.  And I thought nobody was reading!

Anyway, I've been sitting on some old EC2/hadoop/python notes for a while.  The conversation gave me the motivation to clean them up and post them, just in case they can help somebody else.  The goal here is threefold:

  1. Fire up a hadoop cluster on EC2
  2. Import data from an EBS volume with one of AWS' public data sets
  3. Use hadoop streaming and python for quick scripting

In other words, we want to set up a tidy, scalable data pipeline as fast as possible.  My target project is to do word counts on Wikipedia pages -- the classic "hello world" of MapReduce.  This isn't super-hard, but I haven't seen a good soup-to-nuts guide that brings all of these pieces together.

Phase 1:
Follow the notes below to get to the digits-of-pi test.  Except for a little trouble with AWS keys, this all went swimmingly, so I see no need to duplicate.  If you run into trouble with this part, we can troubleshoot in the comments.
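For reference, the smoke test itself is just the pi example that ships with Hadoop.  Something like this should work, though the examples jar name varies by Hadoop version, so check what's actually sitting in your install directory:

```shell
cd /usr/local/hadoop
# a tiny job -- 10 map tasks, 100 samples each -- just to prove the cluster works
bin/hadoop jar hadoop-*-examples.jar pi 10 100
```

If that prints an estimate of pi at the end, your cluster is alive and you're ready for real data.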

Phase 2:
Now let's attach an external dataset.  Here's the dataset we'll use:  Wikipedia Extraction (WEX).  It's a processed dump of the English language Wikipedia, hosted publicly on Amazon Web Services under snapshot ID snap-1781757e.

This dataset contains a dump of 1,000 popular English Wikipedia articles.  It's about 70GB.  At Amazon's $0.12/GB-month rate, maintaining this volume costs about $8.40 for a whole month -- cheap!  If you want to scale up to full-size Wikipedia (~500GB), you can do that too.  After all, we're in big data land.

Here's the command sequence to create an EBS volume from this snapshot and attach it to an instance.  You can look up the IDs using ec2-describe-volumes and ec2-describe-instances, or get them from the AWS console.  (Hint: they're not vol-aaaaaaaa and i-bbbbbbbb.)

    ec2-create-volume -snapshot snap-1781757e -z us-east-1a
    ec2-attach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf

It took a while for these commands to execute.  Attaching the volume got stuck in "attaching" status for several minutes.  I finally got tired of waiting and mounted the volume, and then the status switched right away.  Can't say whether that was cause-and-effect or coincidence, but it worked.
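If you'd rather not sit there re-running ec2-describe-volumes by hand, you can poll for the status change from the shell.  A minimal sketch, assuming the hypothetical IDs from above and the classic ec2-api-tools output format (one ATTACHMENT line per attachment, with the status in the fifth column):

```shell
# pull the attachment status out of ec2-describe-volumes output
attach_status () {
  # expects a line like: ATTACHMENT vol-aaaaaaaa i-bbbbbbbb /dev/sdf attached 2012-05-03T...
  grep '^ATTACHMENT' | awk '{print $5}'
}

# poll every 10 seconds until the volume reports "attached"
wait_attached () {
  vol="$1"
  until ec2-describe-volumes "$vol" | attach_status | grep -q '^attached$'; do
    echo "still attaching..."
    sleep 10
  done
}

# usage: wait_attached vol-aaaaaaaa
```

The column position is an assumption based on the tools' tab-separated output; if your version prints a different layout, adjust the awk field accordingly.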

Once you've attached the EBS volume, log in to the instance (instructions here) and mount the volume as follows. This should be pretty much instantaneous.

    mkdir /mnt/wex_data
    mount /dev/sdf /mnt/wex_data
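Before shoveling ~70GB into HDFS, it's worth a quick sanity check that the mount worked and the articles file is where you expect.  (The rawd/ path here matches the copy command in the next step.)

```shell
df -h /mnt/wex_data            # should show /dev/sdf and the volume's full size
ls -lh /mnt/wex_data/rawd/     # look for freebase-wex-2009-01-12-articles.tsv
```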

Now import the data into the Hadoop file system:

    cd /usr/local/hadoop/
    hadoop fs -copyFromLocal /mnt/wex_data/rawd/freebase-wex-2009-01-12-articles.tsv wex-data
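The copy takes a while.  Once it finishes, you can confirm the file actually landed in HDFS and eyeball a record before throwing away the volume -- a quick check, using the same wex-data path as above:

```shell
cd /usr/local/hadoop
hadoop fs -ls wex-data                # size should roughly match the local .tsv
hadoop fs -cat wex-data | head -n 1   # peek at the first record
```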

If you want, you can now detach and delete the EBS volume.  The articles file is stored in the distributed filesystem across the EC2 instances in your Hadoop cluster.  The nice thing is that you can get to this point in under an hour, meaning you only have to pay a tiny fraction of the monthly storage cost.

    ec2-detach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf
    ec2-delete-volume vol-aaaaaaaa

I had some trouble detaching volumes until I used the force flag: -f.  Maybe I was just being impatient again.
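Putting the teardown together: force the detach, wait until the volume reads "available", then delete it.  A sketch, again assuming the classic ec2-api-tools output format (VOLUME line with the state in the sixth column) -- verify the column against your tools' actual output:

```shell
# parse the state column from ec2-describe-volumes output
volume_state () {
  # expects a line like: VOLUME vol-aaaaaaaa 70 snap-1781757e us-east-1a available ...
  grep '^VOLUME' | awk '{print $6}'
}

# force-detach, wait for "available", then delete (hypothetical helper)
delete_when_available () {
  vol="$1"
  ec2-detach-volume "$vol" -f       # -f forces the detach if it hangs in "detaching"
  until ec2-describe-volumes "$vol" | volume_state | grep -q '^available$'; do
    sleep 5
  done
  ec2-delete-volume "$vol"
}

# usage: delete_when_available vol-aaaaaaaa
```

Deleting a volume that's still mid-detach fails, so the wait loop saves you from the impatience problem I kept running into.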

That's enough for the moment.  I'll tackle python in my next post.
