Thursday, May 31, 2012

Bay Area data science people, events

A quick favor: I'm headed out to Palo Alto for a family event in a couple weeks.  While I'm there, I'd love to meet people and find out more about the Bay Area data science scene.

Where should I go?  Who should I meet?

I'm free mainly from Monday the 11th through Wednesday the 13th, with some additional time on Tuesday evening.

This picture is the first result for gImages: "going to the big city."  I like it.

Tuesday, May 29, 2012

Live streaming of Northeastern/Harvard/MIT workshop on computational social science @ IQSS, May 30-June 1

Tomorrow, IQSS is running a conference on computational social science.  I can't attend this year, but the conference organizers have kindly offered to livestream the sessions.  Here's the email from David Lazer.

Hi all,

Please note that we will be live streaming the workshop on computational social science (program below).  The url:

The Twitter hashtag is:  #compsocsci12.  We will monitor this hashtag during the workshops to enable remote Q&A.

If you would like to embed the stream in your website, use this code:

<iframe src="" width="640" height="360" style='border: 0px;'></iframe>

Please feel free to forward this e-mail on to interested parties, and if this has been forwarded to you, and you would like to be added to the list, please contact



Friday, May 18, 2012

Will crunch numbers for food

I don't like self-promotion.  Makes me feel greasy, if you know what I mean.  But graduation is looming, it's a boom year for big data, and there's no hiring pipeline from political science to fun tech jobs. So I figure it's time to hang out my shingle as a data scientist.

Earlier this week, I bought the domain name and worked up a digital resume.  Like I said, I'm not a big self-promotion guru, so I'd be grateful for feedback (or job leads).

Thursday, May 17, 2012

Nifty tools for playing with words

Here are a bunch of sites I use to play with words -- whether brainstorming or trying to accomplish something specific with text analysis.

A rhyming dictionary. Helpfully splits up the word list by syllables, so you can finish that sonnet you've been working on.

Here's a nifty little site for generating portmanteaus (word splices): it generates themed domain names, and checks to make sure they're unclaimed by URL squatters.
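For the curious, the core trick is easy to sketch in a few lines of python.  This is my own toy version, not the site's actual algorithm: splice two words at the longest overlap between the end of one and the start of the other.

```python
def portmanteau(a, b, min_overlap=2):
    """Splice two words at the longest overlap between
    a suffix of `a` and a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return a + b[size:]
    return None  # no overlap of at least min_overlap letters

# portmanteau("sand", "android") -> "sandroid"
# portmanteau("breakfast", "fastfood") -> "breakfastfood"
```

Run it over pairs from a themed word list and you get domain-name candidates for free.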

Online lorem generator.  Here's the same thing in python.
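A minimal sketch of how such a generator works (the vocabulary here is just the traditional lorem ipsum opening, and the length parameters are arbitrary choices of mine):

```python
import random

LOREM_WORDS = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
               "sed do eiusmod tempor incididunt ut labore et dolore").split()

def lorem_sentence(rng, min_words=4, max_words=10):
    # Draw a random number of random words and dress them up as a sentence.
    n = rng.randint(min_words, max_words)
    words = [rng.choice(LOREM_WORDS) for _ in range(n)]
    return " ".join(words).capitalize() + "."

def lorem_paragraph(rng, sentences=3):
    return " ".join(lorem_sentence(rng) for _ in range(sentences))

# lorem_paragraph(random.Random(0)) yields three pseudo-Latin sentences.
```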

Markov text generation:
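The sites above are ready-made generators, but the core idea fits in a few lines (this is my own minimal sketch, not any particular site's code): record which words follow each word in a corpus, then walk the chain, picking a random successor at each step.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it."""
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, rng=random):
    """Walk the chain from `start`, choosing successors at random."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        out.append(rng.choice(followers))
    return " ".join(out)
```

Feed it a big enough corpus and the output starts to sound eerily like the source.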

Permute words and letters.  This seems less useful to me... It gives all the permutations, not just the ones that make some kind of sense.
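For reference, python's standard library does this brute-force enumeration in one call (the example words are mine):

```python
from itertools import permutations

# All orderings of three words -- 3! = 6 of them, sense or no sense.
phrases = [" ".join(p) for p in permutations(["big", "red", "dog"])]

# Letter-level anagrams work the same way; repeated letters would yield duplicates,
# which the set() collapses.
anagrams = set("".join(p) for p in permutations("tea"))
```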

Lavarand used to do random haikus and corporate memos, but it looks like they've broken down.

Google ngrams on AWS public data sets.  These are short sequences of words, with counts of how often they occur in English text.

Yes, yes.  And then there's wordle.  Too pretty for the rest of us.

What else belongs on this list?

Tuesday, May 15, 2012

Python mapreduce on EC2

Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce.  Now let's get to hello world (or rather, countWords) with python scripts.

#!/usr/bin/env python
import sys, re
for line in sys.stdin:
    line = line.lower()
    words = line.split()
    #--- output tuples [word, 1] in tab-delimited format---
    for word in words: 
        print '%s\t%s' % (word, "1")

Here's the reducer script....

#!/usr/bin/env python
import sys
# maps words to their counts
word2count = {}
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from the mapper: "word\tcount"
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently skip this line
        continue
    # accumulate the count for this word
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count
# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s'% ( word, word2count[word] )
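Before paying for cluster time, it's worth sanity-checking the two scripts locally.  Hadoop streaming is just stdin/stdout plus a sort between the stages, so a shell pipeline like cat input | ./mapper.py | sort | ./reducer.py reproduces it.  Here's the same sanity check as a self-contained python sketch (the function name is mine; only the map/sort/reduce logic matters):

```python
def simulate_streaming(lines):
    """Mimic hadoop streaming for the wordcount scripts above:
    map each line to (word, 1) pairs, sort them (hadoop's shuffle),
    then reduce runs of identical words into counts."""
    # map phase
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    # shuffle/sort phase
    pairs.sort()
    # reduce phase
    counts = {}
    for word, count in pairs:
        counts[word] = counts.get(word, 0) + count
    return counts

# simulate_streaming(["Hello world", "hello hadoop"])
# -> {'hello': 2, 'world': 1, 'hadoop': 1}
```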

The command to execute all this in hadoop is a bit of a monster, mainly because of all the filepaths.  Note the usage of the -file parameter, which tells hadoop to ship files to the cluster for use in the -mapper and -reducer arguments. Also, I set mapred.output.compress to false via -jobconf, because I didn't have a handy LZO decompressor installed.

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar -input wex-data -output output/run9 -file /usr/local/hadoop-0.19.0/my_scripts/ -file /usr/local/hadoop-0.19.0/my_scripts/ -mapper -reducer -jobconf mapred.output.compress=false

NB: As I dug into this task, I discovered several pretty good python/hadoop-streaming tutorials online.  The scripts here were modified from:

Other sources:

Thursday, May 3, 2012

Running mapreduce on Amazon's publicly available datasets with python

On Monday, I had a preliminary interview at a really interesting tech startup. In the course of the conversation, the interviewer mentioned that he'd used some of the technical notes from compSocSci in his own work.  And I thought nobody was reading!

Anyway, I've been sitting on some old EC2/hadoop/python notes for a while.  The conversation gave me the motivation to clean them up and post them, just in case they can help somebody else.  The goal here is threefold:

  1. Fire up a hadoop cluster on EC2
  2. Import data from an EBS volume with one of AWS' public data sets
  3. Use hadoop streaming and python for quick scripting

In other words, we want to set up a tidy, scalable data pipeline as fast as possible.  My target project is to do word counts on wikipedia pages -- the classic "hello world" of mapReduce.  This isn't super-hard, but I haven't seen a good soup-to-nuts guide that brings all of these things together.

Phase 1:
Follow the notes below to get to the digits-of-pi test.  Except for a little trouble with AWS keys, this all went swimmingly, so I see no need to duplicate.  If you run into trouble with this part, we can troubleshoot in the comments.

Phase 2:
Now let's attach an external dataset.  Here's the dataset we'll use:  Wikipedia Extraction (WEX).  It's a processed dump of the English language Wikipedia, hosted publicly on Amazon Web Services under snapshot ID snap-1781757e.

This dataset contains a dump of 1,000 popular English wikipedia articles.  It's about 70GB.  At Amazon's $.12/GB rate, maintaining this volume costs about $8 for a whole month -- cheap!  If you want to scale up to full-size wikipedia (~500GB), you can do that too.  After all, we're in big data land.
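The arithmetic, for anyone budgeting the scale-up (the rate is the 2012 EBS price quoted above; the function name is mine):

```python
# Prorated EBS storage cost, assuming the $0.12/GB-month rate quoted above.
def ebs_monthly_cost(size_gb, rate_per_gb_month=0.12):
    return size_gb * rate_per_gb_month

# 70 GB WEX snapshot: about $8.40/month
# 500 GB full-size wikipedia: about $60/month
```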

Here's the command sequence to create an EBS volume for this snapshot and attach it to an instance. You can look up the ids using ec2-describe-volumes and ec2-describe-instances, or get them from the AWS console at  (Hint: they're not vol-aaaaaaaa and i-bbbbbbbb.)

    ec2-create-volume -snapshot snap-1781757e -z us-east-1a
    ec2-attach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf

It took a while for these commands to execute.  Attaching the volume got stuck in "attaching" status for several minutes.  I finally got tired of waiting and mounted the volume, and then the status switched right away.  Can't say whether that was cause-and-effect or coincidence, but it worked.

Once you've attached the EBS volume, login to the instance (instructions here) and mount the volume as follows. This should be pretty much instantaneous.

    mkdir /mnt/wex_data
    mount /dev/sdf /mnt/wex_data

Now import the data into the Hadoop file system:

    cd /usr/local/hadoop/
    hadoop fs -copyFromLocal /mnt/wex_data/rawd/freebase-wex-2009-01-12-articles.tsv wex-data

If you want, you can now detach and delete the EBS volume.  The articles file is stored in the distributed filesystem across the EC2 instances in your hadoop cluster.  The nice thing is that you can get to this point in less than an hour, meaning that you only have to pay a tiny fraction of the monthly storage cost.

    ec2-detach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf
    ec2-delete-volume vol-aaaaaaaa

I had some trouble detaching volumes until I used the force flag: -f.  Maybe I was just being impatient again.

That's enough for the moment.  I'll tackle python in my next post.