Tuesday, May 15, 2012

Python mapreduce on EC2

Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce. Now let's get to hello world (or rather, word count) with Python scripts. Here's the mapper:


#!/usr/bin/env python
# mapper2.py
import sys

for line in sys.stdin:
    # normalize case and split on whitespace
    words = line.lower().split()

    #--- output tuples [word, 1] in tab-delimited format ---
    for word in words:
        print('%s\t%s' % (word, "1"))
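As a quick sanity check before pushing anything to the cluster, the mapper's core logic can be exercised directly in Python. The map_line helper below is just for illustration, not part of the actual script:

```python
def map_line(line):
    # same logic as mapper2.py: lowercase, split on whitespace,
    # and emit one (word, "1") pair per token
    return [(word, "1") for word in line.lower().split()]

for word, count in map_line("The quick brown fox jumped over the lazy dog"):
    print('%s\t%s' % (word, count))
```

Note that "the" appears twice in the sample, so two identical pairs come out; aggregation is left entirely to the reducer.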


Here's the reducer script....

#!/usr/bin/env python
# reducer.py

import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the tab-delimited pairs we got from mapper2.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int; skip malformed lines
    try:
        count = int(count)
    except ValueError:
        continue

    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))
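The two scripts can be chained locally to mimic the full job: Hadoop streaming sorts the mapper output by key before the reduce phase, which a plain sort stands in for on one machine. Here's a sketch of that pipeline in pure Python (the helper name is mine, not Hadoop's):

```python
def run_pipeline(text):
    # map phase: emit tab-delimited (word, 1) lines, as mapper2.py does
    mapped = ['%s\t%s' % (word, 1) for word in text.lower().split()]
    # shuffle/sort phase: streaming sorts mapper output by key
    mapped.sort()
    # reduce phase: tally counts per word, as reducer.py does
    counts = {}
    for line in mapped:
        word, count = line.split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

print(run_pipeline("the quick brown fox the lazy dog"))
```

The same check works with the real scripts via shell pipes, e.g. echo "some text" | python mapper2.py | sort | python reducer.py.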



The command to execute all this in Hadoop is a bit of a monster, mainly because of all the file paths. Note the use of the -file parameter, which tells Hadoop to ship the local scripts named in the -mapper and -reducer arguments out to the task nodes. Also, I set mapred.output.compress to false via -jobconf, because I didn't have a handy LZO decompressor installed.

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py -reducer reducer.py \
    -jobconf mapred.output.compress=false
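Once the job finishes, the counts land in part files under output/run9 in HDFS. Something like the following (exact paths and part-file names will vary by job) pulls the top words back out:

```shell
# cat the reducer output out of HDFS and sort by count, descending
bin/hadoop fs -cat output/run9/part-00000 | sort -t$'\t' -k2,2nr | head
```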


NB: As I dug into this task, I discovered several pretty good python/hadoop-streaming tutorials online.  The scripts here were modified from:  http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program


Other sources:

    http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/
    http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
    http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
    http://www.princesspolymath.com/princess_polymath/?p=137
    http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html
    http://wiki.apache.org/hadoop/AmazonS3
    http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
    http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
    http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
