Tuesday, May 15, 2012

Python mapreduce on EC2

Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce. Now let's get to hello world (or rather, countWords) with Python scripts.

Here's the mapper script:


#!/usr/bin/env python
# mapper2.py
import sys

for line in sys.stdin:
    line = line.lower()
    words = line.split()

    # --- output tuples [word, 1] in tab-delimited format ---
    for word in words:
        print('%s\t%s' % (word, "1"))
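Before shipping anything to the cluster, it's worth sanity-checking the mapper logic locally. Here's a minimal sketch that mirrors the mapper on an in-memory stream (the `map_words` helper is my own naming for this check, not part of the job's scripts):

```python
import io

def map_words(stream):
    # mirror mapper2.py: lowercase, split on whitespace,
    # emit tab-delimited (word, 1) pairs
    for line in stream:
        for word in line.lower().split():
            yield "%s\t%s" % (word, "1")

# feed it a fake stdin and print what the mapper would emit
sample = io.StringIO("Hello hello world\n")
for pair in map_words(sample):
    print(pair)
```

The same check works straight from a shell with something like `echo "Hello hello world" | ./mapper2.py`.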


Here's the reducer script:

#!/usr/bin/env python
# reducer.py

import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper2.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int; skip malformed lines
    try:
        count = int(count)
    except ValueError:
        continue

    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word, count in word2count.items():
    print('%s\t%s' % (word, count))
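The whole streaming pipeline (mapper, Hadoop's sort-by-key shuffle, then reducer) can be simulated in a few lines of plain Python. This is just a local sanity check under my own naming (`run_pipeline` is not part of the job):

```python
def run_pipeline(text):
    # mapper phase: emit tab-delimited (word, 1) pairs
    pairs = ["%s\t1" % w for line in text.splitlines() for w in line.lower().split()]
    # Hadoop streaming sorts mapper output by key before the reducer sees it
    pairs.sort()
    # reducer phase: accumulate counts per word
    counts = {}
    for pair in pairs:
        word, count = pair.split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

print(run_pipeline("the quick brown fox jumps over the lazy dog the"))
```

From a shell, the equivalent smoke test is `cat some_input.txt | ./mapper2.py | sort | ./reducer.py`, with `sort` standing in for the shuffle.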



The command to execute all this in Hadoop is a bit of a monster, mainly because of all the file paths. Note the -file parameter, which tells Hadoop to ship the named scripts out to the cluster so they're available to the -mapper and -reducer arguments. Also, I set mapred.output.compress to false via -jobconf, because I didn't have a handy LZO decompressor installed.

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data \
    -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=false


NB: As I dug into this task, I discovered several pretty good Python/Hadoop-streaming tutorials online. The scripts here were adapted from: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program


Other sources:

    http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/
    http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
    http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
    http://www.princesspolymath.com/princess_polymath/?p=137
    http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html
    http://wiki.apache.org/hadoop/AmazonS3
    http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
    http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
    http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
