Tuesday, May 15, 2012

Python mapreduce on EC2

Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce. Now let's get to hello world (or rather, countWords) with Python scripts.

Here's the mapper script:


#!/usr/bin/env python
# mapper2.py
import sys

for line in sys.stdin:
    line = line.lower()
    words = line.split()

    # --- output tuples [word, 1] in tab-delimited format ---
    for word in words:
        print('%s\t%s' % (word, "1"))
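Before shipping anything to the cluster, it's worth sanity-checking the mapper logic locally. Here's a minimal sketch that mirrors the mapper on an in-memory stream (the `map_words` helper is my own naming for this check, not part of the job's scripts):

```python
import io

def map_words(stream):
    # mirror mapper2.py: lowercase, split on whitespace,
    # emit tab-delimited (word, 1) pairs
    for line in stream:
        for word in line.lower().split():
            yield "%s\t%s" % (word, "1")

# feed it a fake stdin and print what the mapper would emit
sample = io.StringIO("Hello hello world\n")
for pair in map_words(sample):
    print(pair)
```

The same check works straight from a shell with something like `echo "Hello hello world" | ./mapper2.py`.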


Here's the reducer script:

#!/usr/bin/env python
# reducer.py

import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper2.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int; skip malformed lines
    try:
        count = int(count)
    except ValueError:
        continue

    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word, count in word2count.items():
    print('%s\t%s' % (word, count))
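The whole streaming pipeline (mapper, Hadoop's sort-by-key shuffle, then reducer) can be simulated in a few lines of plain Python. This is just a local sanity check under my own naming (`run_pipeline` is not part of the job):

```python
def run_pipeline(text):
    # mapper phase: emit tab-delimited (word, 1) pairs
    pairs = ["%s\t1" % w for line in text.splitlines() for w in line.lower().split()]
    # Hadoop streaming sorts mapper output by key before the reducer sees it
    pairs.sort()
    # reducer phase: accumulate counts per word
    counts = {}
    for pair in pairs:
        word, count = pair.split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

print(run_pipeline("the quick brown fox jumps over the lazy dog the"))
```

From a shell, the equivalent smoke test is `cat some_input.txt | ./mapper2.py | sort | ./reducer.py`, with `sort` standing in for the shuffle.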



The command to execute all this in Hadoop is a bit of a monster, mainly because of all the file paths. Note the -file parameter, which tells Hadoop to ship the named scripts out to the cluster so they're available to the -mapper and -reducer arguments. Also, I set mapred.output.compress to false via -jobconf, because I didn't have a handy LZO decompressor installed.

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data \
    -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=false


NB: As I dug into this task, I discovered several pretty good Python/Hadoop-streaming tutorials online. The scripts here were adapted from: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program


Other sources:

    http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/
    http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
    http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
    http://www.princesspolymath.com/princess_polymath/?p=137
    http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html
    http://wiki.apache.org/hadoop/AmazonS3
    http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
    http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
    http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
