#!/usr/bin/env python
# mapper2.py
import sys

# read lines from stdin, lowercase them, and
# output tuples [word, 1] in tab-delimited format
for line in sys.stdin:
    line = line.lower()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, "1"))
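To see what the mapper emits for a given line of input, here's a quick local sketch (Python 3 syntax; the `map_line` function wrapper is mine, but the logic mirrors mapper2.py above):

```python
def map_line(line):
    # lowercase, split on whitespace, emit tab-delimited (word, 1) pairs
    return ['%s\t1' % word for word in line.lower().split()]

print(map_line("Hello Hadoop hello"))
# → ['hello\t1', 'hadoop\t1', 'hello\t1']
```

Note that duplicate words each get their own (word, 1) pair; summing them is the reducer's job.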
Here's the reducer script:
#!/usr/bin/env python
# reducer.py
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper2.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int,
    # silently discarding malformed lines
    try:
        count = int(count)
    except ValueError:
        continue
    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))
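Before submitting the job, it's worth sanity-checking the map/reduce logic end to end in plain Python. A minimal sketch (Python 3 syntax; the function names are mine, and the `sorted()` call stands in for Hadoop's shuffle-and-sort between the map and reduce phases):

```python
def run_mapper(lines):
    # emulate mapper2.py: lowercase, split, emit "word\t1" per word
    out = []
    for line in lines:
        for word in line.lower().split():
            out.append('%s\t1' % word)
    return out

def run_reducer(pairs):
    # emulate reducer.py: sum the counts per word
    word2count = {}
    for pair in pairs:
        word, count = pair.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count
    return word2count

lines = ["Hello Hadoop", "hello streaming hello"]
shuffled = sorted(run_mapper(lines))  # Hadoop sorts mapper output by key
print(run_reducer(shuffled))
# → {'hadoop': 1, 'hello': 3, 'streaming': 1}
```

The same check works against the real scripts with a shell pipeline of the form `cat input | ./mapper2.py | sort | ./reducer.py`.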
The command to execute all this in Hadoop is a bit of a monster, mainly because of all the file paths. Note the use of the -file parameter, which tells Hadoop to ship the named files with the job so they can be used in the -mapper and -reducer arguments. Also, I set -jobconf mapred.output.compress to false, because I didn't have a handy LZO decompressor installed.
bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data \
    -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=false
NB: As I dug into this task, I discovered several pretty good python/hadoop-streaming tutorials online. The scripts here were modified from: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program
Other sources:
http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
http://www.princesspolymath.com/princess_polymath/?p=137
http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html
http://wiki.apache.org/hadoop/AmazonS3
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html