For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written the mapper and the reducer as Python scripts so they can run under Hadoop; one can also write the same in Perl or Ruby.
Mapper Phase Code
#!/usr/bin/python

import sys

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate over the words list
    for myword in words:
        # Write the results to standard output
        print('%s\t%s' % (myword, 1))

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
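Before running under Hadoop, you can sanity-check the mapper locally by piping a line of text into it. This is just a quick sketch; it assumes mapper.py is in the current directory and executable:

echo "the quick brown fox the" | ./mapper.py

Each word should come out on its own line as a tab-separated (word, 1) pair, e.g. "the	1".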
Reducer Phase Code
#!/usr/bin/python

import sys

current_word = ""
current_count = 0
word = ""

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Split the input we got from mapper.py
    word, count = myline.split('\t', 1)
    # Convert count variable to integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
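Because Hadoop streaming sorts the mapper output by key before handing it to the reducer, you can simulate the whole job locally by inserting a sort between the two scripts. A minimal sketch, assuming both scripts are in the current directory and executable:

echo "hello world hello" | ./mapper.py | sort | ./reducer.py

This should print "hello	2" and "world	1", which is the word count we expect.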
Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory, and make sure both files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation-sensitive, the same code can also be downloaded from the link below.
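To run the job, pass both scripts to the hadoop-streaming jar. The sketch below is illustrative only: the jar path (in Hadoop 1.2.1 the streaming jar normally lives under contrib/streaming) and the HDFS input and output directories are assumptions, so adjust them to your installation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input /user/expert/input \
    -output /user/expert/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

The -file options ship the two scripts to the cluster nodes, and the output directory must not already exist in HDFS.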