
Monday 5 September 2016

Important Commands in Hadoop Streaming




Each parameter below is marked Required or Optional, followed by its description.

-input directory/file-name (Required): Input location for mapper.
-output directory-name (Required): Output location for reducer.
-mapper executable or script or JavaClassName (Required): Mapper executable.
-reducer executable or script or JavaClassName (Required): Reducer executable.
-file file-name (Optional): Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName (Optional): Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName (Optional): Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName (Optional): Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName (Optional): Combiner executable for map output.
-cmdenv name=value (Optional): Passes the environment variable to streaming commands.
-inputreader (Optional): For backwards compatibility: specifies a record reader class (instead of an input format class).
-verbose (Optional): Verbose output.
-lazyOutput (Optional): Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks (Optional): Specifies the number of reducers.
-mapdebug (Optional): Script to call when a map task fails.
-reducedebug (Optional): Script to call when a reduce task fails.
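
For example, a run that ships the mapper and reducer scripts with -file, passes an environment variable with -cmdenv, and fixes the number of reducers could look like the sketch below (the jar path, directories, and variable name are placeholders, not values from a real cluster):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myinput \
   -output myoutput \
   -mapper mapper.py \
   -reducer reducer.py \
   -file /home/expert/hadoop-1.2.1/mapper.py \
   -file /home/expert/hadoop-1.2.1/reducer.py \
   -cmdenv MY_VAR=value \
   -numReduceTasks 2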

Sunday 4 September 2016

How Streaming Works in Hadoop

In the word-count example, both the mapper and the reducer are Python scripts that read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this behavior can be customized, as sketched below.
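
The default rule can be pictured with a small Python sketch; this is only an illustration of the splitting described above (Hadoop itself performs this step internally, in Java):

# Illustration only: split a streaming output line into a key/value pair.
def split_line(line):
   line = line.rstrip('\n')
   if '\t' in line:
      # Key is the prefix up to the first tab, value is the rest.
      return tuple(line.split('\t', 1))
   # No tab: the entire line is the key and the value is null.
   return (line, None)

print split_line('hello\t1')          # ('hello', '1')
print split_line('line with no tab')  # ('line with no tab', None)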

When a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.

Saturday 3 September 2016

Execution of WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py
Here "\" is used for line continuation, for readability.

For example:

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myinput \
   -output myoutput \
   -mapper /home/expert/hadoop-1.2.1/mapper.py \
   -reducer /home/expert/hadoop-1.2.1/reducer.py

Friday 2 September 2016

Hadoop streaming - Example Using Python

For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written the mapper and the reducer as Python scripts to run them under Hadoop; the same can also be written in Perl or Ruby.

Mapper Phase Code

#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Break the line into words
   words = myline.split()
   # Iterate the words list
   for myword in words:
      # Write the results to standard output
      print '%s\t%s' % (myword, 1)
Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Split the input we got from mapper.py
   word, count = myline.split('\t', 1)
   # Convert count variable to integer
   try:
      count = int(count)
   except ValueError:
      # Count was not a number, so silently ignore this line
      continue
   if current_word == word:
      current_count += count
   else:
      if current_word:
         # Write result to standard output
         print '%s\t%s' % (current_word, current_count)
      current_count = count
      current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
   print '%s\t%s' % (current_word, current_count)
 
 
Save the mapper and reducer code in mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation sensitive, the same code can be downloaded from the link below.
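
Before submitting to the cluster, the two scripts can be tested locally with an ordinary Unix pipeline, since the reducer only expects its input to be sorted by key (input.txt is just a placeholder file here):

$ cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py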

Thursday 1 September 2016

What is Hadoop Streaming?

Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
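
For instance, even standard Unix tools can act as the mapper and reducer; the sketch below assumes a 1.2.1-style installation, and the jar path and directory names will differ on your cluster:

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myInputDirs \
   -output myOutputDir \
   -mapper /bin/cat \
   -reducer /usr/bin/wc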

Thursday 4 August 2016

How to kill a MapReduce job


$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004 

How to see the history of a MapReduce job output directory

 

$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output 

How to see the status of a MapReduce job


$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004 

How to Interact with MapReduce Jobs

Usage: hadoop job [GENERIC_OPTIONS]
The following are the Generic Options available in a Hadoop job.


-submit <job-file>: Submits the job.
-status <job-id>: Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <countername>: Prints the counter value.
-kill <job-id>: Kills the job.
-events <job-id> <fromevent-#> <#-of-events>: Prints the events' details received by jobtracker for the given range.
-history [all] <jobOutputDir>: Prints job details, and failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.
-list [all]: Displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id>: Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id>: Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority>: Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
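
For example (the job ID below is only a sample):

$ $HADOOP_HOME/bin/hadoop job -list all
$ $HADOOP_HOME/bin/hadoop job -set-priority job_201310191043_0004 HIGH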
 

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] COMMAND

The following lists the available options and their descriptions.

namenode -format: Formats the DFS filesystem.
secondarynamenode: Runs the DFS secondary namenode.
namenode: Runs the DFS namenode.
datanode: Runs a DFS datanode.
dfsadmin: Runs a DFS admin client.
mradmin: Runs a Map-Reduce admin client.
fsck: Runs a DFS filesystem checking utility.
fs: Runs a generic filesystem user client.
balancer: Runs a cluster balancing utility.
oiv: Applies the offline fsimage viewer to an fsimage.
fetchdt: Fetches a delegation token from the NameNode.
jobtracker: Runs the MapReduce job Tracker node.
pipes: Runs a Pipes job.
tasktracker: Runs a MapReduce task Tracker node.
historyserver: Runs job history servers as a standalone daemon.
job: Manipulates the MapReduce jobs.
queue: Gets information regarding JobQueues.
version: Prints the version.
jar <jar>: Runs a jar file.
distcp <srcurl> <desturl>: Copies file or directories recursively.
distcp2 <srcurl> <desturl>: DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest>: Creates a hadoop archive.
classpath: Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog: Get/Set the log level for each daemon.
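
For example, a few of these commands in use (the output depends on your cluster):

$ $HADOOP_HOME/bin/hadoop version
$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop fsck /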