Parameter | Required/Optional | Description
-input directory/file-name | Required | Input location for the mapper.
-output directory-name | Required | Output location for the reducer.
-mapper executable or script or JavaClassName | Required | Mapper executable.
-reducer executable or script or JavaClassName | Required | Reducer executable.
-file file-name | Optional | Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName | Optional | The class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName | Optional | The class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output.
-cmdenv name=value | Optional | Passes the environment variable to streaming commands.
-inputreader | Optional | For backwards compatibility: specifies a record reader class (instead of an input format class).
-verbose | Optional | Verbose output.
-lazyOutput | Optional | Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks | Optional | Specifies the number of reducers.
-mapdebug | Optional | Script to call when a map task fails.
-reducedebug | Optional | Script to call when a reduce task fails.
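As a quick illustration of how several of these parameters fit together, a streaming invocation might look like the sketch below (the directory names, script names, and jar path are placeholders rather than values from a real cluster):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py \
    -numReduceTasks 2 \
    -cmdenv MY_ENV_VAR=some_value

Here -file ships the two scripts to the compute nodes, -numReduceTasks fixes the number of reducers at two, and -cmdenv exports an environment variable to the streaming processes.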
Monday, 5 September 2016
Important Commands in Hadoop Streaming
Sunday, 4 September 2016
How Streaming Works in HADOOP
In the word-count example (see the posts below), both the mapper and the reducer are Python scripts that read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized to suit one's needs.
When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
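This stdin/stdout contract is easy to simulate locally before submitting anything to a cluster. A rough sketch, assuming the mapper.py and reducer.py from the word-count example in the posts below, with sort standing in for the framework's shuffle-and-sort step:

$ echo "foo foo bar" | /home/expert/hadoop-1.2.1/mapper.py | sort -k1,1 | /home/expert/hadoop-1.2.1/reducer.py

which should print the tab-separated pairs bar 1 and foo 2.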
Saturday, 3 September 2016
Execution of WordCount Program
$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input_dirs \
    -output output_dir \
    -mapper <path>/mapper.py \
    -reducer <path>/reducer.py
Here "\" is used for line continuation for clear readability.
For example:
$ ./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input myinput \
    -output myoutput \
    -mapper /home/expert/hadoop-1.2.1/mapper.py \
    -reducer /home/expert/hadoop-1.2.1/reducer.py
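Once the job finishes, the results sit in the HDFS directory passed to -output; a quick way to inspect them (using the myoutput directory from the example above; the exact part-file name may differ) is:

$ $HADOOP_HOME/bin/hadoop fs -ls myoutput
$ $HADOOP_HOME/bin/hadoop fs -cat myoutput/part-00000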
Friday, 2 September 2016
Hadoop streaming - Example Using Python
For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: a mapper and a reducer. We have written the mapper and the reducer as Python scripts so they can run under Hadoop. The same can also be written in Perl or Ruby.
Mapper Phase Code
#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
    # Remove whitespace either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate the words list
    for myword in words:
        # Write the results to standard output
        print '%s\t%s' % (myword, 1)

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
Reducer Phase Code
#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
    # Remove whitespace either side
    myline = myline.strip()
    # Split the input we got from mapper.py
    word, count = myline.split('\t', 1)
    # Convert count variable to integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation-sensitive, the same code can also be downloaded from the link below.
Thursday, 1 September 2016
What is Hadoop Streaming?
Hadoop streaming is a utility that comes with the Hadoop distribution.
This utility allows you to create and run Map/Reduce jobs with any
executable or script as the mapper and/or the reducer.
Thursday, 4 August 2016
To kill the MapReduce job
$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004
How to see the history of a MapReduce job output-dir
$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output
How to see the status of a MapReduce job
$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004
How to Interact with MapReduce Jobs
Usage: hadoop job [GENERIC_OPTIONS]
The following are the Generic Options available in a Hadoop job.
GENERIC_OPTIONS | Description
-submit <job-file> | Submits the job.
-status <job-id> | Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <counter-name> | Prints the counter value.
-kill <job-id> | Kills the job.
-events <job-id> <from-event-#> <#-of-events> | Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> | Prints job details, failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.
-list [all] | Displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> | Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> | Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> | Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
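For instance, listing unfinished jobs and then raising the priority of one of them (reusing the sample job id shown elsewhere in this post) could look like:

$ $HADOOP_HOME/bin/hadoop job -list
$ $HADOOP_HOME/bin/hadoop job -set-priority job_201310191043_0004 HIGH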
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] COMMAND
The following table lists the options available and their description.
Options | Description
namenode -format | Formats the DFS filesystem.
secondarynamenode | Runs the DFS secondary namenode.
namenode | Runs the DFS namenode.
datanode | Runs a DFS datanode.
dfsadmin | Runs a DFS admin client.
mradmin | Runs a Map-Reduce admin client.
fsck | Runs a DFS filesystem checking utility.
fs | Runs a generic filesystem user client.
balancer | Runs a cluster balancing utility.
oiv | Applies the offline fsimage viewer to an fsimage.
fetchdt | Fetches a delegation token from the NameNode.
jobtracker | Runs the MapReduce JobTracker node.
pipes | Runs a Pipes job.
tasktracker | Runs a MapReduce TaskTracker node.
historyserver | Runs the job history server as a standalone daemon.
job | Manipulates MapReduce jobs.
queue | Gets information regarding JobQueues.
version | Prints the version.
jar <jar> | Runs a jar file.
distcp <srcurl> <desturl> | Copies files or directories recursively.
distcp2 <srcurl> <desturl> | DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> | Creates a Hadoop archive.
classpath | Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog | Gets/sets the log level for each daemon.
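A few of these commands can be run directly to confirm that an installation is healthy; for example (the HDFS path here is only a placeholder):

$ $HADOOP_HOME/bin/hadoop version
$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop fsck /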