In the above example, both the mapper and the reducer are Python
scripts that read the input from standard input and emit the output to
standard output. The utility will create a Map/Reduce job, submit the
job to an appropriate cluster, and monitor the progress of the job until
it completes.
When a script is specified for mappers, each mapper task will launch
the script as a separate process when the mapper is initialized. As the
mapper task runs, it converts its inputs into lines and feeds the lines
to the standard input (STDIN) of the process. In the meantime, the
mapper collects the line-oriented outputs from the standard output
(STDOUT) of the process and converts each line into a key/value pair,
which is collected as the output of the mapper. By default, the prefix
of a line up to the first tab character is the key and the rest of the
line (excluding the tab character) is the value. If there is no tab
character in the line, then the entire line is treated as the key
and the value is null. However, this can be customized as needed.
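
The mapper side described above can be sketched as follows. This is only an illustration, assuming a hypothetical word-count job: the function names map_lines, split_key_value and run_mapper are made up for this sketch and are not part of the streaming utility. split_key_value merely mimics the default tab-splitting rule that the framework applies to each line the script writes.

```python
import sys


def map_lines(lines):
    """Yield one "word<TAB>1" output line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def split_key_value(line, separator="\t"):
    """Mimic the default rule: the key is the text before the first tab."""
    key, sep, value = line.rstrip("\n").partition(separator)
    # No tab in the line: the whole line is the key and the value is null.
    return (key, value if sep else None)


def run_mapper(stdin, stdout):
    """Read lines from stdin and write the mapped lines to stdout."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")


# To use this file as a mapper script, finish it with:
#     run_mapper(sys.stdin, sys.stdout)
```

Keeping the logic in map_lines, separate from the STDIN/STDOUT plumbing, makes it possible to try the mapper on a plain list of strings before submitting a job.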
When a script is specified for reducers, each reducer task will
launch the script as a separate process when the reducer is
initialized. As the reducer task runs, it converts its input key/value
pairs into lines and feeds the lines to the standard input (STDIN) of
the process. In the meantime, the reducer collects the line-oriented
outputs from the standard output (STDOUT) of the process and converts each
line into a key/value pair, which is collected as the output of the
reducer. By default, the prefix of a line up to the first tab character
is the key and the rest of the line (excluding the tab character) is the
value. However, this can be customized as per specific requirements.
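
A matching reducer can be sketched in the same way. Again this assumes a hypothetical word-count job, and the names parse, reduce_lines and run_reducer are illustrative only. The sketch relies on the reducer's input lines arriving sorted by key, so that all lines for one key are adjacent.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each "key<TAB>value" input line at the first tab."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_lines(lines):
    """Sum the integer values of each key; assumes lines are grouped by key."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_reducer(stdin, stdout):
    """Read sorted key/value lines from stdin and write one total per key."""
    for out in reduce_lines(stdin):
        stdout.write(out + "\n")


# To use this file as a reducer script, finish it with:
#     run_reducer(sys.stdin, sys.stdout)
```

Because groupby only merges adjacent items, the totals are correct only when the input is already grouped by key; on unsorted input the same key would be emitted more than once.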