A Preliminary Look at Hadoop Streaming


1. Principle

Hadoop Streaming is a programming tool provided by Hadoop. It lets users run any executable file or script as the Mapper and Reducer; for example, plain shell commands can serve as the Mapper and Reducer (cat as the Mapper, wc as the Reducer):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer wc
The mapper and reducer read the user's data from standard input, process it line by line, and write the results to standard output. The Streaming tool creates a MapReduce job, submits it to the TaskTrackers, and monitors the progress of the whole job.

If a file (an executable or a script) is used as the mapper, then when the mapper is initialized, each mapper task launches that file as a separate process. While the mapper task runs, it splits its input into lines and feeds each line to the standard input of the process; at the same time, it collects the standard output of the process and converts each line it receives into a key/value pair, which becomes the mapper's output. By default, the part of a line up to the first TAB is the key and the part after it (excluding the TAB) is the value; if a line contains no TAB, the whole line becomes the key and the value is empty. This behavior can be customized, as described in the section on extended syntax below.

The reducer works in the same way.

The above is the basic communication protocol between the Map/Reduce framework and a streaming mapper/reducer.
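As a plain illustration of this default rule (a sketch, not Hadoop's actual implementation), the split applied to each line of mapper output can be expressed in a few lines of Python, in the same Python 2 style as the examples later in this article:

def split_streaming_line(line, sep='\t'):
    # key = everything before the first TAB;
    # value = everything after it (the TAB itself is dropped)
    if sep in line:
        key, value = line.split(sep, 1)
    else:
        # no TAB: the whole line is the key, the value is empty
        key, value = line, ''
    return key, value

print split_streaming_line('apple\t1')     # ('apple', '1')
print split_streaming_line('no-tab-here')  # ('no-tab-here', '')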

2. Syntax

1. Basic syntax

Usage: $HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]
Options:
(1) -input: input file path
(2) -output: output file path
(3) -mapper: the user's own mapper program; may be an executable file or a script
(4) -reducer: the user's own reducer program; may be an executable file or a script
(5) -file: package a file and submit it with the job; this can be a file the mapper or reducer needs at run time, such as a configuration file or a dictionary
(6) -partitioner: user-defined partitioner program
(7) -combiner: user-defined combiner program (must be implemented in Java)
(8) -D: set job properties (formerly -jobconf), in particular:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: separator for map task input/output data (default: \t)
4) stream.num.map.output.key.fields: number of fields of a map task output record that make up the key
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: separator for reduce task input/output data (default: \t)
6) stream.num.reduce.output.key.fields: number of fields of a reduce task output record that make up the key
Sometimes you only need the input data processed by the map function. In that case, simply set mapred.reduce.tasks to zero: the Map/Reduce framework will not create any reducer tasks, and the mapper output becomes the final output of the job. For backward compatibility, Hadoop Streaming also supports the "-reducer NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".
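For example, a map-only version of the earlier cat job could be submitted as follows (a sketch reusing the same placeholder paths as above):

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer NONE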
2. Extended syntax
As mentioned earlier, when the Map/Reduce framework reads a line of the mapper's output, it splits the line into a key/value pair. By default, the part of the line before the first TAB character is the key, and the part after it (excluding the TAB) is the value.
However, the user can customize this: a different character can be specified as the separator, and the split can be made at the nth (n >= 1) occurrence of the separator instead of the first. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=4
In the example above, "-jobconf stream.map.output.field.separator=." specifies "." as the separator of the map output: the part of each line up to the fourth "." is the key, and the rest (excluding the fourth ".") is the value. For example, the line 1.2.3.4.5.6 yields key 1.2.3.4 and value 5.6. If a line contains fewer than four "."s, the whole line is the key and the value is an empty Text object (as created by new Text("")).
By the same token, the user can use "-jobconf stream.reduce.output.field.separator=SEP" and "-jobconf stream.num.reduce.output.fields=NUM" to specify, for reduce output lines, the separator whose nth occurrence divides the key from the value.
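By analogy with the map-side example above, a reduce-side sketch (same placeholder paths) might look like this:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -jobconf stream.reduce.output.field.separator=. \
    -jobconf stream.num.reduce.output.fields=4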

3. Examples

To show how Hadoop Streaming is used from different languages, the examples below implement WordCount, a job whose sole function is to count how often each word occurs in the user's input data.
1. Shell

# vi mapper.sh

#!/bin/bash
while read LINE; do
    for word in $LINE
    do
        echo "$word 1"
    done
done
--------------------------------------------------------------------------------
# vi reducer.sh

#!/bin/bash
count=0
started=0
word=""
while read LINE; do
    newword=`echo $LINE | cut -d ' ' -f 1`
    if [ "$word" != "$newword" ]; then
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$newword
        count=1
        started=1
    else
        count=$(($count + 1))
    fi
done
echo -e "$word\t$count"
--------------------------------------------------------------------------------
Local test: cat input.txt | sh mapper.sh | sort | sh reducer.sh
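For instance, with a hypothetical input.txt (the file name is only a placeholder), the local pipeline behaves like this:

$ echo "hello world hello world hello" > input.txt
$ cat input.txt | sh mapper.sh | sort | sh reducer.sh
hello	3
world	2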
Cluster test:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.sh \
    -reducer reducer.sh
If running the job produces an error like "Caused by: java.io.IOException: Cannot run program \"/user/hadoop/Mapper\": error=2, No such file or directory", the executable cannot be found. In that case, use the -file option when submitting the job to specify the program files, e.g. "-file mapper.sh -file reducer.sh"; Hadoop will then automatically distribute these two files to every node. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.sh \
    -reducer reducer.sh \
    -file mapper.sh \
    -file reducer.sh
2. Python

# vi mapper.py

#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words, filtering out any empty strings
    words = filter(lambda word: word, line.split())
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
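The reducer.py listing was lost to encoding damage in this copy of the article. Below is a minimal reconstruction in the same Python 2 style as mapper.py above: a sketch that sums the per-word counts the mapper emits (the names mirror mapper.py; everything else is an assumption).

# vi reducer.py

#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (the sorted output of mapper.py)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the tab-delimited key/value pair produced by mapper.py
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # ignore lines that are not of the form "word<TAB>count"
        continue
    word2count[word] = word2count.get(word, 0) + count

# write the aggregated counts to STDOUT
for word, count in word2count.items():
    print '%s\t%s' % (word, count)

As with the shell version, this pair can be tested locally with: cat input.txt | python mapper.py | sort | python reducer.py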