I started playing with Hadoop Streaming today because I needed to do the equivalent of the shell script
cat /some/input | cut -f 1 | sort | uniq > /some/output
on an HDFS file.
The basic invocation for a map-only job follows. The general rule of thumb is that if each line of output depends on only a single line of input (as with cut), then you don’t need any reducers, hence the -numReduceTasks 0 option.
$HADOOP_HOME/bin/hadoop jar contrib/streaming/*-streaming.jar -input /some/input -output /some/output -mapper 'cut -f 1' -numReduceTasks 0
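In my case, though, I wanted to uniqify my list. Putting uniq into the mapper chain would cause the job to fail, and in any case each mapper sees only its own slice of the input, so duplicates spread across slices would survive. Instead, I had to drop the -numReduceTasks 0 option and add a reducer: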
$HADOOP_HOME/bin/hadoop jar contrib/streaming/*-streaming.jar -input /some/input -output /some/output -mapper 'cut -f 1' -reducer 'uniq'
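To see why uniq belongs in the reducer rather than the mapper, here is a rough local sketch with no Hadoop involved. The two "splits", the sample names, and the function names are all made up for illustration; they stand in for the chunks of input that separate mappers would receive.

```shell
#!/bin/sh
# Two hypothetical input splits; "alice" appears in both of them.
split1=$(printf 'alice\t1\nbob\t2\nalice\t3\n')
split2=$(printf 'alice\t4\ncarol\t5\n')

# Map-only attempt: each "mapper" dedupes its own split in isolation,
# so a value present in both splits is emitted twice.
map_only() {
    printf '%s\n' "$split1" | cut -f 1 | sort | uniq
    printf '%s\n' "$split2" | cut -f 1 | sort | uniq
}

# Map, then one global sort (standing in for the shuffle), then uniq
# as the reducer: duplicates across splits are now adjacent and collapse.
map_then_reduce() {
    {
        printf '%s\n' "$split1" | cut -f 1
        printf '%s\n' "$split2" | cut -f 1
    } | sort | uniq
}

echo "map-only:"
map_only
echo "map + reduce:"
map_then_reduce
```

The map-only version lists alice twice, once per split, while the map-then-reduce version lists each name once.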
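Note also that I didn’t need to include the sort from my original shell command. That’s because sorting is implicit in the MapReduce process: the framework sorts the map output by key before handing it to the reducer, so uniq sees identical lines next to each other.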
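Since the sort happens between the map and reduce steps, a handy way to dry-run a streaming job on a small sample is to pipe mapper, sort, and reducer together locally. The helper below is only a sketch: simulate_streaming is a hypothetical name, it models a single reducer, and using LC_ALL=C sort to approximate Hadoop's key ordering is my assumption.

```shell
#!/bin/sh
# simulate_streaming MAPPER REDUCER
# Reads sample input on stdin and prints roughly what a single-reducer
# streaming job would emit: mapper output, sorted, then fed to the reducer.
simulate_streaming() {
    mapper=$1
    reducer=$2
    # LC_ALL=C forces a byte-wise sort (an approximation of the shuffle order).
    sh -c "$mapper" | LC_ALL=C sort | sh -c "$reducer"
}

# Same mapper and reducer strings as the -mapper and -reducer options above.
printf 'b\tx\na\ty\nb\tz\n' | simulate_streaming 'cut -f 1' 'uniq'
```

If the local output looks right, the same two command strings can go straight into the -mapper and -reducer options.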
As usual, I’m new to all of this, so if you have any insights, leave a comment.