Hadoop Streaming with Python
21 Nov 2015

Hadoop provides a rich Java API for developing and running MapReduce jobs, however that isn't everybody's preference. Hadoop Streaming makes it possible to run MapReduce jobs with any language that can access the standard streams STDIN and STDOUT.
Hadoop Streaming provides the plumbing required to run a full MapReduce job on your cluster, so that all you need to supply is a mapper and a reducer that read their input from STDIN and write their output to STDOUT.
In today's example, we'll re-implement the classic word count example in Python using streaming.
The mapper
In this case, the mapper's job is to take a line of input text and break it into words. For each word, we then write out the word along with the number 1 to denote that we've counted one occurrence of it.
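A minimal sketch of such a mapper might look like this (mapper.py is just an assumed filename for this walkthrough):

```python
#!/usr/bin/env python
# mapper.py - reads lines of text from STDIN and emits "word<TAB>1" for every word

import sys

for line in sys.stdin:
    # break the line into words, ignoring leading/trailing whitespace
    for word in line.strip().split():
        # emit the word and a count of 1, tab-separated
        print('%s\t%d' % (word, 1))
```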
The reducer
The reducer's job is to process the output of the map function, perform some aggregate operation over that set, and produce an output set from this information. In this example, it takes each word together with its stream of 1's and accumulates them to form a word count.
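A matching reducer sketch, relying on the fact that the mapper output arrives sorted by key (reducer.py is again an assumed filename):

```python
#!/usr/bin/env python
# reducer.py - reads "word<TAB>count" pairs from STDIN (sorted by word)
# and accumulates a total count for each word

import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        # a new word has started; emit the total for the previous one
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count

# emit the final word's total
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
```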
input | map | sort | reduce
Before we go full scale with this job, we can simulate the work that the Hadoop cluster would do for us by testing it out with our shell and pipe redirection. This is not a scalable solution, so make sure you're only giving it a small set of data. We can really treat this process as a plain shell pipeline.
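A rough simulation of that pipeline, assuming the input text is saved as zen10.txt in the current directory and both scripts carry a shebang line:

```sh
chmod +x mapper.py reducer.py
cat zen10.txt | ./mapper.py | sort | ./reducer.py
```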
A plain-text copy of The Zen and the Art of the Internet should do just fine.
We can now submit this job to the Hadoop cluster. Remember, we need access to our source data, mapper and reducer on the namenode where we'll submit this job.
Submitting your job on the cluster
First, we need to get our input data in an accessible spot on the cluster.
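Something along these lines should work, assuming the zen10.txt file from the local test and a /user/hadoop home directory in HDFS:

```sh
hdfs dfs -put zen10.txt /user/hadoop/zen10.txt
```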
Make sure it’s there:
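A quick listing of the directory (same assumed path as above) confirms it:

```sh
hdfs dfs -ls /user/hadoop
```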
Now, we can run the job.
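A sketch of the streaming invocation; the path to the streaming jar varies between distributions and versions, so adjust it to wherever it lives on your cluster:

```sh
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /user/hadoop/zen10.txt \
  -output /user/hadoop/zen10-output
```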
The -mapper and -reducer switches refer to files on the actual Linux node, whereas -input and -output refer to HDFS locations.
Results
The results are now available for you in /user/hadoop/zen10-output.
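One way to pull them back is to cat the part files that the job wrote (part-* is the default output naming):

```sh
hdfs dfs -cat /user/hadoop/zen10-output/part-*
```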
You should see the results start spraying down the page.
Limitation
So far, the only limitation I've come across with this method of creating MapReduce jobs is that the mapper only works line by line. You can't treat a single record as information spanning multiple lines. Having records span multiple lines in your data file should be a rare use case, though.