In today’s post, I’m going to refresh the information in some previous articles that I have written and bring them up to date.
Job, Mapper and Reducer
It’s pretty easy to bundle your job, mapper and reducer together. If they’re small enough, it makes sense to do so.
Not much has changed here. Reading the type annotations can be a little hairy, but you can always look up the documentation for Mapper and for Reducer. The type definitions are uniform:
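As a rough illustration, here is a minimal word-count sketch with the mapper and reducer bundled as nested classes. The names WordCountParts, TokenMapper and SumReducer are made up for this example. Both base classes take the same four type parameters, in the order input key, input value, output key, output value.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical container class; the names are illustrative only.
public class WordCountParts {

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (token, 1) for every whitespace-separated token in the line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    // The reducer's input types must match the mapper's output types.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up the counts emitted for this token.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```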
ToolRunner
The ToolRunner class simplifies the execution management of a MapReduce job through the Tool interface. Tool is a very simple interface that only gives implementing classes a contract to run. The run method looks like this:
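To show the signature in context, here is a sketch of the smallest possible implementing class. NoOpTool is a made-up name. Extending Configured is a common way to pick up the configuration accessors that Tool also requires.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

// Hypothetical do-nothing tool, only here to show the shape of run.
public class NoOpTool extends Configured implements Tool {

    // The one method Tool asks for: it receives the command-line
    // arguments and returns an exit code (0 conventionally means success).
    @Override
    public int run(String[] args) throws Exception {
        return 0;
    }
}
```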
The args are supplied as usual from the main method of the job. A typical implementation of the run method will retrieve configuration information, set up the job and execute it.
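Those three steps might look something like the sketch below. ExampleJob is a made-up name, and the two-argument input/output convention is an assumption for the example. No mapper or reducer is set, so Hadoop falls back to the identity Mapper and Reducer; a real job would call setMapperClass and setReducerClass with its own classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

// Hypothetical job class; names and argument layout are assumptions.
public class ExampleJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: ExampleJob <input path> <output path>");
            return 2;
        }

        // 1. Retrieve the configuration that was handed to this tool.
        Configuration conf = getConf();

        // 2. Set up the job.
        Job job = Job.getInstance(conf, "example job");
        job.setJarByClass(ExampleJob.class);
        // With the default text input and identity mapper/reducer, the
        // output is (byte offset, line) pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 3. Execute, and translate success or failure into an exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
```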
A special part of the magic here is wrapped up in the GenericOptionsParser, which takes in the standard set of command-line parameters (such as -D property=value, -conf, -files and -libjars) and plumbs them directly into the job’s configuration.
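One way to see this plumbing at work is to read a property back inside run. In the sketch below, ShowOption and the property name example.greeting are both invented for illustration. When the tool is started through ToolRunner with -D example.greeting=hi, that value should be visible through getConf(), and args should contain only the arguments left over after the generic options are removed.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical tool that only inspects its configuration.
public class ShowOption extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // "example.greeting" is a made-up property; "hello" is the fallback
        // used when no -D example.greeting=... was supplied.
        String greeting = getConf().get("example.greeting", "hello");
        System.out.println(greeting + " (remaining args: " + args.length + ")");
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options before calling run.
        System.exit(ToolRunner.run(new ShowOption(), args));
    }
}
```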
Finishing up
So this wrapper around the run method provides a couple of features for you: the configuration is handled and the generic command-line options are parsed before your code runs. Your main method ends up very simple:
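A minimal sketch of such a main method follows. ExampleDriver is a made-up name and its run body is left as a stub; the point is only the hand-off to ToolRunner.run and passing its return value on as the process exit code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver class; the job setup itself is omitted.
public class ExampleDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Job setup would go here; getConf() already reflects any
        // generic options given on the command line.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options, injects the configuration
        // into the tool and then calls run with the remaining arguments.
        int exitCode = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
        System.exit(exitCode);
    }
}
```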
Your job is then invoked from the command line using the hadoop command:
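An invocation might look like the following. The jar name, class name and paths are placeholders rather than anything from a real project. Generic options such as -D go before the job’s own arguments.

```shell
# example-job.jar, ExampleJob and both paths are placeholders for your own names.
# -D mapreduce.job.reduces=2 is a generic option that ends up in the job's configuration.
hadoop jar example-job.jar ExampleJob -D mapreduce.job.reduces=2 /path/to/input /path/to/output
```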