Create a MapReduce Job using Java and Maven
30 Jan 2014

Introduction
In a previous post, I walked through the very basic operations of getting a Maven project up and running so that you can start writing Java applications using this managed environment.
In today’s post, I’ll walk through the modifications required to your POM to get a MapReduce job running on Hadoop 2.2.0.
If you don’t have Maven installed yet, do that . . . maybe even have a bit of a read up on what it is, how it helps and how you can use it. Of course you’ll also need your Hadoop environment up and running!

Project Setup

The first thing you’ll need to do is create a project structure using Maven in your workspace/source folder. I do this using Maven’s archetype plugin.
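A command along these lines should do it — I’m assuming the standard quickstart archetype here, with a com.test.wordcount group ID (which also becomes the default package name used later in this post) and a wordcount artifact ID:

```
mvn archetype:generate -DgroupId=com.test.wordcount -DartifactId=wordcount -DarchetypeArtifactId=maven-archetype-quickstart
```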
As it runs, this command will ask you a few questions about the details of your project. For all of the questions, I’ve found that selecting the default value is sufficient. So . . . enter, enter, enter!
Once the process is complete, you’ll have a project folder created for you. In this example, my project folder is “wordcount” (you can probably see where this tutorial is headed). Change into this folder and take a look at the directory tree.
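Assuming the quickstart archetype from above, you should see roughly the following:

```
wordcount
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── com
    │           └── test
    │               └── wordcount
    │                   └── App.java
    └── test
        └── java
            └── com
                └── test
                    └── wordcount
                        └── AppTest.java
```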
Now it’s time to change the project setup so that it suits our Hadoop application.
Adjusting the POM for Hadoop
There are only a few minor alterations required here. The first is referencing the Hadoop libraries so that they’re available to program against. We’ll also specify the type of packaging for the application and, lastly, bump the language version above the default.
Open up “pom.xml” in your editor of choice and add the following lines into the “dependencies” node.
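For Hadoop 2.2.0, that means:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
</dependency>
```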
This tells the project that we need the “hadoop-client” library (version 2.2.0).
We’re now going to tell Maven to make us an executable JAR. Unfortunately, here’s where the post gets slightly ahead of itself: in order to tell Maven that we want an executable JAR, we need to tell it which class holds our “main” function . . . we haven’t written any code yet, but we will!
Create a “build” node and, within that node, a “plugins” node; then add the jar plugin to it.
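Something like this should do the job (the mainClass points at the com.test.wordcount.WordCount class we’re about to write):

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.test.wordcount.WordCount</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>
```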
More on the maven-jar-plugin can be found on the Maven website but, in short, this block builds an executable JAR for us.
Next, we add the compiler plugin so that compilation uses Java 1.7.
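The standard maven-compiler-plugin configuration handles this:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <source>1.7</source>
        <target>1.7</target>
    </configuration>
</plugin>
```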
That’s all that should be needed now to perform compilation and packaging of our Hadoop application.
The Job
I’ll leave writing Hadoop jobs to another post, but for today we still need some code to make sure our project is working.
All I have done for today is take the WordCount code from the Hadoop Wiki (http://wiki.apache.org/hadoop/WordCount), change the package name to match my project (com.test.wordcount), and save it into src/main/java/com/test/wordcount/WordCount.java. I also removed the template-provided App.java that was in this folder, and made one minor patch to the code. Here’s the full listing I’ve used, for reference.
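It’s essentially the wiki code with the package line changed; the version below also uses Job.getInstance rather than the deprecated Job constructor (the usual tidy-up on Hadoop 2.x), so treat it as a close approximation of the original listing:

```java
package com.test.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as a combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] = input path on HDFS, args[1] = output directory (must not already exist)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```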
Compile, Package & Run!
Our project is set up and our code is in place; it’s now time to compile and package it.
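From the project root, the usual invocation should do the trick:

```
mvn clean package
```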
Lots of downloading of dependencies and a bit of compilation go on . . . If all has gone to plan, you’ll now have a package to run. As usual, you’ll need a text file of words to count; I’ve popped one up on HDFS called “input.txt”.
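With the main class named in the JAR’s manifest, kicking the job off should look something like this — the JAR name assumes the archetype’s default 1.0-SNAPSHOT version, and “output” is just whatever HDFS directory you want the results written to:

```
hadoop jar target/wordcount-1.0-SNAPSHOT.jar input.txt output
```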
You should now have a MapReduce job running!