A few utilities exist to manage the build, dependencies and test running for Java projects. One that I've found quite intuitive (once you wrap your head around the XML structure) is Maven. According to the website, Maven is a "software project management and comprehension tool".
The main benefit I've seen so far is how the developer's work-cycle is managed using the "POM" (Project Object Model). The POM is just an XML file that accompanies your project, describing to Maven the requirements for building, testing and packaging your software.
An excellent, short post can be found on the Maven website called “Maven in 5 minutes”.
Today’s post will focus on Maven installation and getting a “Hello, world” project running.
Installation
I’m on a Debian-flavored Linux distribution, so you may need to translate slightly between package managers. To get Maven installed, issue the following command at the prompt:
sudo apt-get install maven
Check that everything has installed correctly with the following command:
mvn --version
You should see some output not unlike what I’ve got here:
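The exact versions and paths will differ from machine to machine, but the shape of the output is roughly:

Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_AU, charset: UTF-8
OS name: "linux", version: "3.2.0-4-amd64", arch: "amd64", family: "unix"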
If you’re seeing output like what I’ve got above - that’s it. You’re installed now.
First Project
Getting your first application together is pretty easy. The quickest approach is to use the "quickstart" archetype to generate a project structure, like so:
cd ~/Source
mvn archetype:generate -DgroupId=org.temp -DartifactId=hello -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Maven will then go out and grab everything it needs from the web to get your project set up. It has now generated a project structure for you (in a directory called "hello") that looks like this:
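The layout (as reported by tree) should come out looking something like this - the package directories follow the groupId we passed in (org.temp):

hello
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── org
    │           └── temp
    │               └── App.java
    └── test
        └── java
            └── org
                └── temp
                    └── AppTest.java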
In a previous post, I walked through the very basic operations of getting a Maven project up and running so that you can start writing Java applications using this managed environment.
In today’s post, I’ll walk through the modifications required to your POM to get a MapReduce job running on Hadoop 2.2.0.
If you don’t have Maven installed yet, do that . . . maybe even have a bit of a read up on what it is, how it helps and how you can use it. Of course you’ll also need your Hadoop environment up and running!
Project Setup
The first thing you'll need to do is create a project structure using Maven in your workspace/source folder. I do this with the following command:
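It's the same archetype command as in the previous post; the groupId here is inferred from the package name that appears in the directory tree below (com.test.wordcount), so adjust to taste:

cd ~/src
mvn archetype:generate -DgroupId=com.test.wordcount -DartifactId=wordcount -DarchetypeArtifactId=maven-archetype-quickstart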
As it runs, this command will ask you a few questions on the details of your project. For all of the questions, I’ve found selecting the default value was sufficient. So . . . enter enter enter !
Once the process is complete, you’ll have a project folder created for you. In this example, my project folder is “wordcount” (you can probably see where this tutorial is now headed). Changing into this folder and having a look at the directory tree, you should see the following:
~/src/wordcount$ tree
.
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── com
    │           └── test
    │               └── wordcount
    │                   └── App.java
    └── test
        └── java
            └── com
                └── test
                    └── wordcount
                        └── AppTest.java

11 directories, 3 files
Now it’s time to change the project environment so that it’ll suit our Hadoop application target.
Adjusting the POM for Hadoop
There are only a few minor alterations required here. The first is referencing the Hadoop libraries so that they're available to program against. The second is specifying how the application should be packaged. Lastly, we bump the Java language version (to something higher than what's specified by default).
Open up “pom.xml” in your editor of choice and add the following lines into the “dependencies” node.
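Something along these lines will do it (the coordinates are the standard Hadoop client artifact):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>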
This tells the project that we need the “hadoop-client” library (version 2.2.0).
We're now going to tell Maven to make us an executable JAR. Unfortunately, this is where the post gets slightly ahead of itself: in order to tell Maven that we want an executable JAR, we need to tell it which class holds our "main" function . . . we haven't written any code yet - but we will!
Create a “build” node and within that node create a “plugins” node and add the following to it:
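A minimal sketch of that section, assuming the main class ends up being com.test.wordcount.WordCount (we write it below) and that Java 7 is the language level we're bumping to:

<build>
  <plugins>
    <!-- make the packaged JAR executable by recording the main class -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.test.wordcount.WordCount</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
    <!-- lift the Java language level above the archetype default -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>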
That’s all that should be needed now to perform compilation and packaging of our Hadoop application.
The Job
I’ll leave writing Hadoop Jobs to another post, but we still need some code to make sure our project is working (for today).
All I have done for today is take the WordCount code from the Hadoop wiki (http://wiki.apache.org/hadoop/WordCount), change the package name to match my project (com.test.wordcount) and save it into src/main/java/com/test/wordcount/WordCount.java.
I removed the template-provided App.java that was in this folder, and I made one minor patch to the code as well. Here's the full listing I've used, for reference.
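The listing below is essentially the stock "new API" (org.apache.hadoop.mapreduce) WordCount with the package renamed; if your copy differs in small details from mine, anything equivalent will work.

package com.test.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // sums the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // input path and output path are taken from the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}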
Our project is set up and our code is in place; it's now time to compile the project.
$ mvn clean install
Lots of downloading of dependencies and a bit of compilation go on . . . If all has gone to plan, you now have a package to run. As usual, you'll need a text file of words to count. I've popped one up on HDFS called "input.txt".
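If you haven't put the file up yet, something like this will do it (the paths here are just my example):

$ hadoop fs -put input.txt input.txt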
$ hadoop jar target/wordcount-1.0-SNAPSHOT.jar input.txt wcout
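Once the job completes, the counts land in the "wcout" directory we named on the command line; the reducer output normally ends up in a part-r-00000 file:

$ hadoop fs -cat wcout/part-r-00000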
Creating new threads in Haskell is quite easy (once you know how). Here’s a simple snippet for using forkIO and myThreadId to get you started.
module Main where

import Control.Concurrent

main :: IO ()
main = do
  -- grab the parent thread id and print it
  parentId <- myThreadId
  putStrLn (show parentId)

  -- create a new thread (ignore the return)
  _ <- forkIO $ do
    -- grab the child thread id and print it
    childId <- myThreadId
    putStrLn (show childId)

  return ()
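One caveat worth knowing: once main returns, the runtime tears everything down and any other threads are killed, so the child may never get to print its id. A crude but effective fix is to swap the final return () for a short pause (threadDelay also lives in Control.Concurrent):

  -- give the forked thread a moment to run before main exits
  threadDelay 100000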
The great thing about using the current stable version of Debian is the assurance that a lot of testing has gone into making sure the packages you're looking at really are stable. Sometimes this works against us, though - it takes so long for packages to reach stable that the Debian stable repository can be quite stale with its versions.
In today’s post, I’ll show you how you can install a package from a different repository (other than stable) within your stable Debian environment.
At the time of this writing, I’m currently using “Wheezy” (codename for Stable). This makes “Jessie” the codename for Testing and “Sid” the codename for Unstable.
Adding Software Sources
In order to install software from another repository, you need to tell “apt” where to get the software from. Before making any changes, my /etc/apt/sources.list looks like this:
deb http://ftp.au.debian.org/debian/ wheezy main
deb-src http://ftp.au.debian.org/debian/ wheezy main
deb http://security.debian.org/ wheezy/updates main
deb-src http://security.debian.org/ wheezy/updates main
deb http://ftp.au.debian.org/debian/ wheezy-updates main
deb-src http://ftp.au.debian.org/debian/ wheezy-updates main
The man page for “sources.list” will fill you in on the structure of these lines in your sources.list. For the purposes of this post, just take note that each line mentions “wheezy” at the end.
Without any modification, we can use "apt-cache policy" to find out what versions of a particular package are available to us. For the purposes of this post, I'll use "haskell-platform". Taking a look at the cache policy for this package:
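On a stock wheezy box the output will look something like this (your mirror and architecture will differ):

$ apt-cache policy haskell-platform
haskell-platform:
  Installed: (none)
  Candidate: 2012.2.0.0
  Version table:
     2012.2.0.0 0
        500 http://ftp.au.debian.org/debian/ wheezy/main amd64 Packages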
We’ve got version “2012.2.0.0” available to us in the stable repository. With “2013.2.0.0” as the current version, we can see that stable is a little behind. Let’s try and fix that.
We’re going to add some software from the testing repository, so we’re going to link up with the binary source pointed to “jessie”. To do this, we’ll add one extra line to /etc/apt/sources.list, like so:
deb http://ftp.au.debian.org/debian/ wheezy main
deb-src http://ftp.au.debian.org/debian/ wheezy main
deb http://ftp.au.debian.org/debian/ jessie main
deb http://security.debian.org/ wheezy/updates main
deb-src http://security.debian.org/ wheezy/updates main
deb http://ftp.au.debian.org/debian/ wheezy-updates main
deb-src http://ftp.au.debian.org/debian/ wheezy-updates main
Note the third line (new) that mentions “jessie”.
Setting Priorities
Now that we've confused apt by mixing software sources, we need to set some priorities so that the stable repository takes precedence over the testing repository.
To do this, we open/create the file /etc/apt/preferences. In this file, we can list out all of the repositories that we’d like to use and assign a priority to them. Here’s the sample putting a higher priority on stable:
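A minimal example - the priority numbers themselves are just values I've picked; the important part is that stable gets the larger one:

Package: *
Pin: release a=stable
Pin-Priority: 700

Package: *
Pin: release a=testing
Pin-Priority: 650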
The instructions here are defining what packages these rules apply to, which release they apply to and what priority is to be applied. Now that we’ve put these priorities in place, we’ll update our local software cache:
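$ sudo apt-get update

Day-to-day upgrades will keep coming from stable. When you do want the newer package, ask for it from testing explicitly, e.g. sudo apt-get install -t jessie haskell-platform (or -t testing).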
In programming, RAII stands for “Resource Acquisition is Initialization” and it’s an idiom or technique established by Bjarne Stroustrup to ease resource allocation and deallocation in C++.
A common problem is that when an exception is thrown during initialization, memory allocated during construction (or other underlying resources) isn't released, creating memory leaks in applications.
The Idea
The basic premise is that resource allocation is performed in the constructor of your class, and release of the resources occurs in your destructor. The example given on the Wikipedia page deals with holding a lock/mutex for a given file. When execution leaves the scope of the code (whether from premature termination due to an exception or from the code naturally exiting), the destructors run to release the file handle and lock.
The concept is a great way not only to clean up your code (as all of the "if !null" checks become redundant), but it's also a safeguard that you can almost be absent-minded about.
It's important to note that this idiom doesn't allow you to ignore good exception handling practice. You're still expected to use exception handling in your code; RAII just ensures that your cleanup/release code is executed as expected.
An Implementation
Implementing this idea in your own code is really quite simple. If you have a resource (handle) that you're managing manually, wrap it in a class.
Ensure the constructor takes the handle in
Release the handle in the destructor
When working with OpenGL textures, I use a very small class that handles the resource cleanup; it just manages the generated texture id. When the class falls out of scope or there's a failure during initialization, the texture is cleaned up.
class texture {
  public:
    // manage the generated texture id
    texture(const GLuint t) : _reference(t) { }

    // cleanup of the allocated resource
    virtual ~texture(void);

    // provide access to the reference
    const GLuint reference() const { return _reference; }

  private:
    GLuint _reference;
};

texture::~texture(void) {
  // only run if we have something to clean up
  if (this->_reference != 0) {
    // clear out the texture
    glDeleteTextures(1, &this->_reference);
    this->_reference = 0;
  }
}
Strictly speaking, the constructor should probably generate the texture itself; in my case, the texture loading happens in another object that is itself managed. Most importantly, if an exception is thrown during initialization, this class will release anything allocated to it (if it did allocate).
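As a rough sketch of how the wrapper ends up being used (the surrounding function and the glGenTextures call here are illustrative only, not lifted from my actual loader):

#include <GL/gl.h>

void render_once() {
  GLuint id = 0;
  glGenTextures(1, &id);

  // hand ownership to the RAII wrapper; when "tex" goes out of scope
  // (normally, or because an exception unwinds the stack) its
  // destructor calls glDeleteTextures for us
  texture tex(id);

  glBindTexture(GL_TEXTURE_2D, tex.reference());
  // ... upload pixel data, draw, etc ...
}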
It should be mentioned that there are lots of extra attributes we can pile into RAII style classes. There’s a really good write up (in depth) here.
Conclusion
RAII is a great idea to implement in your own classes. The more of this you practice, the more exception-safe your code will become . . . it won't save you from car accidents, though.