Using PIG
20 Nov 2015

Pig is a data mining and analysis language that you can use to reason about large data sets. It works with Hadoop’s MapReduce framework, which allows it to crunch very large datasets.
In today’s post, I’ll walk you through:
- Loading input data to HDFS
- Writing and Executing your Pig query
- Exploring output data sets
Get started
First of all, we need our source data. For the purposes of this article, I have a very simple data set consisting of 4 fields, all sitting in a CSV file. The file people.csv looks like this:
id | firstname | lastname | age |
---|---|---|---|
1 | John | Smith | 25 |
2 | Mary | Brown | 27 |
3 | Paul | Green | 21 |
4 | Sally | Taylor | 30 |
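The raw file itself might look something like the following (this assumes comma delimiters and no header row; adjust the load step later on if your file differs):

```
1,John,Smith,25
2,Mary,Brown,27
3,Paul,Green,21
4,Sally,Taylor,30
```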
I’ll assume that your source file is sitting on a node in your Hadoop cluster that has access to HDFS. We now create an area in HDFS to store our input data and upload our source file to it with the following:
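A sketch of those commands, assuming an input area at /user/hadoop/people (substitute whatever path suits your cluster):

```bash
# create an input area in HDFS (the path here is an assumption; use your own)
hdfs dfs -mkdir -p /user/hadoop/people

# upload the local source file into that area
hdfs dfs -put people.csv /user/hadoop/people/
```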
We can now confirm that the data is actually sitting there, using the familiar ls command:
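Something like this, again assuming the /user/hadoop/people path from above:

```bash
hdfs dfs -ls /user/hadoop/people
```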
HDFS should respond, showing people.csv in place.
Running your queries
Now that our source data has been deployed to HDFS and is available to us, we can fire up Pig. There are two modes that you can run Pig in:
- Local, which operates on local data and won’t submit MapReduce jobs to complete its processing
- MapReduce, which uses the cluster to perform its work
The local mode is quite handy in testing scenarios where your source data set is small and you’re just looking to test something out quickly. The mapreduce mode is for when your job needs to scale to the size of your cluster.
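You start the interactive shell with the pig command; a minimal sketch of both invocations is below (the -x flag selects the execution mode, and mapreduce is the default):

```bash
# local mode: runs against the local filesystem, no cluster jobs submitted
pig -x local

# mapreduce mode (the default): submits jobs to the cluster
pig -x mapreduce
```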
Now that you’ve got Pig started, you’ll be presented with the grunt> prompt. It’s at this prompt that we can enter our queries for processing. The following query will load our data set, extract the first (id) column, and pump it into an output set.
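A minimal sketch of that query, assuming the comma-delimited file uploaded above (the path and field names are assumptions; adjust them to match your data):

```
-- load the CSV, naming the four fields
A = LOAD '/user/hadoop/people/people.csv' USING PigStorage(',')
        AS (id:int, firstname:chararray, lastname:chararray, age:int);

-- keep only the id column
B = FOREACH A GENERATE id;

-- write the result out to HDFS
STORE B INTO 'id.out';
```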
The source data set is loaded into A. B then takes all of the values in the first column and writes these to id.out.
Pig will now send your question (or query) off into the compute cluster in the form of a MapReduce job. You’ll see the usual fanfare scrolling up the screen from the output of this job submission, and you should be able to follow along on the job control web application for your cluster.
Viewing the result
Once the query has finished, you’ll be able to take a look at the result. As this has invoked a MapReduce job, you’ll find the familiar _SUCCESS file in your output folder to illustrate that your query has run successfully. You’ll also be given the result in the file part-m-00000.
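Listing the output folder should show both of those files; for example (this assumes id.out was written relative to your HDFS home directory, as in the query above):

```bash
hdfs dfs -ls id.out
```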
We can take a look at these results now:
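For example, with hdfs dfs -cat (the path again assumes the id.out location used above):

```bash
# dump the query result; for our sample data this should print the id column: 1, 2, 3, 4
hdfs dfs -cat id.out/part-m-00000
```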
This is a very simple example of how to run a Pig query on your Hadoop cluster, but you can see how these ideas will scale with you as your dataset grows. The example query itself isn’t very complex by any stretch, so now that you know how to execute queries you can read up on Pig Latin to hone your query-writing craft.