Using Pig

Pig is a high-level data analysis language that you can use to reason about large data sets. It runs on top of Hadoop’s MapReduce framework, which allows Pig queries to crunch very large datasets across a cluster.

In today’s post, I’ll walk you through:

  • Loading input data into HDFS
  • Writing and executing your Pig query
  • Exploring the output data sets

Getting started

First of all, we need our source data. For the purposes of this article, I have a very simple data set consisting of 4 fields (id, firstname, lastname and age), all sitting in a CSV format file. The file people.csv then looks like this:

1,John,Smith,25
2,Mary,Brown,27
3,Paul,Green,21
4,Sally,Taylor,30

I’ll assume that your source file is sitting on a node in your Hadoop cluster that has access to HDFS. We now create an area in HDFS to store our input data, and then upload the source file into it:

# make a directory under the /user/hadoop folder
# to hold our data, called "demo"
$ bin/hadoop fs -mkdir -p /user/hadoop/demo

# perform the upload into the "demo" folder
$ bin/hadoop fs -put /src/people.csv /user/hadoop/demo

We now confirm that the data is actually sitting there, using the familiar ls command:

$ bin/hadoop fs -ls /user/hadoop/demo

HDFS should respond, showing people.csv in place:

Found 1 items
-rw-r--r--   1 root supergroup         80 2015-11-10 06:27 /user/hadoop/demo/people.csv

Running your queries

Now that our source data has been deployed to HDFS and is available to us, we can fire up Pig. There are two modes that you can run Pig in:

  • Local, which operates on local data and doesn’t submit MapReduce jobs to complete its process
  • MapReduce, which uses the cluster to perform its work

The local mode is quite handy in testing scenarios where your source data set is small and you’re just looking to test something out quickly. The MapReduce mode is for when your processing needs to scale to the size of your cluster.
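If you want to experiment in local mode first, it’s just a different flag at startup:

# startup Pig in local mode; operates on the local
# file system, no cluster required
$ pig -x local

For the rest of this walkthrough, though, we’ll run against the cluster.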

# startup Pig so that it's in mapreduce mode
$ pig -x mapreduce

Now that you’ve got Pig started, you’ll be presented with the grunt> prompt. It’s at this prompt that we can enter our queries for processing. The following query will load our data set, extract the first column (id) and pump it into an output set.

grunt> A = load '/user/hadoop/demo/people.csv' using PigStorage(',');
grunt> B = foreach A generate $0 as id;
grunt> store B into 'id.out';

The source data set is loaded into A. B then takes all of the values in the first column, and the store statement writes them out to id.out.
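As an aside, the load statement also accepts a schema through an as clause, which lets you refer to fields by name rather than by position. Here’s a minimal sketch of the same query written that way (the id_named.out path is just an example name of my own):

grunt> A = load '/user/hadoop/demo/people.csv' using PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int);
grunt> B = foreach A generate id;
grunt> store B into 'id_named.out';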

Pig will now send your query off to the compute cluster in the form of a MapReduce job. You’ll see the usual fanfare scrolling up the screen as the job is submitted, and you should be able to follow along in the job control web application for your cluster.

Viewing the result

Once the query has finished processing, you’ll be able to take a look at the result. As this invoked a MapReduce job, you’ll find the familiar _SUCCESS file in your output folder, showing that your query ran successfully.

$ bin/hadoop fs -ls id.out

You’ll find the result itself in the file part-m-00000.

Found 2 items
-rw-r--r--   1 root supergroup          0 2015-11-10 06:54 id.out/_SUCCESS
-rw-r--r--   1 root supergroup         31 2015-11-10 06:54 id.out/part-m-00000

We can take a look at these results now:

$ bin/hadoop fs -cat id.out/part-m-00000
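If you’d rather pull the result down to the local file system, the getmerge command will concatenate all of the part files in the output folder into a single local file (the /tmp path below is just an example):

# merge the part files under id.out into one local file
$ bin/hadoop fs -getmerge id.out /tmp/id.out.txt
$ cat /tmp/id.out.txt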

This has been a very simple example of how to run a Pig query on your Hadoop cluster, but you can see how these ideas will scale as your dataset grows. The example query itself isn’t very complex by any stretch, so now that you know how to execute queries, you can read up on Pig Latin to hone your query writing craft.
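To give you a taste of where Pig Latin can take you, here’s a small sketch of a query that filters and aggregates the same data set; the alias names (adults, by_lastname, counts) are my own, and dump just prints the result to the console instead of storing it:

grunt> A = load '/user/hadoop/demo/people.csv' using PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int);
grunt> adults = filter A by age >= 25;
grunt> by_lastname = group adults by lastname;
grunt> counts = foreach by_lastname generate group as lastname, COUNT(adults) as total;
grunt> dump counts;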