Pig is a data mining and analysis language that you can use to reason about large data sets. It works with Hadoop’s MapReduce framework to crunch those data sets at scale.
First of all, we need our source data. For the purposes of this article, I have a very simple data set consisting of 4 fields all sitting in a CSV format file. The file people.csv then looks like this:
id,firstname,lastname,age
1,John,Smith,25
2,Mary,Brown,27
3,Paul,Green,21
4,Sally,Taylor,30
I’ll assume that your source file is sitting on a node in your Hadoop cluster that has access to HDFS. We now create an area in HDFS to hold our input data and upload the source file with the following:
# make a directory under the /user/hadoop folder
# to hold our data, called "demo"
$ bin/hadoop fs -mkdir -p /user/hadoop/demo

# perform the upload into the "demo" folder
$ bin/hadoop fs -put /src/people.csv /user/hadoop/demo
We now confirm that the data is actually sitting there, using the familiar ls command:
$ bin/hadoop fs -ls /user/hadoop/demo
HDFS should respond, showing people.csv in place.
Now that our source data has been deployed to HDFS and is available to us, we can fire up Pig. There are two modes that you can run Pig in:
Local, which operates on local data and doesn’t submit MapReduce jobs to the cluster
MapReduce, which uses the cluster to perform its work
Local mode is quite handy in testing scenarios where your source data set is small and you’re just looking to test something out quickly. MapReduce mode is for when your data needs to scale to the size of your cluster.
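As an aside (not used for the rest of this walkthrough), local mode is started in much the same way:

# startup Pig in local mode for quick, small tests
$ pig -x local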
# startup Pig so that it's in mapreduce mode
$ pig -x mapreduce
Now that you’ve got Pig started, you’ll be presented with the grunt> prompt. It’s at this prompt that we can enter our queries for processing. The following query will load our data set, extract the first (id) column and pump it into an output set.
grunt> A = load '/user/hadoop/demo/people.csv' using PigStorage(',');
grunt> B = foreach A generate $0 as id;
grunt> store B into 'id.out';
The source data set is loaded into A. B then takes all of the values in the first column, and the store statement writes these to id.out.
Pig will now send your question (or query) off into the compute cluster in the form of a map reduce job. You’ll see the usual fanfare scrolling up the screen from the output of this job submission, and you should be able to follow along on the job control web application for your cluster.
Viewing the result
Once the query has finished processing, you’ll be able to take a look at the result. As this invoked a MapReduce job, you’ll find the familiar _SUCCESS file in your output folder, indicating that your query has run successfully.
$ bin/hadoop fs -ls id.out
You’ll also be given the result in the file part-m-00000.
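If you just want to peek at the result without copying it out of HDFS, cat will print it to the console; given the sample data above, it should simply be the values from the first (id) column, one per line:

$ bin/hadoop fs -cat id.out/part-m-00000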
This is a very simple example of how to run a Pig query on your Hadoop cluster. You can see how these ideas will scale with you as your dataset grows. The example query itself isn’t very complex by any stretch, so now that you know how to execute queries you can read up on Pig Latin to hone your query-writing craft.
SQL Server and PostgreSQL are both relational database systems and as such share similarities that should allow you to migrate your data structures between them. In today’s post, I’m going to go through the process of migrating SQL Server data schema objects over to PostgreSQL.
Immediate differences
Whilst both relational database systems implement a core set of the SQL standard, there are implementation-specific features that need special consideration. So long as you are capable of wrangling text in your favorite editor, the conversion task shouldn’t be that hard.
The batch terminator GO gets replaced by the much more familiar ;.
Tables
The first thing to do for tables is to generate your CREATE scripts. Make sure that you:
Turn off DROP statement generation for your objects
Turn on index and key generation
To safely qualify object names within the database, SQL Server surrounds them with square brackets [], so you’ll see definitions like this:
-- generated from SQL Server
CREATE TABLE [dbo].[Table1] (
    [ID] INT IDENTITY(1,1) NOT NULL,
    [Name] VARCHAR(50) NOT NULL,
    CONSTRAINT [PK_Table1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
PostgreSQL uses double-quotes on object names and doesn’t use the owner (in the above case [dbo]) to qualify names.
In the above example, Table1 is using IDENTITY on its primary key field ID. This gives us the auto-increment functionality that’s so natural in relational database systems. There is a little extra work in PostgreSQL to emulate this behavior through the use of CREATE SEQUENCE and nextval.
The SQL Server definition above now looks like this for PostgreSQL:
-- migrated for PostgreSQL
CREATE SEQUENCE Table1Serial;

CREATE TABLE Table1 (
    ID INT NOT NULL DEFAULT nextval('Table1Serial'),
    Name VARCHAR(50) NOT NULL,
    CONSTRAINT PK_Table1 PRIMARY KEY (ID)
);
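As a quick sanity check (a hypothetical example, assuming the migrated table above is in place), inserting a row without specifying ID should pull the next value from Table1Serial:

-- ID is populated from the sequence via the DEFAULT
INSERT INTO Table1 (Name) VALUES ('John');
SELECT ID, Name FROM Table1;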
Stored Procedures
Stored procedures are much more of a first-class citizen in SQL Server’s world than they are in PostgreSQL. If your database design hinges on extensive use of stored procedures, you’ll be in for a bit of redevelopment.
Both stored procedures and functions are created using the same syntax in PostgreSQL. The actions that either can perform differ though:
                          Stored Procedure    Function
Used in an expression     No                  Yes
Return a value            No                  Yes
Output parameters         Yes                 No
Return result set         Yes                 Yes
Multiple result sets      Yes                 No
A simple function that will square its input value looks as follows:
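The original listing isn’t reproduced here; a minimal PL/pgSQL sketch of such a function (the name square is just an assumption for illustration) might be:

-- squares its input value
CREATE OR REPLACE FUNCTION square(x INTEGER)
RETURNS INTEGER AS $$
BEGIN
    RETURN x * x;
END;
$$ LANGUAGE plpgsql;

-- usage
SELECT square(4);   -- 16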
A more in-depth example involves returning a result set from within a stored procedure. You can do this in an unnamed fashion; you won’t control the name of the cursor coming back.
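A sketch of that pattern, written as a PL/pgSQL function returning a refcursor (the function name and the use of Table1 here are assumptions):

-- returns an unnamed cursor over the Table1 rows
CREATE OR REPLACE FUNCTION get_people()
RETURNS refcursor AS $$
DECLARE
    ref refcursor;   -- left unnamed, so PostgreSQL picks the cursor's name
BEGIN
    OPEN ref FOR SELECT ID, Name FROM Table1;
    RETURN ref;
END;
$$ LANGUAGE plpgsql;

-- usage: the cursor only lives for the enclosing transaction
BEGIN;
SELECT get_people();                  -- returns something like "<unnamed portal 1>"
FETCH ALL IN "<unnamed portal 1>";
COMMIT;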
There are quite a few times where I’ve run a command on a remote machine and needed to get out of that machine but leave my command running.
I’ll normally start a job that I know is going to take a while using an ampersand like so:
$ long-running-prog &
Really though, the nohup command should also be put on the command line, so that the command you execute will ignore the SIGHUP signal sent when your session ends.
$ nohup long-running-prog &
$ exit
If you’re already part-way through a running process, you can get it to continue running in the background (while you make your getaway) by doing the following:
$ long-running-prog
CTRL-Z
$ bg
$ disown pid
You use CTRL-Z to suspend the running process. The bg command then gets the program running in the background. You can confirm that it is running in the background with the jobs command. Lastly, using disown detaches the background process from your terminal, so that when you exit your session the process will continue.
It’s possible to use the Watcom compiler to mix your code with modules compiled (or in this article’s case, assembled) with other tools. In today’s post, I’ll take you through the simple process of creating a module using Borland’s Turbo Assembler and linking it with a simple C program.
Creating a test
The first thing to do is to create an assembly module that we can integrate with. This module will take two numbers, add them together and return the result.
; adder.asm
;
; Assembly module to add two numbers
.386p
.MODEL FLAT
_TEXT SEGMENT DWORD PUBLIC 'CODE'
ASSUME CS:_TEXT
PUBLIC add_numbers
add_numbers PROC NEAR
push ebp
mov ebp, esp
ARG A:DWORD, B:DWORD
mov eax, [A]
mov ecx, [B]
add eax, ecx
mov esp, ebp
pop ebp
ret
add_numbers ENDP
_TEXT ENDS
END
This is a basic module, with most of the stack-balancing work being handled for us by the ARG directive. From the documentation:
ARG is used inside a PROC/ENDP pair to help you link to high level languages. Using ARG, you can access arguments pushed onto the stack.
Also from the documentation:
In the code that follows, you can now refer to PAR1 or PAR2, and the correct [BP + n] expression will be substituted automatically by the assembler.
Of course, we could have just as easily used the following without needing the ARG directive:
mov eax, [ebp + 12]
mov ecx, [ebp + 8]
In accordance with the 32-bit ABI, we put the result in EAX at the end of execution. Producing an object file from this assembly source is relatively easy:
C:\SRC> tasm /mx /zi /os adder.asm adder.obj
Integrating with the module
Now that we’ve got an object file with our function in it, we’ll create a very small, simple C program that will use this function. In order to do so though, we need to declare the function as extern, as it is implemented externally to our C code:
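The listing itself isn’t reproduced here; a minimal sketch of what test.c might contain (the file name test.c comes from the build commands below, everything else is an assumption for illustration):

/* test.c - minimal caller for the assembly module */
#include <stdio.h>

/* implemented externally in adder.asm */
extern int add_numbers(int a, int b);

int main(void)
{
    printf("3 + 4 = %d\n", add_numbers(3, 4));
    return 0;
}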
Because we’re using C, there’s no need to decorate the function prototype of add_numbers any further. Had we been compiling a C++ module, this declaration would need to change slightly to give the function C linkage:
extern"C"{intadd_numbers(inta,intb);}
This module is now ready to be compiled itself and linked to the assembly implementation. We can achieve this with wcc386 and wlink to tie it all together for us.
C:\SRC> wcc386 /d2 /3s test.c
C:\SRC> wlink system dos4g file test,adder name test
From there, we have a linked and ready-to-run executable, test.exe.
Docker provides a very convenient way of packaging your applications and their dependencies so that they can be moved around without too much effort. Another great side-effect of this type of system design is the isolation that you’re given between containers. In today’s post, I’ll walk through the setup of Google Chrome running in an isolated sandbox within Docker, nicely integrated into Ubuntu.
Not starting from zero
I have to admit, most of the hard work had already been done for me in Jessie Frazelle’s post about hosting desktop applications in docker containers. The Chrome Dockerfile that I have hosted in my github repository is a pretty good rip, directly from Jessie’s post.
Getting started
Putting together a run script that you can repeatedly call from the operating system shouldn’t be too hard. It only needs to do three things:
Create a container when one doesn’t exist
Start the container if it already exists
Open a new window if the container is already started
A relatively simple bash script can do all of this; the pieces are sketched below.
First up, we want to check if the container is running. I’ve standardised on calling the container chrome. Really creative. Upon successful return from the docker inspect command, the $CHROME_RUNNING variable will be either true or false. If the inspect call didn’t go to plan, it’s most likely because the container doesn’t exist, and we need to use run to kick it into gear:
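The original script isn’t reproduced here; a sketch of this check and the initial run might look like the following (the image name chrome-image and the X11/DISPLAY flags are assumptions, borrowed from the desktop-in-Docker approach referenced earlier):

#!/usr/bin/env bash
# is there a container called "chrome", and is it running?
CHROME_RUNNING=$(docker inspect --format="{{.State.Running}}" chrome 2>/dev/null)

if [ $? -ne 0 ]; then
  # no such container yet; create it and start the browser
  # (the mounted X11 socket and DISPLAY variable are assumptions)
  docker run -d \
    --name chrome \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -e DISPLAY="$DISPLAY" \
    chrome-image "$@"
fi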
This gets the container up and running and the browser under our noses.
In cases where the container already exists but isn’t running, we’ll use start to bring it back. When the container exists and is running, the only reason someone could be invoking this script is to get another browser window open; so we’ll use exec to get Chrome to open up a new window for us:
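Again only as a sketch (same assumed container and binary names as above):

if [ "$CHROME_RUNNING" = "false" ]; then
  # container exists but isn't running; bring it back up
  docker start chrome
else
  # already running; ask the existing container for a new window,
  # passing through whatever address was handed to this script
  # (google-chrome as the binary name inside the container is an assumption)
  docker exec chrome google-chrome "$@"
fi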
By using the $@ variable in the exec script, we can take in any web address that’s passed into this script. This is what will allow us to integrate this container into our operating system.
Integration
We’ve done just about everything now with the run script. I’ve created myself a menu item with a chrome icon that just points to this run script:
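The screenshot of that menu item isn’t reproduced here; the launcher is essentially just a desktop entry pointing at the run script (the file path, script location and entry name below are assumptions):

# ~/.local/share/applications/chrome-container.desktop (hypothetical)
[Desktop Entry]
Type=Application
Name=Chrome (Docker)
Icon=google-chrome
Exec=/home/user/bin/chrome.sh %u
Categories=Network;WebBrowser;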
There’s a key binding of Super + W to launch it, but the most important piece is changing the preferred browser so that it invokes the script; %s passes the desired web site through for a seamless finish.