Mining data with Hive
21 Nov 2015

Apache Hive is a database analytics technology that can be used to mine structured, well-formatted data. From the website:
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
In today’s post, I’m going to walk through getting up and running with Hive, right through to your first query.
CSV
Probably the easiest place to start is a CSV file. Information in the file has its fields terminated by a comma (,) and its lines by a newline (\n). The example that I’ll use today contains the following data:
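(The original sample file isn’t reproduced here, so the following hypothetical people.csv stands in for it; the file name, columns and values are illustrative assumptions that the rest of the examples below work against.)

```
name,age,country
John,25,Australia
Sarah,32,New Zealand
Dave,41,Australia
Mia,28,England
```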
Before Hive can get its hands on this information, we’ll need to make it available to the cluster by uploading it to HDFS.
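A couple of commands along these lines will do the trick (the HDFS paths here are assumptions; adjust them for your own cluster):

```bash
# create a staging directory in HDFS and copy the local CSV file up to it
hdfs dfs -mkdir -p /user/hive/staging
hdfs dfs -put people.csv /user/hive/staging/
```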
Now we can start up Hive and create the table structure that we’ll be working with.
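A definition along these lines matches the clauses discussed next; the table name and column types are assumptions derived from the sample data above:

```sql
-- a simple table over the delimited text file
CREATE TABLE people (
  name    STRING,
  age     INT,
  country STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
-- skip the header line at the top of each file
TBLPROPERTIES ("skip.header.line.count" = "1");
```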
There’s a fair bit in this data definition. The full documentation on Hive’s DDL can be found here. There are many ways to accomplish the same thing; the example that I’ve listed is very simple.
ROW FORMAT DELIMITED tells Hive to use the default SerDe. We could have specified a regular expression here to interpret a line of the file, or specified our own custom SerDe, but because this format is so standard we only needed a field delimiter, which is denoted by the FIELDS TERMINATED BY clause. There is also a LINES TERMINATED BY clause should you need to specify something other than \n as the line terminator.

STORED AS TEXTFILE is the default; our data is being stored in text files. Finally, TBLPROPERTIES allows arbitrary information to be attached to the table at creation time. We just wanted to tell the table that the first line in the files it’ll encounter should be discarded, as it’s the header line.
Load in the data
Now that we’ve built the table structure, we can put some data in it.
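Assuming the staging path used earlier (again, an assumption), a LOAD DATA statement moves the uploaded file into the table:

```sql
-- move the uploaded file from the staging directory into the people table
LOAD DATA INPATH '/user/hive/staging/people.csv'
OVERWRITE INTO TABLE people;
```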
We’re good to run that first query now!
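A simple scan of the hypothetical people table is enough to prove everything is wired up; a query like this is typically answered by a straight fetch, without a MapReduce job:

```sql
-- take a quick look at the first few rows
SELECT * FROM people LIMIT 10;
```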
Once we involve aggregates, these queries start to get submitted as MapReduce jobs.
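As an illustration (still using the assumed schema), a per-country count is the kind of query that gets planned as a MapReduce job:

```sql
-- count how many people live in each country
SELECT country, COUNT(*) AS people_count
FROM people
GROUP BY country;
```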
Next steps
The examples page on the Hive site has some more complex data definitions, including specifying your own SerDe using Python, as well as processing an Apache web server log.