A closer look at Hive
05 Dec 2015

In a previous post we went through a fairly simple example of how to get up and running quickly with Apache Hive. In today’s post I’ll take a deeper dive and look a little closer at the different aspects of using it.
Everything that I mention in this article can be found in the language manual on the Apache wiki.
For the examples that are listed in this blog post, I’m using data that can be downloaded from the FAA site.
Databases
Your first job, much the same as with any database system, is to create a database.
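A minimal sketch; the flights database name is just an example for holding the FAA data:

```sql
-- Create a database for the FAA data and switch to it
-- for subsequent statements
CREATE DATABASE flights;
USE flights;
```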
You can also use IF NOT EXISTS and IF EXISTS in your creation and destruction statements to ensure something is or isn’t there.
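For example:

```sql
-- Only create the database when it isn't already there
CREATE DATABASE IF NOT EXISTS flights;

-- Only drop it when it actually exists
DROP DATABASE IF EXISTS flights;
```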
Tables
To create a table that’s managed by the Hive warehouse, we can use the following.
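Here’s a sketch of what that might look like for the FAA airports data; the column names and types are my assumption, so adjust them to match the file you’ve downloaded:

```sql
CREATE TABLE airports (
  iata      STRING,
  airport   STRING,
  city      STRING,
  state     STRING,
  country   STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

Because no location is given, Hive stores this table’s data under its warehouse directory and will delete that data when the table is dropped.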
This table can then be filled with data that is sourced locally:
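For example, assuming the download was saved as a local CSV file (the path here is hypothetical):

```sql
-- Copy the local file into the table's warehouse directory
LOAD DATA LOCAL INPATH '/tmp/airports.csv'
OVERWRITE INTO TABLE airports;
```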
You can also create an external table using the following syntax:
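A sketch with the same assumed columns as above; the HDFS path is illustrative and should be the directory that holds your data file:

```sql
CREATE EXTERNAL TABLE airports_ext (
  iata      STRING,
  airport   STRING,
  city      STRING,
  state     STRING,
  country   STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/airports';
```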
You can see that this has used a file hosted on HDFS as the data source. The idea is that the existing file (that we’d specified in the LOCATION clause) will now be accessible to Hive through this table.
From the wiki:
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
It’s important to note that when you DROP an external table, the underlying data is NOT deleted.
Views
You can provide a more targeted representation of your data to your users by offering them views. Views also allow you to specify aggregate functions as columns. In the following view, we simply retrieve all of the countries in which airports are located, along with the number of airports located in each country.
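A minimal sketch, assuming the airports table defined earlier:

```sql
-- Countries with at least one airport, plus a count per country
CREATE VIEW airports_by_country AS
SELECT country,
       COUNT(*) AS airport_count
FROM airports
GROUP BY country;
```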
Partitions and Buckets
Because you’ll be working with very large data sets, Hive offers you the ability to partition data on columns that you nominate. These partitions can then be broken down even further into buckets.
From the wiki:
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
So this technique does change the way data is physically structured on disk. It tries to structure the data in such a way that it biases towards the performance of the queries that you’re running. Of course, this is up to you, as you need to define which fields to partition and cluster by.
Here’s the airports table, partitioned by country.
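A sketch of the DDL, again with assumed columns; note that the partition column is declared in the PARTITIONED BY clause rather than in the main column list:

```sql
CREATE TABLE airports_partitioned (
  iata      STRING,
  airport   STRING,
  city      STRING,
  state     STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
PARTITIONED BY (country STRING);
```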
When this table gets clustered into buckets, the database developer needs to specify the number of buckets to distribute across. From here, Hive decides which bucket to place each row into using the following formula:
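```
hash_function(bucketing_column) mod num_buckets
```

This is the formula given in the Hive wiki; hash_function depends on the data type of the bucketing column (for an INT column, it’s simply the value of the column itself).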
We then create and fill the bucketed store like so:
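A sketch, once more with assumed names; clustering by state and choosing 4 buckets are arbitrary decisions for illustration, and the two SET statements are needed (on Hive 1.x) so the insert honours the buckets and the dynamic country partition:

```sql
CREATE TABLE airports_bucketed (
  iata      STRING,
  airport   STRING,
  city      STRING,
  state     STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (state) INTO 4 BUCKETS;

-- Make inserts respect the bucket definition, and allow the
-- partition value to be taken from the query itself
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Dynamic partition columns must come last in the SELECT
INSERT OVERWRITE TABLE airports_bucketed
PARTITION (country)
SELECT iata, airport, city, state, latitude, longitude, country
FROM airports;
```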