
A closer look at Hive

In a previous post we went through a fairly simple example of how to get up and running quickly with Apache Hive. In today’s post I’ll take a deeper dive and look a little closer at the different aspects of using it.

Everything that I mention in this article can be found in the language manual on the Apache wiki.

For the examples that are listed in this blogpost, I’m using data that can be downloaded from the FAA site.

Databases

Your first job, much the same as with any other database system, is to create a database.

hive> CREATE DATABASE first;
OK
Time taken: 0.793 seconds

hive> USE first;
OK
Time taken: 0.037 seconds

You can also use IF NOT EXISTS and IF EXISTS in your creation and destruction statements to guard against something already being there (or not being there at all).

hive> CREATE DATABASE IF NOT EXISTS first;
OK
Time taken: 0.065 seconds

hive> DROP DATABASE IF EXISTS first;
OK
Time taken: 0.26 seconds

Tables

To create a table that’s managed by the Hive warehouse, we can use the following.

hive> CREATE TABLE airports (
    > iata STRING, airport STRING, city STRING, 
    > state STRING, country STRING, 
    > lat DECIMAL, long DECIMAL
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
OK
Time taken: 0.324 seconds

This table can then be filled with data that is sourced locally:

hive> LOAD DATA LOCAL INPATH '/srv/airports.csv' 
    > OVERWRITE INTO TABLE airports;
Loading data to table faa.airports
Table faa.airports stats: [numFiles=1, numRows=0, totalSize=244383, rawDataSize=0]
OK
Time taken: 1.56 seconds
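
If you want to sanity check the load, a quick row count does the trick (just a sketch; the number you get back will depend on the file downloaded from the FAA site):

hive> SELECT COUNT(*) FROM airports;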

You can also create an external table using the following syntax:

hive> CREATE EXTERNAL TABLE carriers ( 
    > code STRING, description STRING
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," 
    > LOCATION '/user/root/carriers';
OK
Time taken: 0.408 seconds

You can see that this uses a file hosted on HDFS as the data source. The idea is that the existing file (the one we specified in the LOCATION clause) is now accessible to Hive through this table.

From the wiki:

The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.

It’s important to note that when you DROP an external table, the underlying data is NOT deleted.
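
You can verify this behaviour yourself. As a rough sketch (reusing the carriers location from above), drop the table and then list the directory from within the Hive shell; the file will still be there:

hive> DROP TABLE carriers;
hive> dfs -ls /user/root/carriers;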

Views

You can provide a more targeted representation of your data to your users by offering them views. Views also allow you to specify aggregate functions as columns. In the following view, we simply retrieve all of the countries in which airports are located, along with the number of airports in each country.

hive> CREATE VIEW airports_per_country_vw
    > AS
    > SELECT country, COUNT(*) AS country_count 
    > FROM airports 
    > GROUP BY country;
OK
Time taken: 0.134 seconds
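
The view is then queried just like any other table. As an example (using the column alias defined above), the countries with the most airports can be pulled out with:

hive> SELECT country, country_count 
    > FROM airports_per_country_vw 
    > ORDER BY country_count DESC 
    > LIMIT 10;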

Partitions and Buckets

Because you’ll be working with very large data sets, Hive offers you the ability to partition data on columns that you nominate. These partitions can then be broken down even further into buckets.

From the wiki:

Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.

So this technique changes the way data is physically structured on disk, laying it out in a way that biases towards the performance of the queries you’re running. Of course, the result is only as good as your choices, as you need to define which fields to partition and cluster by.

Here’s the airports table, partitioned by country.

hive> CREATE EXTERNAL TABLE airport_part_by_country (
    > iata STRING, airport STRING, city STRING, 
    > state STRING, lat DECIMAL, long DECIMAL
    > ) PARTITIONED BY (country STRING) 
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY "," 
    > LOCATION '/user/root/partitioned';
OK
Time taken: 0.128 seconds

When this table is clustered into buckets, the database developer needs to specify the number of buckets to distribute across. From there, Hive decides which bucket a row lands in using the following formula:

target_bucket = hash_value(bucket_column) % bucket_count

We then create and fill the bucketed store like so:

-- create the bucketed store
hive> CREATE EXTERNAL TABLE airports_b (
    > iata string, airport string, city string, 
    > state string, lat decimal, long decimal
    > ) PARTITIONED BY (country string) 
    > CLUSTERED BY (state) INTO 100 BUCKETS;

-- fill the bucketed store
hive> set hive.enforce.bucketing = true;
hive> FROM airports 
    > INSERT OVERWRITE TABLE airports_b 
    > PARTITION (country='USA') 
    > SELECT iata, airport, city, state, lat, long 
    > WHERE country = 'USA';
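
One of the payoffs of bucketing is cheap sampling: Hive can read a subset of the buckets rather than a whole partition. The following is a sketch of sampling a single bucket from the table above (the bucket numbers are just illustrative):

hive> SELECT * FROM airports_b 
    > TABLESAMPLE(BUCKET 1 OUT OF 100 ON state) s;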

nmap Cheatsheet

The following post is a quick guide to getting around the nmap network administration and security tool.

General scanning

Scanning with nmap gives you insight into what a server exposes (from an external user’s perspective). Information about the techniques that nmap uses can be found here.

# scan a host by ip/name
nmap 192.168.0.1
nmap host.example.com

# scan multiple hosts
nmap 192.168.0.1 192.168.0.2 192.168.0.3
nmap 192.168.0.1,2,3

# range scanning
nmap 192.168.0.1-3
nmap 192.168.0.*

# subnet scanning
nmap 192.168.0.0/24

Utilities

Command                            Description
nmap -v -A 192.168.0.1             Turn on OS and version detection
nmap -sA 192.168.0.1               Check for a firewall
nmap -PN 192.168.0.1               Scan a firewall protected host
nmap -6 ::1                        Scan IPv6 address
nmap -sP 192.168.0.1/24            Check for alive hosts
nmap --reason 192.168.0.1          Document the reason for a service discovery
nmap --open 192.168.0.1            Show open ports
nmap --packet-trace 192.168.0.1    Show packet trace (sent/received)
nmap --iflist                      Show host interface and routes
nmap -O 192.168.0.1                Detect remote operating system
nmap -sV 192.168.0.1               Detect remote service/daemon version
nmap -sO 192.168.0.1               Scan for IP protocol
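
These switches compose nicely; for example (just an illustration, and note that OS detection requires root privileges), a single discovery pass over a host might look like:

# version detection, OS detection, open ports only, with reasons
nmap -sV -O --open --reason 192.168.0.1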

Port scans

Command                     Description
nmap -p 80 192.168.0.1      Scan http
nmap -p T:80 192.168.0.1    Scan tcp/http
nmap -p U:53 192.168.0.1    Scan udp/dns

Firewalls

The following commands scan firewalls for weaknesses.

# tcp null scan
nmap -sN 192.168.0.1

# tcp fin scan
nmap -sF 192.168.0.1

# tcp xmas scan
nmap -sX 192.168.0.1

# scan a firewall for packet fragments
nmap -f 192.168.0.1

Spoof

# cloak a scan with decoys
nmap -n -Ddecoy1.example.com,decoy2.example.com 192.168.0.1

# scan with a spoofed mac address
nmap --spoof-mac MAC-ADDRESS-HERE 192.168.0.1

# scan with a random mac address
nmap -v -sT -PN --spoof-mac 0 192.168.0.1

Hadoop job setup

In today’s post, I’m going to refresh the information in some of my previous articles and bring them up to date.

Job, Mapper and Reducer

It’s pretty easy to bundle your job, mapper and reducer together. If they’re small enough, it makes sense to do so.

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;

public class MyJob 
  extends Configured implements Tool {

  public static class MyMapper 
    extends Mapper<LongWritable, Text, Text, Text> {

    private String val;

    @Override 
    public void setup(Context context) {
      /* setup any configs from the command line */
      this.val = context.getConfiguration().get("some.value");
    }    

    @Override
    public void map(LongWritable key, Text value, Context context) 
      throws IOException, InterruptedException {      
        /* data selection and source filtering here */
    }
    
  }
  
  public static class MyReducer 
    extends Reducer<Text, Text, NullWritable, Text> {
    
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) 
      throws IOException, InterruptedException {
        /* data aggregators here */
    }
    
  }
}

There isn’t much that has changed here. Reading the generic type parameters can be a little hairy, but you can always look up the documentation for Mapper and for Reducer. The type definitions are uniform:

class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

ToolRunner

The ToolRunner class simplifies managing the execution of a MapReduce job through the Tool interface. Tool is a very simple interface, providing implementing classes with nothing more than a contract to run. The run method looks like this:

public abstract int run(java.lang.String[] args) 
  throws java.lang.Exception;

args are supplied as usual from the main method of the job. A typical implementation of the run method will retrieve configuration information, set up the job and execute it.

public int run(String[] allArgs) throws Exception {
  
  Job job = Job.getInstance(getConf());

  job.setJarByClass(MyJob.class);

  // basic I/O shape setup 
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  job.setOutputKeyClass(NullWritable.class);
  job.setOutputValueClass(Text.class);

  // map, combine, partition, reduce setup 
  job.setMapperClass(MyMapper.class);
  job.setCombinerClass(MyCombiner.class);
  job.setReducerClass(MyReducer.class);
  job.setNumReduceTasks(1);
  
  // parse options passed to the job      
  String[] args = new GenericOptionsParser(
    getConf(), allArgs
  ).getRemainingArgs();
  
  // set the files (from arguments)
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  
  // wait for the jobs to finish
  boolean status = job.waitForCompletion(true);
  return status ? 0 : 1;
}

A special part of the magic here is wrapped up in the GenericOptionsParser which takes in the standard set of command line parameters and plumbs them directly into the job’s configuration.

Finishing up

So there are a couple of features that are provided for you with this wrapper around the run function. Your main method ends up very simple:

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  int result = ToolRunner.run(conf, new MyJob(), args);
  System.exit(result);
}

Your job is then invoked from the command line using the hadoop command:

$HADOOP_PREFIX/bin/hadoop jar my-jobs-0.0.1-SNAPSHOT.jar \
  org.example.MyJob \
  -D arbitrary.config.value=xyz \
  /user/root/input-file.csv \
  /user/root/output-dir

Make a symbolic link

As a small reminder to myself on creating symbolic links:

ln -s /path/to/the/thing /path/to/the/link
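
For example (hypothetical paths), pointing a stable path at a versioned install and later re-pointing it:

# create the link
ln -s /opt/app-1.2.3 /opt/app

# re-point an existing link (-f to replace it, -n to treat the link itself as the target)
ln -sfn /opt/app-1.2.4 /opt/app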

Encoding information in prime numbers

An interesting part of encryption theory is the ability to encode a message using prime numbers. It’s not the most efficient way to represent a message, but it does exhibit some interesting properties.

Hello

Take the message “HELLO” for instance. Here it is along with the ASCII values for each character.

H  E  L  L  O
72 69 76 76 79 

If we assign each character of our message a prime number (taking the primes in ascending sequence):

2  3  5  7  11
H  E  L  L  O
72 69 76 76 79 

We can encode this message using these prime numbers like so:

(2^72) * (3^69) * (5^76) * (7^76) * (11^79) =

1639531486723067852359816964623169016543137549
4122401687192804219102815235735638642399170444
5066082282398711507312101674742952521828622795
1778467808618104090241918575825850806280956250
0000000000000000000000000000000000000000000000
0000000000000000000000000 

That massive number is our encoded message.

Adjusting the message

You can add a letter to the message, just by multiplying in another value:

 H        E        L        L        O         O
(2^72) * (3^69) * (5^76) * (7^76) * (11^79) * (13^79) 

Conversely, we can remove a character from our message just by dividing the encoded number. To remove the E from our message, we’d divide the encoded message by 3^69.
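
Here’s a small sketch of the whole scheme in Python. It’s my own illustration rather than anything standard, but it shows encoding, decoding by repeated division, and removing a character:

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]    # enough primes for a short message

def encode(message):
    n = 1
    for prime, ch in zip(PRIMES, message):
        n *= prime ** ord(ch)                    # the position's prime raised to the character's ASCII value
    return n

def decode(n):
    chars = []
    for prime in PRIMES:
        if n == 1:
            break                                # nothing left to factor
        exponent = 0
        while n % prime == 0:                    # factor this prime out completely
            n //= prime
            exponent += 1
        if exponent > 0:
            chars.append(chr(exponent))
    return "".join(chars)

encoded = encode("HELLO")
print(decode(encoded))                           # HELLO

# removing the E (second position, prime 3) is just a division
print(decode(encoded // 3 ** ord("E")))          # HLLO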

The guessing game

As there’s no encryption involved with this process (it’s purely encoding), all someone needs to do is factor your encoded number. From there they can recover the ASCII codes and their positions, and read your message.