Cogs and Levers A blog full of technical stuff

Create a UDF for Hive with Scala

In today’s post, I’m going to walk through the basic process of creating a user defined function for Apache Hive using the Scala.

A quick _but important_ note: I needed to use the JDK 1.7 to complete the following. Using 1.8 saw errors that suggested that Hive on my distribution of Hadoop was not supported.

Setup your project

Create an sbt-based project, and start off adding the following to your project/assembly.sbt.

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

What this had added is the sbt-assembly to your project. This allows you to bundle your scala application up as a fat JAR. When we issue the command sbt assemble at the console, we invoke this plugin to construct the fat JAR for us.

Now we fill out the build.sbt. We need to reference an external JAR, called hive-exec. This JAR is available by itself from the maven repository. I took a copy of mine from the hive distribution installed on my server. Anyway, it lands in the project’s lib folder.

name := "hive-udf"
version := "1.0"
scalaVersion := "2.11.1"
unmanagedJars in Compile += file("./lib/hive-exec-2.1.1.jar")

Write your function

Now it’s time to actually start writing some functions. In the following module, we’re just performing some basic string manipulation with trim, toUpperCase and toLowerCase. Each of which is contained in its own class, deriving from the UDF type:

scala/StringFunctions.scala

package me.tuttlem.udf

import org.apache.hadoop.hive.ql.exec.UDF

class TrimString extends UDF {
  def evaluate(str: String): String = {
    str.trim
  }
}

class UpperCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toUpperCase
  }
}

class LowerCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toLowerCase
  }
}

Now that we’ve written all of the code, it’s time to compile and assemble our JAR:

$ sbt assemble

To invoke

Copying across the JAR into an accessible place for hive is the first step here. Once that’s done, we can start up the hive shell and add it to the session:

ADD JAR /path/to/the/jar/my-udfs.jar;

Then, using the CREATE FUNCTION syntax, we can start to reference pieces of our module:

CREATE FUNCTION trim as 'me.tuttlem.udf.TrimString';
CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';

We can now use our functions:

hive> CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
OK
Time taken: 0.537 seconds
hive> SELECT toUpperCase('a test string');
OK
A TEST STRING
Time taken: 1.399 seconds, Fetched: 1 row(s)

hive> CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';
OK
Time taken: 0.028 seconds
hive> SELECT toLowerCase('DON\'T YELL AT ME!!!');
OK
don't yell at me!!!
Time taken: 0.093 seconds, Fetched: 1 row(s)

That’s it!

Creating a Scala SBT project structure

Today’s post is going to be a tip on creating a project structure for your Scala projects that is SBT ready. There’s no real magic to it, just a specific structure that you can easily bundle up into a console application.

The shell script

To kick start your project, you can simple use the following shell script:

#!/bin/zsh
mkdir $1
cd $1

mkdir -p src/{main,test}/{java,resources,scala}
mkdir lib project target

echo 'name := "$1"
version := "1.0"
scalaVersion := "2.10.0"' > build.sbt

cd ..

This will give you everything that you need to get up an running. You’ll now have a structure like the following to work with:

.
├── build.sbt
├── lib
├── project
├── src
│   ├── main
│   │   ├── java
│   │   ├── resources
│   │   └── scala
│   └── test
│       ├── java
│       ├── resources
│       └── scala
└── target

Spellcheck for Sublime

Today’s post is a very quick tutorial on turning on spell check, for Sublime Text.

Looking at the documentation you can add the following:

"spell_check": true,
"dictionary": "Packages/Language - English/en_US.dic"

Easy.

Managing multiple SSH identities

Sometimes, it makes sense to have multiple SSH identites. This can certainly be the case if you’re doing work with your own personal accounts, vs. doing work for your job. You’re not going to want to use your work account for your personal stuff.

In today’s post, I’m going to run through the few steps that you need to take in order to manage multiple SSH identities.

Different identities

First up, we generate two different identities:

ssh-keygen -t rsa -C "user@work.com"

When asked, make sure you give the file a unique name:

Enter file in which to save the key (/home/michael/.ssh/id_rsa): ~/.ssh/id_rsa_work

Now, we create the identity for home.

ssh-keygen -t rsa -C "user@home.com"

Again, set the file name so they don’t collide:

Enter file in which to save the key (/home/michael/.ssh/id_rsa): ~/.ssh/id_rsa_home

Now, we should have the following:

id_rsa_home
id_rsa_home.pub
id_rsa_work
id_rsa_work.pub

Configuration

Now we create a configuration file that ties all of the identities up. Start editing ~/.ssh/config:

# Home account
Host home-server.com
  HostName home-server.com
  PreferredAuthentications publickey
  IdentityFile ~/.ssh/id_rsa_home

# Company account
Host work-server.com
  HostName work-server.com
  PreferredAuthentications publickey
  IdentityFile ~/.ssh/id_rsa_work

Delete all of the cached keys:

ssh-add -D

If you see the error message Could not open a connection to your authentication agent. you’ll need to run the following:

ssh-agent -s

Add your keys

You can now list your keys with:

ssh-add -l

You can add your keys back in with the following:

ssh-add ~/.ssh/id_rsa_work
ssh-add ~/.ssh/id_rsa_home

That’s it!

Futures and Promises in Scala

The wikipedia article for Futures and promises opens up with this paragraph, which I thought is the perfect definition:

In computer science, future, promise, delay, and deferred refer to constructs used for synchronizing program execution in some concurrent programming languages. They describe an object that acts as a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.

In today’s article, I’ll walk you through the creation and management of the future and promise construct in the Scala language.

Execution context

Before continuing with the article, we need to make a special note about the ExecutionContext. Futures and promises both use the execution context to perform the execution of their computations.

Any of the operations that you’ll write out to start a computation requires an ExecutionContext as a parameter. These can be passed implicitly, so it’ll be a regular occurrence where you’ll see the following definition:

// define the implicit yourself
implicit val ec: ExecutionContext = ExecutionContext.global

// or - import one already defined
import ExecutionContext.Implicits.global

ExecutionContext.global is an ExecutionContext that is backed by a ForkJoinPool.

Futures

We create a Future in the following ways:

/* Create a future that relies on some work being done
   and that emits its value */
val getName = Future {
  // simulate some work here
  Thread.sleep(100)
  "John"
}

/* Create an already resolved future; no need to wait
   on the result of this one */
val alreadyGotName = Future.successful("James")

/* Create an already rejected future */
val badNews = Future.failed(new Exception("Something went wrong"))

With a future, you set some code in place to handle both the success and fail cases. You use the onComplete function to accomplish this:

getName onComplete {
  case Success(name) => println(s"Successfully got $name")
  case Failure(e) => e.printStackTrace()
}

Using a for-comprehension or map/flatMap, you can perform functional composition on your Future so that adds something extra through the pipeline. In this case, we’re going to prefix the name with a message should it start with the letter “J”:

val greeting = for {
  name <- getName
  if name.startsWith("J")
} yield s"Hello there, $name!"

Blocking

If you really need to, you can make your future block.

val blockedForThisName = Future {
  blocking {
    "Simon"
  }
}

Promises

The different between a Future and a Promise is that a future can be thought of as a read-only container. A promise is a single-assignment container that is used to complete a future.

Here’s an example.

val getNameFuture = Future { "Tom" }
val getNamePromise = Promise[String]()

getNamePromise completeWith getNameFuture

getNamePromise.future.onComplete {
  case Success(name) => println(s"Got the name: $name")
  case Failure(e) => e.printStackTrace()
}

getNamePromise has a future that we access through the future member. We treat it as usual with onComplete. It knows that it needs to resolve because of the completeWith call, were we’re telling getNamePromise to finish the getNameFuture future.