Cogs and Levers A blog full of technical stuff

Create a UDF for Hive with Scala

In today’s post, I’m going to walk through the basic process of creating a user defined function for Apache Hive using the Scala.

A quick _but important_ note: I needed to use the JDK 1.7 to complete the following. Using 1.8 saw errors that suggested that Hive on my distribution of Hadoop was not supported.

Setup your project

Create an sbt-based project, and start off adding the following to your project/assembly.sbt.

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

What this had added is the sbt-assembly to your project. This allows you to bundle your scala application up as a fat JAR. When we issue the command sbt assemble at the console, we invoke this plugin to construct the fat JAR for us.

Now we fill out the build.sbt. We need to reference an external JAR, called hive-exec. This JAR is available by itself from the maven repository. I took a copy of mine from the hive distribution installed on my server. Anyway, it lands in the project’s lib folder.

name := "hive-udf"
version := "1.0"
scalaVersion := "2.11.1"
unmanagedJars in Compile += file("./lib/hive-exec-2.1.1.jar")

Write your function

Now it’s time to actually start writing some functions. In the following module, we’re just performing some basic string manipulation with trim, toUpperCase and toLowerCase. Each of which is contained in its own class, deriving from the UDF type:

scala/StringFunctions.scala

package me.tuttlem.udf

import org.apache.hadoop.hive.ql.exec.UDF

class TrimString extends UDF {
  def evaluate(str: String): String = {
    str.trim
  }
}

class UpperCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toUpperCase
  }
}

class LowerCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toLowerCase
  }
}

Now that we’ve written all of the code, it’s time to compile and assemble our JAR:

$ sbt assemble

To invoke

Copying across the JAR into an accessible place for hive is the first step here. Once that’s done, we can start up the hive shell and add it to the session:

ADD JAR /path/to/the/jar/my-udfs.jar;

Then, using the CREATE FUNCTION syntax, we can start to reference pieces of our module:

CREATE FUNCTION trim as 'me.tuttlem.udf.TrimString';
CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';

We can now use our functions:

hive> CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
OK
Time taken: 0.537 seconds
hive> SELECT toUpperCase('a test string');
OK
A TEST STRING
Time taken: 1.399 seconds, Fetched: 1 row(s)

hive> CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';
OK
Time taken: 0.028 seconds
hive> SELECT toLowerCase('DON\'T YELL AT ME!!!');
OK
don't yell at me!!!
Time taken: 0.093 seconds, Fetched: 1 row(s)

That’s it!