Create a UDF for Hive with Scala
17 Feb 2017

In today's post, I'm going to walk through the basic process of creating a user defined function for Apache Hive using Scala.
A quick _but important_ note: I needed to use JDK 1.7 to complete the following. Using 1.8 produced errors suggesting that Hive on my distribution of Hadoop didn't support it.
Set up your project
Create an sbt-based project, and start off by adding the following to your `project/assembly.sbt`.
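Something like this should do the trick; the plugin version here is an assumption from around this era, so use whichever release matches your sbt:

```scala
// project/assembly.sbt
// wires the sbt-assembly plugin into the build; the version is an
// assumption - use whichever release matches your sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```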
What this adds is the sbt-assembly plugin to your project, which allows you to bundle your Scala application up as a fat JAR. When we issue the command sbt assembly
at the console, we invoke this plugin to construct the fat JAR for us.
Now we fill out the `build.sbt`. We need to reference an external JAR called `hive-exec`. This JAR is available by itself from the Maven repository; I took a copy of mine from the Hive distribution installed on my server. Either way, it lands in the project's `lib` folder, where sbt picks it up as an unmanaged dependency.
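A minimal `build.sbt` along these lines should work; the project name and versions are assumptions, so adjust to suit:

```scala
// build.sbt
// name and versions are assumptions; adjust to suit your project.
// there's no libraryDependencies entry for hive-exec, because sbt
// treats any JAR sitting in lib/ as an unmanaged dependency
name := "hive-udf"

version := "0.1.0"

scalaVersion := "2.11.8"
```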
Write your function
Now it's time to actually start writing some functions. In the following module, we're just performing some basic string manipulation with `trim`, `toUpperCase` and `toLowerCase`, each of which is contained in its own class deriving from the `UDF` type. The listing below is a minimal sketch of that module; the class names are just illustrative, so name yours however you like:
scala/StringFunctions.scala
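```scala
// illustrative class names; Hive locates the evaluate method by reflection
import org.apache.hadoop.hive.ql.exec.UDF

class TrimString extends UDF {
  def evaluate(s: String): String =
    if (s == null) null else s.trim
}

class UpperString extends UDF {
  def evaluate(s: String): String =
    if (s == null) null else s.toUpperCase
}

class LowerString extends UDF {
  def evaluate(s: String): String =
    if (s == null) null else s.toLowerCase
}
```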
Now that we’ve written all of the code, it’s time to compile and assemble our JAR:
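```
sbt assembly
```

The fat JAR lands under `target/scala-2.11/`, named something like `hive-udf-assembly-0.1.0.jar` given the assumed settings above.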
To invoke
The first step here is copying the JAR across to somewhere accessible to Hive. Once that's done, we can start up the hive shell and add the JAR to the session.
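Assuming the JAR was copied to `/tmp` (the path is yours to choose):

```
$ hive
hive> ADD JAR /tmp/hive-udf-assembly-0.1.0.jar;
```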
Then, using the `CREATE FUNCTION` syntax, we can start to reference pieces of our module.
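I'm using the `TEMPORARY` variant here, which scopes the functions to the current session; the function names on the left are up to you, while the class names on the right have to match the classes in the JAR (here, the illustrative names from the sketch above):

```sql
CREATE TEMPORARY FUNCTION trim_string AS 'TrimString';
CREATE TEMPORARY FUNCTION upper_string AS 'UpperString';
CREATE TEMPORARY FUNCTION lower_string AS 'LowerString';
```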
We can now use our functions against any table with a string column; here I'll pretend we have a `people` table with a `name` column:
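```sql
-- 'people' is a hypothetical table with a string column called 'name'
SELECT trim_string(name),
       upper_string(name),
       lower_string(name)
FROM people;
```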
That’s it!