In today’s post, I’m going to walk through creating a simple SOAP web service in Java using Maven and JAX-WS. The service will be hosted inside Apache Tomcat once we’re up and running.
Maven
First off, we create the application with Maven.
This creates our project structure and puts all of the project dependencies in place. The pom.xml that gets generated for us needs a little extra help for a JAX-WS project. We need to:
We now write our service implementation. For the purposes of this article, it will be very simple. I took over the pre-generated App.java and renamed it to HelloService.java for my purposes.
We tell the JAX-WS framework that we have a service listening at a given endpoint by way of the sun-jaxws.xml file. Create this file in src/main/webapp/WEB-INF. It should look like this:
To let Tomcat know from a deployment perspective what this application will handle, we also create a web.xml file that will be located in the same directory, src/main/webapp/WEB-INF. It looks like this:
Now that the service is up and running, we want to test it to make sure it’s working. SOAP requests are HTTP POSTs. Sending the following request:
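A request of this kind can be sketched from Java with nothing but the standard library. This is a minimal sketch, not the request from the original article: the namespace (http://example.com/), the operation name (sayHello), the argument element (arg0) and the endpoint URL in the usage note are all assumptions chosen for illustration, so substitute whatever your generated WSDL actually declares.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class SoapHelloClient {

    // Hypothetical target namespace; use the one from your own WSDL.
    static final String NAMESPACE = "http://example.com/";

    // Escape the characters that would otherwise break the XML body.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build a SOAP 1.1 envelope calling a hypothetical sayHello operation.
    static String buildEnvelope(String name) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<soapenv:Envelope"
            + " xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\""
            + " xmlns:ns=\"" + NAMESPACE + "\">"
            + "<soapenv:Body>"
            + "<ns:sayHello><arg0>" + escape(name) + "</arg0></ns:sayHello>"
            + "</soapenv:Body>"
            + "</soapenv:Envelope>";
    }

    // POST the envelope to the endpoint and return the HTTP status code.
    static int post(String endpoint, String envelope) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", "\"\"");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(envelope.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

With the service deployed, a call such as SoapHelloClient.post("http://localhost:8080/hello-service/hello", SoapHelloClient.buildEnvelope("World")) would send the request, where the URL is again only a placeholder for wherever Tomcat exposes your endpoint.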
In a previous post we went through a fairly simple example of how to get up and running quickly with Apache Hive. In today’s post I’ll take a deeper dive and look a little closer at the different aspects of using it.
For the examples listed in this blog post, I’m using data that can be downloaded from the FAA site.
Databases
Your first job, much the same as with any database system, is to create a database.
hive> CREATE DATABASE first;
OK
Time taken: 0.793 seconds
hive> USE first;
OK
Time taken: 0.037 seconds
You can also use IF NOT EXISTS and IF EXISTS in your creation and destruction statements so that they only act when something is or isn’t already there.
hive> CREATE DATABASE IF NOT EXISTS first;
OK
Time taken: 0.065 seconds
hive> DROP DATABASE IF EXISTS first;
OK
Time taken: 0.26 seconds
Tables
To create a table that’s managed by the Hive warehouse, we can use the following.
hive> CREATE TABLE airports (
> iata STRING, airport STRING, city STRING,
> state STRING, country STRING,
> lat DECIMAL, long DECIMAL
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
OK
Time taken: 0.324 seconds
This table can then be filled with data that is sourced locally:
hive> LOAD DATA LOCAL INPATH '/srv/airports.csv'
> OVERWRITE INTO TABLE airports;
Loading data to table faa.airports
Table faa.airports stats: [numFiles=1, numRows=0, totalSize=244383, rawDataSize=0]
OK
Time taken: 1.56 seconds
You can also create an external table using the following syntax:
hive> CREATE EXTERNAL TABLE carriers (
> code STRING, description STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
> LOCATION '/user/root/carriers';
OK
Time taken: 0.408 seconds
You can see that this has used a location hosted on HDFS as the data source. The idea is that the existing data (that we’d specified in the LOCATION clause) will now be accessible to Hive through this table.
From the wiki:
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
It’s important to note that when you DROP an external table, the underlying data is NOT deleted.
Views
You can provide a more targeted representation of your data to your users by offering them views. Views also allow you to specify aggregate functions as columns. In the following view, we simply retrieve all of the countries in which an airport is located, along with the number of airports located in each country.
hive> CREATE VIEW airports_per_country_vw
> AS
> SELECT country, COUNT(*) AS country_count
> FROM airports
> GROUP BY country;
OK
Time taken: 0.134 seconds
Partitions and Buckets
Because you’ll be working with very large data sets, Hive offers you the ability to partition data on columns that you nominate. These partitions can then be broken down even further into buckets.
From the wiki:
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
So this technique does change the way data is physically structured on disk. It tries to structure the data so that it favours the performance of the queries that you’re running. Of course, this is up to you, as you need to define which fields to partition and cluster by.
Here’s the airports table, partitioned by country.
hive> CREATE EXTERNAL TABLE airport_part_by_country (
> iata STRING, airport STRING, city STRING,
> state STRING, lat DECIMAL, long DECIMAL
> ) PARTITIONED BY (country STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
> LOCATION '/user/root/partitioned';
OK
Time taken: 0.128 seconds
When this table gets clustered into buckets, the database developer needs to specify the number of buckets to distribute the data across. From here, Hive decides which bucket to place each row into with the following formula: hash_function(bucketing_column) mod num_buckets
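A minimal sketch of that bucket assignment in Java may help make it concrete. It assumes the usual "hash the clustering column, then take it modulo the bucket count" rule, and it uses String.hashCode() purely as a stand-in: the hash Hive really applies depends on the column type and the Hive version, so the bucket numbers printed here are illustrative and are not guaranteed to match what Hive writes to disk. The class and method names are hypothetical.

```java
class BucketSketch {

    // Map a clustering-column value to a bucket index in [0, numBuckets).
    // String.hashCode() is an assumption standing in for Hive's own hash.
    static int bucketFor(String clusterValue, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = (clusterValue == null) ? 0 : clusterValue.hashCode();
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // 100 buckets, clustering on state, as in the airports_b table.
        for (String state : new String[] {"CA", "NY", "TX"}) {
            System.out.println(state + " -> bucket " + bucketFor(state, 100));
        }
    }
}
```

The point to take away is that rows with the same value in the clustering column always land in the same bucket, which is what lets Hive prune or co-locate data for some queries.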
We then create and fill the bucketed store like so:
-- create the bucketed store
hive> CREATE EXTERNAL TABLE airports_b (
> iata string, airport string, city string,
> state string, lat decimal, long decimal
> ) PARTITIONED BY (country string)
> CLUSTERED BY (state) INTO 100 BUCKETS;
-- fill the bucketed store
hive> set hive.enforce.bucketing = true;
hive> FROM airports
> INSERT OVERWRITE TABLE airports_b
> PARTITION (country='USA')
> SELECT iata, airport, city, state, lat, long;
The following post is a quick guide to getting around the nmap network administration and security tool.
General scanning
Scanning with nmap gives you insight into what a server exposes from an external user’s perspective. Information about the techniques that nmap will use can be found here.
# cloak a scan with decoys
nmap -n -D decoy1.example.com,decoy2.example.com 192.168.0.1
# scan with a spoofed mac address
nmap --spoof-mac MAC-ADDRESS-HERE 192.168.0.1
# scan with a random mac address
nmap -v -sT -PN --spoof-mac 0 192.168.0.1
In today’s post, I’m going to refresh the information in some previous articles that I have written to bring them up to date.
Job, Mapper and Reducer
It’s pretty easy to bundle your job, mapper and reducer together. If they’re small enough, it makes sense to do so.
public class MyJob extends Configured implements Tool {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    // value read from the job configuration during setup
    private String val;

    @Override
    public void setup(Context context) {
      /* setup any configs from the command line */
      this.val = context.getConfiguration().get("some.value");
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      /* data selection and source filtering here */
    }
  }

  public static class MyReducer extends Reducer<Text, Text, NullWritable, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      /* data aggregators here */
    }
  }
}
There isn’t much that has changed here. Reading the type parameters can be a little hairy. You can always look up the documentation for Mapper and for Reducer. The type definitions are uniform: Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
The ToolRunner class simplifies the execution management of a MapReduce job through the Tool interface. Tool is a very simple interface that only gives implementing classes a contract to run. The run method looks like this: int run(String[] args) throws Exception;
args are supplied as usual from the main method of the job. A typical implementation of the run method will retrieve configuration information, set up the job and execute it.
public int run(String[] allArgs) throws Exception {
  Job job = Job.getInstance(getConf());
  job.setJarByClass(MyJob.class);

  // basic I/O shape setup
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  job.setOutputKeyClass(NullWritable.class);
  job.setOutputValueClass(Text.class);

  // map, combine, partition, reduce setup
  job.setMapperClass(MyMapper.class);
  job.setCombinerClass(MyCombiner.class);
  job.setReducerClass(MyReducer.class);
  job.setNumReduceTasks(1);

  // parse options passed to the job
  String[] args = new GenericOptionsParser(getConf(), allArgs).getRemainingArgs();

  // set the files (from arguments)
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  // wait for the job to finish
  boolean status = job.waitForCompletion(true);
  return status ? 0 : 1;
}
A special part of the magic here is wrapped up in the GenericOptionsParser which takes in the standard set of command line parameters and plumbs them directly into the job’s configuration.
Finishing up
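So there are a couple of features that are provided for you with this wrapper around the run function. Your main method ends up very simple: it passes a new instance of the job and the command line arguments to ToolRunner.run, then exits with the status code that comes back.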