We are going to feature a technical article once a week by eSage’s very own J’son Cannelos, Partner and Principal Architect. Check back every week for another Tech Talk with J’son. Have a question? Post a comment and he will be happy to answer.
Hadoop and a Beginning MapReduce Program
With all the hoopla about Hadoop lately, I’d like to discuss some of the components of MapReduce and how they are used to parse and process unstructured data in HDFS (Hadoop Distributed File System). Today, I will be discussing just the beginning of how a MapReduce program is built and run.
MapReduce is the primary Apache interface and programming system for processing data stored in HDFS. Unstructured data, like web logs, goes in one end, and data with more meaning (ahem, structure) comes out the other. Many details that you would otherwise have to code and account for manually, such as retrieving data from the correct HDFS node and uncompressing input data, are handled behind the scenes so you can focus on what you really want to do with your data. Even recent Apache toolsets like Pig and Hive, which make this type of processing available to the non-Java set, translate their scripts into MapReduce behind the scenes to crunch your data.
Building a MapReduce program begins with declaring a class that extends org.apache.hadoop.conf.Configured and implements org.apache.hadoop.util.Tool:
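import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BeginMapReduce extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception {
        // error handling if you are expecting a certain # of args (return -1)
        // JobConf setup (input/output paths, Mapper, Reducer) goes here
        JobConf conf = new JobConf(getConf(), getClass());
        // submit the job and block until it finishes
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner handles the standard Hadoop options, then calls run()
        int retVal = ToolRunner.run(new BeginMapReduce(), args);
        System.exit(retVal);
    }
}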
Tool is a helper interface. Along with ToolRunner, it ensures that the standard Hadoop command-line options (such as -D property=value, -conf and -fs) are parsed and applied to the configuration before run() is called, which lets you concentrate on any custom arguments that you would like to set at runtime.
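To make that division of labor concrete, here is a minimal plain-Java sketch of the idea. It is not Hadoop's own parser: the class GenericOptionsSketch and its two methods are hypothetical names invented for this illustration, and it only understands leading "-D key=value" pairs. It shows the general shape, though: standard options are peeled off first, and whatever remains is what your run() method gets to validate.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only: this is NOT Hadoop's parser, just a simplified
// stand-in for the option handling that happens before run() is called.
public class GenericOptionsSketch {

    // Copies leading "-D key=value" pairs into props and returns the
    // remaining (custom) arguments, i.e. what run() would receive.
    public static String[] stripGenericOptions(String[] args, Map<String, String> props) {
        int i = 0;
        while (i + 1 < args.length && args[i].equals("-D")) {
            String[] kv = args[i + 1].split("=", 2);
            if (kv.length != 2) {
                break; // malformed pair: leave it for the caller to inspect
            }
            props.put(kv[0], kv[1]);
            i += 2;
        }
        return Arrays.copyOfRange(args, i, args.length);
    }

    // Mirrors the "return -1 on bad arguments" convention from run().
    public static int checkArgs(String[] remaining) {
        if (remaining.length != 2) {
            System.err.println("Usage: BeginMapReduce [generic options] <input path> <output path>");
            return -1;
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        String[] remaining = stripGenericOptions(args, props);
        System.out.println("properties:  " + props);
        System.out.println("custom args: " + Arrays.toString(remaining));
        System.out.println("check:       " + checkArgs(remaining));
    }
}
```

In a real job you would not write any of the stripping code yourself; the point is only that by the time run() sees its args, the standard options are already gone and you are left checking your own.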
Configured is the main door into a MapReduce program: its getConf() method hands you the Hadoop configuration from which you build the all-important JobConf object. This is the main configuration object for your “job”. Here you will define the classes that represent the Mapper and the Reducer (and Combiner, Partitioner, et al.; more on those in another blog post). You can get a default JobConf by calling the following:
JobConf conf = new JobConf(getConf(), getClass());
If you have a lot of MapReduce programs that use the same settings, the configuration could come from a shared static helper class rather than from getConf(). Since most of our MR programs use the same input / output format classes and arguments, we have a separate jar called Commons that simply hands us a JobConf with most of the arguments set for us.
The JobConf object is also where you specify just where your input data resides, where you want the output data to go when finished, and what form the data is in. A basic setup:
FileInputFormat.addInputPath(conf, new Path("/data/moredata/20110404/*.log"));
FileOutputFormat.setOutputPath(conf, new Path("/data/outputs"));
The hardcoded paths I used above point to where the data is located in a typical HDFS setup. A more flexible option would be to make these variables (arguments) that are passed in off the command line (args[0], args[1], etc.). While it’s not very useful, the above code will pretty much run even without specifying a Mapper and Reducer class! That’s because JobConf falls back to default Mapper and Reducer classes for us: IdentityMapper.class and IdentityReducer.class. More on these and how they work in a future posting.
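If you are wondering what such a "do nothing" job actually produces, the following plain-Java simulation may help. It is not Hadoop code, and IdentityJobSketch is a hypothetical name made up for this post. It assumes the job's default text input and output formats, where each record handed to the mapper is a (byte offset, line) pair and each output line is written as key, tab, value. The identity mapper and reducer pass records through untouched, so the only visible effect is the sort by key that happens between the two phases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation, not Hadoop code: it mimics what a job with an
// identity mapper and an identity reducer does to line-oriented text input.
public class IdentityJobSketch {

    // Each string in 'files' stands in for the contents of one input file.
    public static List<String> run(String... files) {
        // The TreeMap plays the role of the shuffle: values grouped by key,
        // keys kept in sorted order.
        TreeMap<Long, List<String>> shuffled = new TreeMap<>();

        for (String file : files) {
            long offset = 0; // byte offset of the current line within its file
            for (String line : file.split("\n")) {
                // Identity map: emit (offset, line) exactly as received.
                shuffled.computeIfAbsent(offset, k -> new ArrayList<>()).add(line);
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
            }
        }

        // Identity reduce: for every key, emit every value unchanged.
        List<String> output = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : shuffled.entrySet()) {
            for (String value : entry.getValue()) {
                output.add(entry.getKey() + "\t" + value);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Two tiny fake "log files"; real input would come from HDFS.
        for (String line : run("a\nbb", "ccc")) {
            System.out.println(line);
        }
    }
}
```

Notice that lines from different files end up interleaved, because both files have a record at offset 0. In this sketch, values that share a key come out in the order they were inserted; a real job makes no such promise.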
That’s it for today. I hope this helps get you started in your exploration of Hadoop and MapReduce!