Oozie action type: map-reduce

The most common action type you will find in an Oozie workflow is the <map-reduce> action type. In this blog we will see how to define a map-reduce action.

The map-reduce action contains the following elements. You need to define them in the same order as shown here; you can skip an element if it is not required, but the order must be maintained. The job configuration is loaded in the same order as the elements appear in the workflow. For example, if you specify a mapper in the streaming element and then specify mapred.mapper.class again in the configuration element, the configuration value takes effect and the streaming mapper is ignored. The configuration properties are loaded in the following order: streaming, job-xml and configuration, and later values override earlier values. The inline elements can be parameterized using EL expressions. A skeleton showing all the elements in order follows the list below.

<job-tracker> required element

You can specify the JobTracker RPC address, or the ResourceManager address when running on YARN, in this element.

<job-tracker>localhost:8032</job-tracker>

<name-node> required element

For specifying the NameNode address.

<name-node>hdfs://localhost:8020</name-node>

<prepare>

<streaming> or <pipes>

<job-xml>

<configuration>

<file>

<archive>
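
Putting the ordering together, a map-reduce action skeleton looks roughly like this (bracketed values are placeholders; only job-tracker and name-node are mandatory):

<action name="[NODE-NAME]">
    <map-reduce>
        <job-tracker>[JOB-TRACKER]</job-tracker>
        <name-node>[NAME-NODE]</name-node>
        <prepare> ... </prepare>
        <streaming> ... </streaming>   <!-- or <pipes> ... </pipes> -->
        <job-xml>[JOB-XML-FILE]</job-xml>
        <configuration> ... </configuration>
        <file>[FILE-PATH]</file>
        <archive>[ARCHIVE-PATH]</archive>
    </map-reduce>
    <ok to="[NEXT-NODE]"/>
    <error to="[ERROR-NODE]"/>
</action>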

Let's see what each element means.

<prepare>

The prepare element is a kind of pre-processor for your job; it can delete or create directories that the job depends on. For example, a map-reduce job will fail if the output directory already exists, so you can use prepare to delete the output directory before the job starts. The operations are performed against the filesystem configured as fs.default.name.

<delete path="[PATH]"/>
<mkdir path="[PATH]"/>
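
For example, a prepare block that removes a stale output directory and creates a scratch directory could look like this (the paths are illustrative and, like other inline elements, can be parameterized with EL expressions):

<prepare>
    <delete path="${nameNode}/user/${wf:user()}/wordcount/output"/>
    <mkdir path="${nameNode}/user/${wf:user()}/wordcount/tmp"/>
</prepare>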

<streaming> or <pipes>

You can write your map-reduce program in another language; Hadoop supports this through streaming and pipes. You can't use both of them in a single map-reduce action.

<streaming>
    <mapper>[MAPPER-PROCESS]</mapper>
    <reducer>[REDUCER-PROCESS]</reducer>
    <record-reader>[RECORD-READER-CLASS]</record-reader>
    <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
    ...
    <env>[NAME=VALUE]</env>
    ...
</streaming>
<!-- Either streaming or pipes can be specified for an action, not both -->
<pipes>
    <map>[MAPPER]</map>
    <reduce>[REDUCER]</reduce>
    <inputformat>[INPUTFORMAT]</inputformat>
    <partitioner>[PARTITIONER]</partitioner>
    <writer>[OUTPUTFORMAT]</writer>
    <program>[EXECUTABLE]</program>
</pipes>
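
As a concrete case, a streaming job that uses standard Unix commands as the mapper and reducer could be declared like this (any executables available on the cluster nodes would work):

<streaming>
    <mapper>/bin/cat</mapper>
    <reducer>/usr/bin/wc</reducer>
</streaming>
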
<job-xml>

If your Hadoop job already has its configuration defined in an XML file, you can use this element to specify the path to that file (the file should be bundled in the workflow application). The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml file or in the inline configuration.

<job-xml>/mrjob.xml</job-xml>
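
The job-xml file is just an ordinary Hadoop configuration file; for example, a minimal mrjob.xml setting only the queue name might look like this (the property and value are illustrative):

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.job.queuename</name>
        <value>default</value>
    </property>
</configuration>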

<configuration>

You can specify your job configuration here as name/value pairs. You don't need to write any driver code to start the map-reduce job; Oozie takes care of that, and you can specify all your settings in these elements. If you already have a driver class, you can use the <config-class> element instead.

<configuration>
    <property>
        <name>[PROPERTY-NAME]</name>
        <value>[PROPERTY-VALUE]</value>
    </property>
    ...
</configuration>
<config-class>com.example.MyConfigClass</config-class>

From workflow schema version 0.5 you can also use this new <config-class> element to specify a driver class (the class should implement the org.apache.oozie.action.hadoop.OozieActionConfigurator interface).
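
Here is a minimal sketch of such a driver class, matching the com.example.MyConfigClass name used in the template above; the identity mapper/reducer and the input/output paths are placeholders you would replace with your own, and the class has to be available to the action, for example bundled in one of the workflow's lib/ jars:

package com.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.oozie.action.hadoop.OozieActionConfigurator;
import org.apache.oozie.action.hadoop.OozieActionConfiguratorException;

public class MyConfigClass implements OozieActionConfigurator {

    @Override
    public void configure(JobConf actionConf) throws OozieActionConfiguratorException {
        // Oozie calls configure() with the action's JobConf before submitting the job,
        // so anything set here becomes part of the job configuration.
        // Identity mapper/reducer are stand-ins; a real job would set its own classes.
        actionConf.setMapperClass(IdentityMapper.class);
        actionConf.setReducerClass(IdentityReducer.class);
        actionConf.setOutputKeyClass(LongWritable.class);
        actionConf.setOutputValueClass(Text.class);
        // placeholder input/output locations
        FileInputFormat.setInputPaths(actionConf, new Path("/input/path"));
        FileOutputFormat.setOutputPath(actionConf, new Path("/output/path"));
    }
}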

<file> and <archive>

These elements provide a way to distribute the files and archives required by the job across the cluster. They work the same way as the Hadoop -files and -archives command-line options. The fragment after the # defines a symbolic link name that the job can use to refer to the file or directory.

<file>hdfs://localhost:8020/user/myUser/wf/myDir1/myFile.txt#myfile</file>
<archive>hdfs://localhost:8020/user/myUser/wf/mytar.tgz#mygzdir</archive>

Example Workflow with a map-reduce action

We will see how to run a map-reduce job (written using the new MapReduce API) from an Oozie workflow, using the word-count example.


<workflow-app name="Mapreduce_Job" xmlns="uri:oozie:workflow:0.5">
    <start to="wordcount"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${output}</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>net.icircuit.hadoop.wordcount.WcMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>net.icircuit.hadoop.wordcount.WcReducer</value>
                </property>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.map.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.map.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

You need to provide a properties file with the following parameters:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8050
oozie.use.system.libpath=true
userhome=${nameNode}/user/${user.name}
oozie.wf.application.path=${userhome}/WordCountOozieWorkflow/
oozie.libpath=hdfs://localhost:8020/user/oozie/share/lib
input=/input/path
output=/output/path

You need to pack the mapper and reducer classes into a jar and place it in the lib/ folder under the workflow root folder. The workflow root folder should contain workflow.xml. Place the root folder on HDFS and set its path as oozie.wf.application.path in the job properties file.
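
For example, assuming the jar is named wordcount.jar (the name is illustrative), the application folder would look like this before copying it to HDFS:

WordCountOozieWorkflow/
├── workflow.xml
└── lib/
    └── wordcount.jar

# copy the folder to the path set as oozie.wf.application.path
hdfs dfs -put WordCountOozieWorkflow /user/<your-user>/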

To run the workflow:

oozie job -run -config Job.properties
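
The oozie CLI needs to know the server URL, either through the -oozie option or the OOZIE_URL environment variable (the URL below assumes a default local installation); the run command prints a job id that you can use to check progress:

export OOZIE_URL=http://localhost:11000/oozie
# check the status using the job id printed by the run command
oozie job -info <job-id>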
