Running a MapReduce Job on Avro Data Files
In the previous blog we saw how to convert a text file into an Avro data file. In this blog we will see how to run a MapReduce job on an Avro data file, using the output generated by the TextToAvro job.
The job will count the number of films released in each year. The map function extracts the year from each record and sends it to the reducer as an AvroKey.
Input file schema:
Though the schema contains all the fields present in the input data, we will only be using the year field.
{ "name":"movies", "type":"record", "fields":[ {"name":"movieName", "type":"string" }, {"name":"year", "type":"string" }, {"name":"tags", "type":{"type":"array","items":"string"} } ] }
We will be using TextOutputFormat to write the output.
Program
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CountMoviesByYear extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new CountMoviesByYear(), args);
        System.out.println("Exit code " + exitCode);
    }

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "CountMoviesByYear");
        job.setJarByClass(getClass());
        job.setMapperClass(MovieCounterMapper.class);
        job.setReducerClass(MovieCounterReducer.class);

        // Register the input schema so AvroKeyInputFormat can deserialize the records.
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(getClass().getResourceAsStream("movies.avsc"));
        AvroJob.setInputKeySchema(job, schema);

        // The intermediate (map output) key/value are Avro types, so their schemas
        // must be registered as well.
        AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
        AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.INT));

        // Final output is plain text: year TAB count.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class MovieCounterMapper
            extends Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<String>, AvroValue<Integer>> {

        @Override
        public void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            // Avro returns string fields as Utf8, so convert before emitting (year, 1).
            String year = key.datum().get("year").toString();
            context.write(new AvroKey<String>(year), new AvroValue<Integer>(1));
        }
    }

    public static class MovieCounterReducer
            extends Reducer<AvroKey<String>, AvroValue<Integer>, Text, IntWritable> {

        @Override
        public void reduce(AvroKey<String> key, Iterable<AvroValue<Integer>> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (AvroValue<Integer> value : values) {
                count += value.datum().intValue();
            }
            context.write(new Text(key.datum().toString()), new IntWritable(count));
        }
    }
}
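As a quick cross-check of what the job should produce, the same per-year counts can be computed locally by streaming the Avro file with a GenericDatumReader and a HashMap. This is only a sanity-check sketch, not part of the job; the local path part-r-00000.avro is an assumption.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class LocalMovieCount {
    public static void main(String[] args) throws Exception {
        // Assumed local copy of the Avro input; adjust the path as needed.
        File avroFile = new File("part-r-00000.avro");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(avroFile, new GenericDatumReader<GenericRecord>());
        try {
            for (GenericRecord record : reader) {
                // Avro hands back string fields as Utf8, so convert explicitly.
                String year = record.get("year").toString();
                Integer current = counts.get(year);
                counts.put(year, current == null ? 1 : current + 1);
            }
        } finally {
            reader.close();
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}

If the numbers printed by this sketch match the text output of the MapReduce job, the schema and the year extraction are wired up correctly.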
Running the Job
export HADOOP_CLASSPATH=/path/to/avro-mapred-1.7.4-hadoop2.jar
yarn jar HadoopAvro-1.0-SNAPSHOT.jar CountMoviesByYear -libjars avro-mapred-1.7.4-hadoop2.jar movies/part-r-00000.avro moviecount1
Output of the job