Text To Avro Data File using Mapreduce

Sankar Cheppali | August 21, 2016 | Hadoop, Java | 3 Comments

We have already seen how to convert a text file to an Avro data file using a simple Java program. In this post we will see how to process a text file with MapReduce and store the result in an Avro data file. To run MapReduce jobs on Avro data files, see this blog.

Input File Format :

Our input file is a movie database in which each record has the following format:

serial number :: movie name (year)::tag1|tag2

Example records:

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's

Avro Schema for output file :

We can ignore the serial number and store the tags in an array. Check this blog to learn more about the supported types. The resulting schema is:

{
  "name": "movies",
  "type": "record",
  "fields": [
    {"name": "movieName", "type": "string"},
    {"name": "year", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}

We could also have stored the year as an integer (Avro type int) instead of a string.
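If the year were stored as an integer, the mapper would have to convert the extracted text to a number before putting it into the record. A minimal sketch of that conversion, assuming every title ends with a four-digit year in parentheses; YearParser and parseYear are hypothetical names and are not part of the original project:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in the original project): pulls the release
// year out of a title such as "Toy Story (1995)" as an int.
final class YearParser {

    // Matches a four-digit year in parentheses at the very end of the title.
    private static final Pattern TRAILING_YEAR = Pattern.compile("\\((\\d{4})\\)\\s*$");

    // Returns the year, or -1 when the title has no trailing "(yyyy)".
    static int parseYear(String title) {
        Matcher m = TRAILING_YEAR.matcher(title);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(parseYear("Toy Story (1995)"));
        System.out.println(parseYear("Untitled"));
    }
}
```

Returning -1 for a missing year is only one possible choice; a real job might instead skip the record or count it as malformed.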

The map function extracts the fields from each input record and constructs a generic record. The reduce function simply writes its key to the output; no processing is done in the reducer.
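The field extraction can be tried outside Hadoop before wiring it into a job. The sketch below uses only the JDK and follows the record format shown earlier (serial number, then name and year, then tags, separated by "::"); the class name MovieLineParser and its method names are illustrative assumptions, not taken from the original project:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-alone version of the parsing a mapper would do
// for records like "1::Toy Story (1995)::Animation|Children's|Comedy".
final class MovieLineParser {

    // Title without the trailing "(year)".
    static String movieName(String line) {
        String title = line.split("::")[1];
        return title.substring(0, title.lastIndexOf('(')).trim();
    }

    // Text between the last pair of parentheses in the title.
    static String releaseYear(String line) {
        String title = line.split("::")[1];
        return title.substring(title.lastIndexOf('(') + 1, title.lastIndexOf(')')).trim();
    }

    // Tags are separated by '|', which must be escaped because split() takes a regex.
    static List<String> tags(String line) {
        return Arrays.asList(line.split("::")[2].split("\\|"));
    }

    public static void main(String[] args) {
        String line = "1::Toy Story (1995)::Animation|Children's|Comedy";
        System.out.println(movieName(line));
        System.out.println(releaseYear(line));
        System.out.println(tags(line));
    }
}
```

The sketch assumes well-formed input; a line with fewer than three "::"-separated parts or without parentheses in the title would throw an exception.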

Program Code


import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MRTextToAvro extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MRTextToAvro(), args);
        System.out.println("Exit code " + exitCode);
        // Propagate the job status to the calling shell.
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "Text To Avro");
        job.setJarByClass(getClass());

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // movies.avsc is loaded from the classpath (packaged inside the job jar).
        Schema schema = new Schema.Parser().parse(getClass().getResourceAsStream("movies.avsc"));

        // Put the user-supplied jars ahead of the ones bundled with Hadoop.
        job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);

        // The Avro record is the key; the value is always NullWritable.
        AvroJob.setMapOutputKeySchema(job, schema);
        job.setMapOutputValueClass(NullWritable.class);
        AvroJob.setOutputKeySchema(job, schema);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);

        job.setMapperClass(TextToAvroMapper.class);
        job.setReducerClass(TextToAvroReduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class TextToAvroMapper
            extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {

        private Schema schema;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            // Parse the schema once per task instead of once per record.
            schema = new Schema.Parser().parse(getClass().getResourceAsStream("movies.avsc"));
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            GenericRecord record = new GenericData.Record(schema);
            String inputRecord = value.toString();

            record.put("movieName", getMovieName(inputRecord));
            record.put("year", getMovieReleaseYear(inputRecord));
            record.put("tags", getMovieTags(inputRecord));
            context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
        }

        // Title without the trailing "(year)".
        public String getMovieName(String record) {
            String movieName = record.split("::")[1];
            return movieName.substring(0, movieName.lastIndexOf('(')).trim();
        }

        // Text between the last pair of parentheses in the title.
        public String getMovieReleaseYear(String record) {
            String movieName = record.split("::")[1];
            return movieName.substring(movieName.lastIndexOf('(') + 1, movieName.lastIndexOf(')')).trim();
        }

        // Returned as a List because Avro's generic writer expects a
        // java.util.Collection for array fields, not a Java array.
        public List<String> getMovieTags(String record) {
            return Arrays.asList(record.split("::")[2].split("\\|"));
        }
    }

    public static class TextToAvroReduce
            extends Reducer<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {

        // Writes each distinct key once; identical records are grouped
        // into a single reduce call, so duplicates are collapsed.
        @Override
        public void reduce(AvroKey<GenericRecord> key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}

To run the job, we need to put the Avro-related library on the classpath:


export HADOOP_CLASSPATH=/path/to/targets/avro-mapred-1.7.4-hadoop2.jar

yarn jar HadoopAvro-1.0-SNAPSHOT.jar MRTextToAvro -libjars avro-mapred-1.7.4-hadoop2.jar /input/path output/path

The resulting Avro file

The full project is available on GitHub.

Tags:Avro, hadoop, Java, mapreduce

About The Author

Sankar Cheppali

3 Comments

bunny | November 28, 2017

What jar file should be added ? Because is showing error for me in (import org.apache.avro.*;) in this part of code

Sankar Cheppali | November 29, 2017

Hi,
all the dependencies are present in pom.xml. eclipse should pull them automatically

bunny | December 1, 2017

export HADOOP_CLASSPATH=/path/to/targets/avro-mapred-1.7.4-hadoop2.jar
yarn jar HadoopAvro-1.0-SNAPSHOT.jar MRTextToAvro -libjars avro-mapred-1.7.4-hadoop2.jar /input/path output/path

In this hadoop2.jar is exported jarfile from eclipse? and export hadoop_classpath is set in bashfile? can u explain the hadoop comment for executing this(like input file is record? where you are placing movies.avsc in hdfs) in detail. Im newbie to hadoop i cant understand the flow
