Hadoop – Text file to Sequence File

Hadoop supports many input and output file formats. In this post we will read a text file and save the results in SequenceFile format.

The input file is a CSV with the following fields:

user_id,song_id,listen_count,title,artist,song

Our requirement is to group the songs by artist, so the resulting file will have a single record for each artist, containing all the songs by that artist.

Map :

First we need to ignore the record at offset 0 (the key passed to the map), as that line contains the field names.


// The key is the byte offset of the line; offset 0 is the header line
if (key.get() == 0) {
    return;
}

Then split the record on commas and write field 4 (artist) as the key and field 5 (song) as the value. Field positions are zero-based.


String[] tokens=value.toString().split(",");
context.write(new Text(tokens[4]), new Text(tokens[5]));
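The split above assumes every line has at least six fields. A minimal sketch of the same parsing step in plain Java (no Hadoop needed, so it can be tried standalone) with a guard for short lines is shown below. The class name `SongLineParser` and the method `parse` are hypothetical names used only for illustration, and the naive comma split still does not handle quoted fields that contain commas.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class SongLineParser {
    // Zero-based positions in: user_id,song_id,listen_count,title,artist,song
    public static final int ARTIST = 4;
    public static final int SONG = 5;

    /**
     * Returns an (artist, song) pair for a data line,
     * or null when the line has fewer than six fields.
     */
    public static Map.Entry<String, String> parse(String line) {
        String[] tokens = line.split(",");
        if (tokens.length <= SONG) {
            return null; // malformed or incomplete line, skip it
        }
        return new SimpleEntry<>(tokens[ARTIST].trim(), tokens[SONG].trim());
    }

    public static void main(String[] args) {
        // Sample lines are made up for this sketch
        System.out.println(parse("u1,s1,3,Title A,Artist X,Song A"));
        System.out.println(parse("broken,line"));
    }
}
```

In the mapper, a null result would simply mean returning without calling context.write.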

Reducer :

On the reducer side we just need to collect all the songs into a single Writable.


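ArrayList<Text> vs=new ArrayList<Text>();
for(Text t:values){
// Hadoop reuses the same Text instance across iterations, so store a copy
vs.add(new Text(t));
}
// The list is serialized as a string such as [song1, song2]
context.write(key, new Text(vs.toString()));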
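To see what shape the reducer output takes, here is a small in-memory sketch of the grouping in plain Java. It only imitates what the shuffle and the reducer do together and is not Hadoop code. The class name `GroupSongsSketch` and the sample data are made up for illustration. In a real job the order of the songs within one artist is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupSongsSketch {

    /**
     * Takes (artist, song) pairs and returns one entry per artist,
     * with the songs rendered the same way ArrayList.toString() does.
     */
    public static Map<String, String> group(List<String[]> artistSongPairs) {
        // Imitates the shuffle: collect all values that share a key
        Map<String, List<String>> byArtist = new TreeMap<>();
        for (String[] pair : artistSongPairs) {
            byArtist.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // Imitates the reducer: one output record per artist
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byArtist.entrySet()) {
            out.put(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"Artist X", "Song A"},
                new String[] {"Artist Y", "Song B"},
                new String[] {"Artist X", "Song C"});
        for (Map.Entry<String, String> e : group(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```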

Provided the job's output format is set to SequenceFileOutputFormat in the driver, the reducer output will be saved in SequenceFile format. In the reducer there is a better way to save the song collection: instead of adding the songs to an ArrayList and then converting it to a Text, we can use ArrayWritable.
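The post does not show the driver, so the following is only a sketch of how the whole job might be wired together, assuming the newer org.apache.hadoop.mapreduce API and Hadoop on the classpath. The class names `SongsByArtistJob`, `SongMapper`, `SongReducer` and `TextArrayWritable` are hypothetical and are not taken from the linked project. It also shows the ArrayWritable variant of the reducer. ArrayWritable needs a small subclass with a no-argument constructor so that the values can be deserialized when the file is read back.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SongsByArtistJob {

    /** Hypothetical helper: an ArrayWritable that always holds Text elements. */
    public static class TextArrayWritable extends ArrayWritable {
        public TextArrayWritable() { super(Text.class); }
        public TextArrayWritable(Text[] values) { super(Text.class, values); }
    }

    public static class SongMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (key.get() == 0) {
                return; // header line
            }
            String[] tokens = value.toString().split(",");
            if (tokens.length < 6) {
                return; // skip incomplete lines
            }
            context.write(new Text(tokens[4]), new Text(tokens[5]));
        }
    }

    public static class SongReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<Text> songs = new ArrayList<>();
            for (Text t : values) {
                songs.add(new Text(t)); // copy, the iterator reuses the instance
            }
            context.write(key, new TextArrayWritable(songs.toArray(new Text[0])));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input CSV path, args[1] = output directory (assumed usage)
        Job job = Job.getInstance(new Configuration(), "songs by artist");
        job.setJarByClass(SongsByArtistJob.class);
        job.setMapperClass(SongMapper.class);
        job.setReducerClass(SongReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TextArrayWritable.class);
        // This line is what switches the output from plain text to a SequenceFile
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the reducer keeps writing a Text value instead, the only changes would be the reducer's output type and setOutputValueClass(Text.class).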

Full project is available here
