Hadoop – Text file to Sequence File
Hadoop supports many input/output file formats. In this blog we will see how to read a text file and save the results in Sequence file format.
The input file is a CSV with the following fields:
user_id,song_id,listen_count,title,artist,song
Our requirement is to group the songs by artist, so the resulting file will have a single record per artist containing all the songs sung by that artist.
Map:
First we need to ignore the record at offset 0 (the key passed to the map), as that record contains the field names (the CSV header).
if(key.get()==0){ return; }
Then split the record on commas and write field index 4 (artist) as the key and field index 5 (song) as the value:
String[] tokens = value.toString().split(",");
context.write(new Text(tokens[4]), new Text(tokens[5]));
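Putting the map logic together, a minimal Mapper could look like the sketch below (the class name SongMapper is illustrative, not taken from the original project):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SongMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // skip the header row, which starts at byte offset 0
        if (key.get() == 0) {
            return;
        }
        String[] tokens = value.toString().split(",");
        // index 4 = artist, index 5 = song
        context.write(new Text(tokens[4]), new Text(tokens[5]));
    }
}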
Reducer:
On the reducer side we just need to collect all the songs for an artist into a single writable value:
ArrayList<Text> vs = new ArrayList<Text>();
for (Text t : values) {
    vs.add(new Text(t)); // copy the value: Hadoop reuses the same Text instance across iterations
}
context.write(key, new Text(vs.toString()));
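As a complete unit, the reducer could be sketched as follows (again, SongReducer is an illustrative name):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SongReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ArrayList<Text> vs = new ArrayList<Text>();
        for (Text t : values) {
            vs.add(new Text(t)); // copy; the framework reuses the Text instance
        }
        context.write(key, new Text(vs.toString()));
    }
}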
The reducer output will now be saved in Sequence file format (the output format is configured in the job driver, shown below). In the reducer you can use a better approach to save the songs collection: instead of adding the songs to an ArrayList and converting it back to a Text, we can use an ArrayWritable.
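If you go the ArrayWritable route, it usually needs a small subclass so that Hadoop knows the element type. A possible sketch (TextArrayWritable is an assumed name, not from the original project):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class); // needed so the framework can deserialize the value
    }
    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}

The reducer would then write a TextArrayWritable built from the collected songs, and the job's output value class would change to TextArrayWritable.class accordingly.

The Sequence file output itself is configured in the driver by setting the output format class. A minimal driver, assuming the mapper and reducer sketched above, might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "text to sequence file");
        job.setJarByClass(TextToSequenceFileDriver.class);

        job.setMapperClass(SongMapper.class);
        job.setReducerClass(SongReducer.class);

        job.setOutputKeyClass(Text.class);
        // if the reducer emits TextArrayWritable instead, set that class here
        job.setOutputValueClass(Text.class);

        // read plain text, write a Sequence file
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}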
The full project is available here.