Hadoop supports many input/output file formats. In this blog we will see how to read a text file and save the results in SequenceFile format.
The input file is a CSV with the following fields:
user_id,song_id,listen_count,title,artist,song
Our requirement is to group the songs by artist, so the resulting file will have a single record for each artist containing all the songs sung by that artist.
Map:
First we need to ignore the record with byte offset 0 (the key passed to the map), as that record contains the header with the field names.
Then split the record on "," and write field 4 (artist) as the key and field 5 (song) as the value (the indexes are zero-based):
String[] tokens = value.toString().split(",");
context.write(new Text(tokens[4]), new Text(tokens[5]));
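Put together, a minimal sketch of the whole mapper could look like the one below. The class name SongMapper and the byte-offset check for skipping the header are my assumptions, not part of the original snippet.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (artist, song) pairs from each CSV record.
public class SongMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip the header row, which starts at byte offset 0.
        if (key.get() == 0) {
            return;
        }
        String[] tokens = value.toString().split(",");
        // tokens[4] = artist, tokens[5] = song
        context.write(new Text(tokens[4]), new Text(tokens[5]));
    }
}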
Reducer:
On the reducer side we just need to collect all of an artist's songs and convert them into a single Writable:
ArrayList<Text> vs = new ArrayList<Text>();
for (Text song : values) {
    vs.add(new Text(song));
}
context.write(key, new Text(vs.toString()));
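Wrapped in a complete class, the reducer might look roughly like the sketch below. The class name SongReducer is an assumption; the copy via new Text(song) is there because Hadoop reuses the value object between iterations.

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: concatenates all songs of an artist into one record.
public class SongReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ArrayList<Text> vs = new ArrayList<Text>();
        for (Text song : values) {
            // Copy each value before storing it, since Hadoop reuses the object.
            vs.add(new Text(song));
        }
        context.write(key, new Text(vs.toString()));
    }
}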
Now the reducer output will be saved in SequenceFile format. In the reducer you can use a better approach to store the song collection: instead of adding the songs to an ArrayList and converting it back to a Text, we can use an ArrayWritable.
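ArrayWritable has no no-argument constructor and does not serialize its element type, so the usual pattern is a small subclass that fixes the element type. A hedged sketch (TextArrayWritable is a hypothetical name):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Hypothetical subclass so the element type (Text) is known when the
// SequenceFile is read back.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}

In the reducer you would then build a Text[] from the collected songs, emit new TextArrayWritable(texts), and set job.setOutputValueClass(TextArrayWritable.class) in the driver.

Writing the output as a SequenceFile is just a matter of setting the output format in the job driver. Below is a minimal sketch, assuming the hypothetical SongMapper and SongReducer classes above and input/output paths passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SongsByArtist {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "songs by artist");
        job.setJarByClass(SongsByArtist.class);
        job.setMapperClass(SongMapper.class);
        job.setReducerClass(SongReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Read plain text, write the grouped records as a SequenceFile.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}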
The full project is available here.