Tutorial 3 : Hadoop Map Reduce Multiple Output

  • The inverted index problem is one of the earliest and most common uses of MapReduce. The IndexInverterJob takes a set of <key,value> pairs and inverts the index, so that each value becomes a key.

 

  • First of all, you need to ensure that Hadoop is successfully installed on your machine. Check this link if you need to know how to install it.

 

  • Then start the Hadoop daemons by invoking these scripts:

             start-dfs.sh

             start-yarn.sh

 

  • One last step before starting: you need to copy the input files from your local file system into HDFS, and create the target directories in HDFS before copying.

So download the two input files (they are small files, just for testing): download link

 

  • After that, create the directories in HDFS by invoking: hdfs dfs -mkdir -p /training/lab3/inputs/

 

  • Then copy the files to HDFS by invoking a command like this: hdfs dfs -copyFromLocal <localPathOfFiles> /training/lab3/inputs/

For example, if you downloaded the files into Downloads/lab3/inputs/, then the command line should be: hdfs dfs -copyFromLocal ~/Downloads/lab3/inputs/* /training/lab3/inputs/

 

  • Now that everything is set up, let’s start coding. First, create a Job class that extends Configured (so you get the configuration from the installation files, “core-site.xml” etc.) and implements Tool (so you can invoke your job from the command line via the hadoop jar command). In this class you give the job its information: the input format, the output format, the mapper, the reducer, the key and value output types of the mapper and the reducer, etc.
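The original listing for the job class is not shown here, so below is only a minimal sketch of what such a class could look like. The class name IndexInverterJob comes from the introduction and the paths from the commands in this tutorial; the input format is an assumption (each input line is taken to be "url<TAB>keyword1,keyword2,..."), and the base Mapper and Reducer classes, which simply pass records through unchanged, stand in as placeholders for your own mapper and reducer.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexInverterJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the configuration loaded from core-site.xml etc.
        Job job = Job.getInstance(getConf(), "index inverter");
        job.setJarByClass(IndexInverterJob.class);

        // Assumption: input lines look like "url<TAB>keyword1,keyword2,..."
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Placeholders: the base classes are identity implementations.
        // Replace them with your own mapper and reducer classes.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Key and value types emitted by the mapper and by the reducer.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/training/lab3/inputs/"));
        FileOutputFormat.setOutputPath(job, new Path("/training/lab3/output"));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options before calling run().
        System.exit(ToolRunner.run(new IndexInverterJob(), args));
    }
}
```

Note that the job fails if the output directory already exists, so remove /training/lab3/output before running it a second time.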

 

  • Now let’s have a look at the mapper. The role of the mapper is to invert the key and the value: it receives a URL as the key and a comma-separated list of keywords as the value, and for each keyword it writes the keyword as the key and the URL as the value.
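To make the inversion concrete without needing a cluster, here is a small plain-Java sketch of the logic that the body of map() would carry out. The class and method names (InvertSketch, invert) are hypothetical; in a real mapper each returned pair would be emitted with context.write instead of being collected in a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class InvertSketch {

    // Turns one (url, "kw1,kw2,...") record into a list of (keyword, url) pairs.
    static List<Map.Entry<String, String>> invert(String url, String keywords) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String raw : keywords.split(",")) {
            String keyword = raw.trim();
            // Skip empty tokens such as the one produced by "a,,b".
            if (!keyword.isEmpty()) {
                pairs.add(new SimpleEntry<>(keyword, url));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // example.com is a made-up URL used only for illustration.
        System.out.println(invert("example.com", "hadoop, java"));
    }
}
```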

 

 

  • After that, Hadoop performs the shuffle: it regroups all (key, value) pairs that share the same key into (key, value 1, value 2 … value n) and passes each group to a reducer.
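Hadoop does this grouping for you, but a plain-Java sketch may help to picture what arrives at the reducer. The names below (ShuffleSketch, groupByKey) are hypothetical, and the sorted map only imitates the fact that a reducer sees its keys in sorted order; the order of the values within a group is not guaranteed by Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSketch {

    // Groups {key, value} pairs into key -> [value 1, value 2, ... value n].
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // a.com and b.org are made-up URLs used only for illustration.
        List<String[]> mapperOutput = Arrays.asList(
                new String[] {"hadoop", "a.com"},
                new String[] {"java", "a.com"},
                new String[] {"hadoop", "b.org"});
        System.out.println(groupByKey(mapperOutput));
    }
}
```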

 

  • Then each reducer (there may be several reducers) receives a key with its values, concatenates the values separated by commas, and writes the result to HDFS.
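The concatenation step can be sketched in plain Java as follows. The names (ReduceSketch, joinValues) are hypothetical; inside a real reducer the resulting string would be written with context.write(key, new Text(sb.toString())).

```java
import java.util.Arrays;

final class ReduceSketch {

    // Joins all values of one key into a single comma-separated string.
    static String joinValues(Iterable<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            // Add the separator before every value except the first one.
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a.com and b.org are made-up URLs used only for illustration.
        System.out.println(joinValues(Arrays.asList("a.com", "b.org")));
    }
}
```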

 

 

  • Once the code is written, export it as a runnable jar and specify IndexInverterJob as the main class, then open a terminal and run the job by invoking:

hadoop jar <nameOfTheJar.jar>. For example, if you named the jar lab3.jar, then the command line will be: hadoop jar lab3.jar

 

  • Have a look at the result by invoking:

hdfs dfs -cat /training/lab3/output/part-r-00000

  • Now let’s improve our code a little by changing the name of the output file. Add the line below to the job class:
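The original line is missing here. One way to get a custom output file name, and probably the one intended given the steps that follow, is to register a named output with MultipleOutputs. This is a sketch under that assumption; the helper class and the output name "invertedindex" are hypothetical.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

final class NamedOutputSetup {

    // Call this from the job's run() method, after the Job has been created.
    static void addInvertedIndexOutput(Job job) {
        // "invertedindex" is a hypothetical name; named outputs must be
        // alphanumeric. Reducer files are then named invertedindex-r-00000 etc.
        MultipleOutputs.addNamedOutput(job, "invertedindex",
                TextOutputFormat.class, Text.class, Text.class);
    }
}
```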

 

  • And in the reducer, instead of using context.write(key, new Text(sb.toString()));, use :
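The replacement call is missing here as well. Assuming the job registered a named output called "invertedindex" (a hypothetical name) with MultipleOutputs, the reducer could look roughly like this sketch; the class name is also hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class IndexInverterReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(value.toString());
        }
        // Write to the named output instead of calling context.write(...).
        outputs.write("invertedindex", key, new Text(sb.toString()));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Closing is required, otherwise the output may not be flushed.
        outputs.close();
    }
}
```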

 

  • If you run this job you will also get an empty file named part-r-00000. Can you get rid of it? Yes, use:
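The original line is missing; with the standard Hadoop API, this is done by setting the job's output format through LazyOutputFormat instead of calling job.setOutputFormatClass directly. A minimal sketch (the helper class name is hypothetical):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

final class LazyOutputSetup {

    // Call this from the job's run() method in place of
    // job.setOutputFormatClass(TextOutputFormat.class).
    static void useLazyTextOutput(Job job) {
        // Wraps TextOutputFormat so that the default part-r-xxxxx file is
        // only created when a record is actually written to it.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}
```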

 

  • When you use LazyOutputFormat as the output format, Hadoop does not create an output file until something is written to it.

 

  • Now suppose that you would like to write the output in two different formats: one that will be used as the input of another MapReduce job (SequenceFileOutputFormat) and one in plain text (TextOutputFormat). How can you do it?

 

    • In the job :

 

    • and in the reducer:
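Both original snippets are missing. The following is only a sketch of how the two parts could fit together with MultipleOutputs: one named output per format, registered in the job, and one write per output in the reducer. The class name and the output names "text" and "seq" are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoFormatReducer extends Reducer<Text, Text, Text, Text> {

    // Job side: call this from run() to register one named output per format.
    static void addOutputs(Job job) {
        MultipleOutputs.addNamedOutput(job, "text",
                TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "seq",
                SequenceFileOutputFormat.class, Text.class, Text.class);
    }

    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    // Reducer side: write the same record once to each named output.
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(value.toString());
        }
        Text joined = new Text(sb.toString());
        outputs.write("text", key, joined); // files named text-r-00000 etc.
        outputs.write("seq", key, joined);  // files named seq-r-00000 etc.
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        outputs.close();
    }
}
```

The sequence file keeps the keys and values as Text objects, so a follow-up job can read it with SequenceFileInputFormat without parsing text lines again.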

Author: Nizar Ellouze
