Tutorial 1: Using Combiner

  • We will begin with a simple Hadoop job. Suppose that we have some big files in which each line contains a temperature reading, and we want to find the maximum and the minimum.


  • I can hear you asking: why MapReduce, when I can do this with a sequential Java program? OK, but how long would that program take to produce the result from a file larger than 4 GB, for example?


  • Let’s start with the Hadoop installation by following this link.


  • Then start the Hadoop daemons by invoking these scripts:


  • OK, one last step before starting: you need to copy the input files from your local file system into HDFS, and create some directories in HDFS before copying.


  • So download the two input files (they are small files, just for testing): download link


  • After that, create the target path in HDFS by invoking: hdfs dfs -mkdir -p /training/lab1/inputs/


  • Then copy them to HDFS by invoking a command like this: hdfs dfs -copyFromLocal <local-input-files> /training/lab1/inputs/

For example, if you downloaded the files into Downloads/lab1/inputs/, then the command line should be: hdfs dfs -copyFromLocal ~/Downloads/lab1/inputs/* /training/lab1/inputs/


  • Now that everything is set up, let’s start coding. First, create a Job class that extends the Configured class and implements the Tool interface. This class gives the job all the information it needs: the input format, the output format, the mapper, the reducer, the output key and value types of the mapper and the reducer, and so on.


  • Now let’s have a look at the mapper. Before digging into the code, a short explanation will help.


  • In our case, the role of the mapper is to filter the data and prepare it for the reducers.
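
To make the "filter and prepare" role concrete, here is a minimal sketch of the map step written in plain Java, without any Hadoop classes, so it can be run anywhere. It is not the tutorial's actual mapper: the class name, the single shared key "temperature", and the assumption that each line holds one whole-number temperature are all hypothetical choices made for illustration. In a real job the same logic would sit inside the map method of a Hadoop Mapper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch of the map step; every name here is hypothetical.
class TemperatureMapSketch {
    // One shared key, so that all readings end up in the same reduce call.
    static final String KEY = "temperature";

    // Parses one input line and emits (KEY, value).
    // Blank or malformed lines are dropped: this is the "filter" part.
    static void map(String line, List<Map.Entry<String, Integer>> out) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        try {
            // Assumption: one whole-number temperature per line.
            out.add(new SimpleEntry<>(KEY, Integer.parseInt(trimmed)));
        } catch (NumberFormatException e) {
            // Not a number: skip the line instead of failing the whole job.
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : new String[] {"12", " -3 ", "", "n/a", "27"}) {
            map(line, out);
        }
        // Only the three valid readings are emitted.
        System.out.println(out);
    }
}
```

Note that the map function sees one line at a time and keeps no memory of earlier lines, which is why it can only emit values and cannot decide on its own which one is the minimum or the maximum.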



  • Now let’s have a look at the reducer. The reducer’s input key and value types (KeyInputFormat and ValueInputFormat) must be the same as the mapper’s output key and value types (KeyOutputFormat and ValueOutputFormat).


  • In our case, the role of the reducer is to compute the minimum and the maximum.
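
As a rough illustration of the reduce step, the sketch below computes the minimum and maximum over all the values that arrive for one key. Again this is plain Java with no Hadoop classes, and the class name, the method signature, and the "min"/"max" output labels are assumptions made for the example, not the tutorial's actual reducer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hadoop-free sketch of the reduce step; every name here is hypothetical.
class MinMaxReduceSketch {
    // Receives all values emitted for one key and returns the labelled extremes.
    // Returns an empty map if no value was received.
    static Map<String, Integer> reduce(Iterable<Integer> values) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        boolean seen = false;
        for (int value : values) {
            seen = true;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        if (seen) {
            result.put("min", min);
            result.put("max", max);
        }
        return result;
    }

    public static void main(String[] args) {
        // Values as they might arrive from several mappers.
        System.out.println(reduce(Arrays.asList(12, -3, 27, 5, 31, 0)));
    }
}
```

The values are consumed in a single pass, so the reducer never needs to hold the whole list in memory.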


  • Export the jar as a runnable jar and specify MinMaxJob as the main class, then open a terminal and run the job by invoking: hadoop jar nameOfTheJar.jar


  • For example, if you name the jar lab1.jar, then the command line will be: hadoop jar lab1.jar


  • You can have a look at the result by invoking: hdfs dfs -cat /training/lab1/output/part-r-00000


  • Now that we have coded our first lab, let’s have a look at how to optimize it. In the previous job, we have two files, and each file is processed by one mapper (if the size of a file is greater than 128 MB, the file is split into parts and each part is processed by its own mapper).
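
To get a feeling for how many mappers a file produces, here is a small back-of-the-envelope sketch. It assumes that the split size equals the 128 MB mentioned above and simply rounds up. It is only a rough estimate: it ignores details such as Hadoop tolerating a slightly oversized last split, and the class and method names are hypothetical.

```java
// Rough estimate of the number of map tasks for one file; names are hypothetical.
class SplitEstimateSketch {
    // Assumption: split size equals the 128 MB figure used in the text.
    static final long SPLIT_SIZE_BYTES = 128L * 1024 * 1024;

    // Ceiling division: every started 128 MB part gets its own mapper.
    static long estimateMappers(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            throw new IllegalArgumentException("file size must be positive");
        }
        return (fileSizeBytes + SPLIT_SIZE_BYTES - 1) / SPLIT_SIZE_BYTES;
    }

    public static void main(String[] args) {
        long fourGigabytes = 4L * 1024 * 1024 * 1024;
        System.out.println("4 GB file -> about " + estimateMappers(fourGigabytes) + " mappers");
        System.out.println("small test file -> " + estimateMappers(10_000) + " mapper");
    }
}
```

With these assumptions, a 4 GB file is handled by about 32 mappers working in parallel, while each of our small test files gets exactly one.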


  • Each mapper writes (key, value) pairs that are transferred over the network to the reducers. What about limiting the amount of data to transfer? Can we, for example, compute the min and max values within each part?


    • Yes, we can. But not inside the mapper, because the map function is called once per line, and from a single line we cannot know whether its value is the min or the max.
      That is where the Combiner comes into action: its role is to perform some processing on the mapper output to reduce the amount of data to transfer (in our case, computing the local min and max).
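
The effect of a combiner can be sketched in plain Java, without Hadoop. In the hypothetical example below, two lists stand in for the output of two mappers; the combine step collapses each list to its local min and max, and only those values are "sent" to the reduce step. All names and sample numbers are made up for illustration, and this is not the tutorial's actual combiner class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hadoop-free sketch of map output -> combine -> reduce; names are hypothetical.
class CombinerSketch {
    // Combiner: keeps only the local min and local max of one mapper's values.
    // Its output has the same type as its input, so the reducer can consume it unchanged.
    static List<Integer> combine(List<Integer> mapOutput) {
        if (mapOutput.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(Collections.min(mapOutput), Collections.max(mapOutput));
    }

    // Reducer: global {min, max} over whatever values arrive (expects at least one).
    static int[] reduce(List<Integer> values) {
        return new int[] {Collections.min(values), Collections.max(values)};
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = Arrays.asList(12, -3, 27, 8); // values from part 1
        List<Integer> mapper2 = Arrays.asList(5, 31, 0);      // values from part 2

        // Only the combined values cross the "network".
        List<Integer> transferred = new ArrayList<>();
        transferred.addAll(combine(mapper1));
        transferred.addAll(combine(mapper2));

        int[] result = reduce(transferred);
        System.out.println("values transferred: " + transferred.size() + " instead of 7");
        System.out.println("min=" + result[0] + " max=" + result[1]);
    }
}
```

This works because taking the min (or max) of local mins (or maxes) gives the same answer as taking it over all the values. That property matters: Hadoop treats the combiner as an optional optimization and may run it zero, one, or several times, so the final result must not depend on it.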


    • And of course, we need to tell the job about the combiner:


Author: Ayman Ben Amor
