Tutorial 2 : Hadoop Map Reduce Global variable

Objective :

Write a MapReduce program that searches for occurrences of a given string in a large file. You might think a simple grep command would do the job, and for a small file it would. But when the file is very large, a single machine takes too much time, which is where MapReduce comes in.

 

  • First of all, you need to ensure that you have successfully installed Hadoop on your machine. Check this link if you need to know how to install it.

 

  • Then start the Hadoop daemons by invoking these scripts:

            start-dfs.sh

            start-yarn.sh

  • Ok, one last step before starting: you need to copy the input files into HDFS, creating some directories in HDFS before copying. Download the two input files (they are small files, just for testing): download link
  • After that, create the paths in HDFS by invoking:

            hdfs dfs -mkdir -p /training/lab2/inputs/

 

  • After that, copy the files to HDFS with a command like:

            hdfs dfs -copyFromLocal <localPathOfFiles> /training/lab2/inputs/

 

  • For example, if you downloaded the files into Downloads/lab2/inputs, then the command line should be:

            hdfs dfs -copyFromLocal ~/Downloads/lab2/inputs/* /training/lab2/inputs/

 

  • Now that everything is set up, let’s start coding. First, create a Job class that extends Configured (so it picks up the configuration from the installation files, “core-site.xml”, etc.) and implements Tool (so you can invoke your job from the command line via the hadoop jar command). This class gives the job its information: the input format, the output format, the mapper, the reducer, the key and value output types of the mapper and reducer, and so on.
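A driver along these lines might look like the following sketch. The class names (WordSearchJob, WordSearchMapper, WordSearchReducer) are illustrative assumptions, not the tutorial’s actual sources; the HDFS paths are the ones created above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver class: extends Configured to inherit the cluster
// configuration, implements Tool so it can be run via "hadoop jar".
public class WordSearchJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word-search");
        job.setJarByClass(WordSearchJob.class);

        job.setMapperClass(WordSearchMapper.class);    // assumed mapper class
        job.setCombinerClass(WordSearchReducer.class); // combiner cuts shuffle traffic
        job.setReducerClass(WordSearchReducer.class);  // assumed reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path("/training/lab2/inputs/"));
        FileOutputFormat.setOutputPath(job, new Path("/training/lab2/output/"));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordSearchJob(), args));
    }
}
```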

 

  • Now let’s have a look at the mapper; its role is to search for the word in each line of input.
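A minimal mapper for this task might look like the sketch below. The class name WordSearchMapper and the hard-coded search word are assumptions (in practice the word could be passed through the job Configuration):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: for each input line, count how many times the
// searched word appears and emit (word, countInThisLine).
public class WordSearchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final String SEARCH_WORD = "hadoop"; // assumption: the word to search for

    // Counts whole-token occurrences of word in line.
    static int countOccurrences(String line, String word) {
        int count = 0;
        for (String token : line.split("\\s+")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int count = countOccurrences(value.toString(), SEARCH_WORD);
        // The reducer sums these per-line partial counts.
        context.write(new Text(SEARCH_WORD), new IntWritable(count));
    }
}
```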

 

 

  • If the file is too large, it will be split into multiple blocks, and the number of mappers will be equal to the number of blocks. So, to minimize the transfer of data, we can write a combiner.

  • Now let’s have a look at the reducer; its role is simply to sum the values, that’s it:
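A sketch of such a reducer follows; the class name WordSearchReducer is an assumption. Because it just sums, the same class can also serve as the combiner:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: adds up the per-line counts emitted by the mapper
// to produce the total number of occurrences for the searched word.
public class WordSearchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        context.write(key, new IntWritable(total));
    }
}
```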

 

 

  • Note: instead of writing our own reducer, we could use Hadoop’s built-in IntSumReducer<Key> and use Text as the key.
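Using the built-in reducer only requires wiring in the driver; a sketch (the wrapper class and method are illustrative, not part of the Hadoop API):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Illustrative helper: configure a job to use Hadoop's built-in summing
// reducer instead of a hand-written one.
class ReducerWiring {
    static void useBuiltInReducer(Job job) {
        job.setReducerClass(IntSumReducer.class); // sums IntWritable values per key
        job.setOutputKeyClass(Text.class);        // the searched word as key
        job.setOutputValueClass(IntWritable.class);
    }
}
```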

 

  • Note: also, to minimize the transfer of data, when a line contains no occurrence of the search word our mapper writes (“searchingWord”, 0). It would be better to write nothing in this case, so we should add a check before emitting.
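The check described in this note could look like the fragment below, placed inside the mapper’s map() method; countOccurrences and SEARCH_WORD are assumed helpers, not part of the Hadoop API:

```java
// Inside a hypothetical mapper's map() method.
int count = countOccurrences(value.toString(), SEARCH_WORD);
if (count > 0) {
    // Only emit when the word actually appears, so lines without a
    // match contribute nothing to the shuffle.
    context.write(new Text(SEARCH_WORD), new IntWritable(count));
}
```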

  • Now, after coding, export the jar as a runnable jar and specify your job class (the one that implements Tool) as the main class. Then open a terminal and run the job by invoking: hadoop jar nameOfTheJar.jar

For example, if you name the jar lab2.jar, the command line will be: hadoop jar lab2.jar. Then have a look at the result by invoking: hdfs dfs -cat /training/lab2/output/part-r-00000

Author: Ayman Ben Amor
