Tutorial 4 : Hadoop Custom Input Format

In each city in Tunisia, we have a temperature sensor that sends data to a server, where each record is stored in a file. Unfortunately, the structure of the data is not the same in all cities. For example, in Sfax each record is stored as (year month day sfax temperature), e.g. (1950 4 30 sfax 30), while in Sousse each record is stored as (sousse temperature year month day).

Our objective is to calculate the average temperature of the two cities for each day.

Challenge:

How can I deal with data that is not structured in the same way?
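Before touching any Hadoop code, it helps to see that both layouts can be normalized to the same (date, temperature) pair. The sketch below is illustrative only; the class and method names are assumptions, not taken from the tutorial's downloadable source:

```java
// Hypothetical sketch: normalizing the two record layouts to one shape.
// Names (RecordFormats, parseSfax, parseSousse) are illustrative assumptions.
public class RecordFormats {

    // Sfax layout: "year month day sfax temperature", e.g. "1950 4 30 sfax 30"
    static String[] parseSfax(String line) {
        String[] f = line.trim().split("\\s+");
        return new String[] { f[0] + "-" + f[1] + "-" + f[2], f[4] };
    }

    // Sousse layout: "sousse temperature year month day", e.g. "sousse 25 1950 4 30"
    static String[] parseSousse(String line) {
        String[] f = line.trim().split("\\s+");
        return new String[] { f[2] + "-" + f[3] + "-" + f[4], f[1] };
    }

    public static void main(String[] args) {
        String[] a = parseSfax("1950 4 30 sfax 30");
        String[] b = parseSousse("sousse 25 1950 4 30");
        System.out.println(a[0] + " " + a[1]); // 1950-4-30 30
        System.out.println(b[0] + " " + b[1]); // 1950-4-30 25
    }
}
```

A custom input format lets us push exactly this per-layout parsing into the record reader, so the rest of the job never sees the difference.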

  • So first of all, you need to ensure that you have successfully installed Hadoop on your machine. Check this link if you need to know how to install it.

 

  • Then you should start the Hadoop daemons by invoking these scripts:

             start-dfs.sh

             start-yarn.sh

 

  • Ok, one last step before starting: you need to create some directories in HDFS and then copy the input files into them.

So download the two input files (they are small files just for testing) : download link

 

  • After that, create the paths in HDFS by invoking: hdfs dfs -mkdir -p /training/lab4/inputs/

 

  • Then copy the files to HDFS by invoking a command like this: hdfs dfs -copyFromLocal <localPathOfFiles> /training/lab4/inputs/

 

  • For example, if you downloaded the files into ~/Downloads/lab4/inputs/, then the command should be: hdfs dfs -copyFromLocal ~/Downloads/lab4/inputs/* /training/lab4/inputs/

 

  • Now that everything is set up, let's start coding. First, you should create a Job class that extends Configured (so it picks up the configuration from the installation files, "core-site.xml", etc.) and implements Tool (so you can invoke your job from the command line via the hadoop jar command). In this class you give the job its input format, output format, mapper, reducer, the output key and value types of the mapper and reducer, and so on.
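A minimal sketch of such a driver class is shown below. The class names (MeanTempJob, SfaxInputFormat, SousseInputFormat, DateWritable, TemperatureWritable) and the input file names are assumptions, since the original code listing is not reproduced here; MultipleInputs is the standard Hadoop way to attach a different input format to each file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver; class and file names are illustrative assumptions.
public class MeanTempJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "mean temperature");
        job.setJarByClass(MeanTempJob.class);

        // One input format (and hence one record reader) per file layout.
        MultipleInputs.addInputPath(job, new Path("/training/lab4/inputs/sfax.txt"),
                SfaxInputFormat.class, TemperatureMapper.class);
        MultipleInputs.addInputPath(job, new Path("/training/lab4/inputs/sousse.txt"),
                SousseInputFormat.class, TemperatureMapper.class);

        job.setMapOutputKeyClass(DateWritable.class);
        job.setMapOutputValueClass(TemperatureWritable.class);
        job.setOutputKeyClass(DateWritable.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setReducerClass(TemperatureReducer.class);

        FileOutputFormat.setOutputPath(job, new Path("/training/lab4/output/"));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MeanTempJob(), args));
    }
}
```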

 

  • Now let's have a look at how to add a custom key and value. To create a custom key you should implement WritableComparable, and to create a custom value you should implement Writable. Let's start with the value:
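A minimal sketch of such a value class, assuming the name TemperatureWritable (the original listing is not reproduced here). Writable only requires the two serialization methods:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value class; the name TemperatureWritable is an assumption.
public class TemperatureWritable implements Writable {
    private double temperature;

    public TemperatureWritable() {}                       // Hadoop needs a no-arg constructor
    public TemperatureWritable(double t) { temperature = t; }

    public double get() { return temperature; }

    @Override
    public void write(DataOutput out) throws IOException { // serialize
        out.writeDouble(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize, same order
        temperature = in.readDouble();
    }

    @Override
    public String toString() { return String.valueOf(temperature); }
}
```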

 

Then the key should implement WritableComparable, because Hadoop uses its compareTo method to sort keys during the shuffle step:
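A sketch of such a key, assuming a DateWritable class holding the (year, month, day) triple (the name is an assumption). Besides write/readFields, it adds compareTo for shuffle-time sorting, plus equals/hashCode so records for the same day land in the same reducer call:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key class; the name DateWritable is an assumption.
public class DateWritable implements WritableComparable<DateWritable> {
    private int year, month, day;

    public DateWritable() {}                               // required no-arg constructor
    public DateWritable(int y, int m, int d) { year = y; month = m; day = d; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year); out.writeInt(month); out.writeInt(day);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt(); month = in.readInt(); day = in.readInt();
    }

    // Hadoop calls this to sort keys during the shuffle step.
    @Override
    public int compareTo(DateWritable o) {
        int c = Integer.compare(year, o.year);
        if (c == 0) c = Integer.compare(month, o.month);
        if (c == 0) c = Integer.compare(day, o.day);
        return c;
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof DateWritable) && compareTo((DateWritable) o) == 0;
    }

    @Override
    public int hashCode() { return (year * 31 + month) * 31 + day; } // used by the partitioner

    @Override
    public String toString() { return year + " " + month + " " + day; }
}
```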

Now let's have a look at how to write a custom input format. You need to extend FileInputFormat<KeyType, ValueType> (where KeyType implements WritableComparable and ValueType implements Writable) and override the createRecordReader method:
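With the key and value types above (whose names are assumptions), the input format itself is short; its only job is to hand the framework a record reader:

```java
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical input format for the Sfax layout; names are assumptions.
public class SfaxInputFormat extends FileInputFormat<DateWritable, TemperatureWritable> {

    @Override
    public RecordReader<DateWritable, TemperatureWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // The reader does the actual parsing of each line into (key, value).
        return new SfaxRecordReader();
    }
}
```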

The CustomInputFormat will use a CustomRecordReader that turns the file in HDFS into (key, value) records. The framework drives the reader through a fixed lifecycle: the initialize method is called first; then, while there is input left (getProgress has not reached 1), the framework calls nextKeyValue followed by getCurrentKey() and getCurrentValue() for each record; finally, the close() method is called.
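The lifecycle above can be sketched as a record reader for the Sfax layout. This version delegates the line reading to Hadoop's own LineRecordReader and only does the per-layout parsing itself; the class name and field order are assumptions based on the format described earlier:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical record reader for the Sfax layout; names are assumptions.
public class SfaxRecordReader extends RecordReader<DateWritable, TemperatureWritable> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private DateWritable key;
    private TemperatureWritable value;

    @Override  // called once, before any records are read
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        lineReader.initialize(split, context);
    }

    @Override  // called repeatedly; returns false when the split is exhausted
    public boolean nextKeyValue() throws IOException {
        if (!lineReader.nextKeyValue()) return false;
        // Sfax layout: "year month day sfax temperature"
        String[] f = lineReader.getCurrentValue().toString().trim().split("\\s+");
        key = new DateWritable(Integer.parseInt(f[0]),
                               Integer.parseInt(f[1]),
                               Integer.parseInt(f[2]));
        value = new TemperatureWritable(Double.parseDouble(f[4]));
        return true;
    }

    @Override public DateWritable getCurrentKey() { return key; }
    @Override public TemperatureWritable getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }

    @Override  // called once, after the last record
    public void close() throws IOException { lineReader.close(); }
}
```

A SousseRecordReader would differ only in which fields it pulls the date and temperature from.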

In the same way, you can check the source code of SousseInputFormat.

Now the TemperatureMapper will not split the data; it just forwards the key and value without any parsing:
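Since the record readers already produce parsed (key, value) pairs, the mapper can be an identity pass-through; a sketch, using the assumed type names from above:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity mapper; type names are assumptions.
public class TemperatureMapper
        extends Mapper<DateWritable, TemperatureWritable, DateWritable, TemperatureWritable> {

    @Override
    protected void map(DateWritable key, TemperatureWritable value, Context context)
            throws IOException, InterruptedException {
        // The record reader already parsed the line; nothing left to do here.
        context.write(key, value);
    }
}
```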

And in the reducer, we simply calculate the mean of the temperatures:
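A sketch of such a reducer, again under the assumed type names. For each date key, it averages all temperatures that arrived for that day:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical averaging reducer; type names are assumptions.
public class TemperatureReducer
        extends Reducer<DateWritable, TemperatureWritable, DateWritable, DoubleWritable> {

    @Override
    protected void reduce(DateWritable key, Iterable<TemperatureWritable> values,
                          Context context) throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (TemperatureWritable t : values) {
            sum += t.get();
            count++;
        }
        // One averaged temperature per date.
        context.write(key, new DoubleWritable(sum / count));
    }
}
```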

 

  • Now, after coding, export the project as a runnable jar with MinMaxJob as the main class, then open a terminal and run the job by invoking: hadoop jar <nameOfTheJar.jar>. For example, if you name the jar lab4.jar, the command will be: hadoop jar lab4.jar

 

  • Have a look at the result by invoking: hdfs dfs -cat /training/lab4/output/part-r-00000

 

Author: Nizar Ellouze
