
Tutorial 0 : Hadoop Map Reduce Partitioner
- At the beginning we will start with a simple Hadoop job. Suppose that we have a big file that contains many words separated by white space, and we want to count the number of appearances of each word. In addition, we want the words starting with [A-L] to go to the first output part and the others to the second part.
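- For example, with a toy input line such as "Apple mango apple lemon" (a made-up sample, not the downloadable input), the first output part would contain apple 2 and lemon 1, while the second part would contain mango 1.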
- Let’s start with the Hadoop installation by following this link .
- Then you should start the Hadoop daemons by invoking these scripts:
start-dfs.sh
start-yarn.sh
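- To check that the daemons are actually up, you can list the running Java processes with jps:
jps
- You should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact list depends on your installation).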
- OK, one last step before starting: you need to copy the input files into HDFS, and create some directories in HDFS before copying.
- So download the two input files (small files, just for testing) : download link
- After that, create the paths in HDFS by invoking:
hdfs dfs -mkdir -p /training/lab0/inputs/
- Then, copy them to HDFS by invoking a command like this:
hdfs dfs -copyFromLocal ... /training/lab0/inputs/
For example, if you downloaded the files into ~/Downloads/lab0/inputs/, then the command line should be:
hdfs dfs -copyFromLocal ~/Downloads/lab0/inputs/* /training/lab0/inputs/
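- You can verify that the files landed in HDFS with:
hdfs dfs -ls /training/lab0/inputs/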
- First you should create a Job class that extends the Configured class and implements the Tool interface. This class gives the job all its configuration: the input format, the output format, the mapper, the reducer, the key and value output types of the mapper and reducer, etc.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCountJob");

        // Give the job the name of the main class
        job.setJarByClass(WordCountJob.class);

        // Specify the input format, which determines the key and value
        // types of the mapper inputs.
        job.setInputFormatClass(TextInputFormat.class);

        // By specifying TextOutputFormat, each line of the output file will be
        // key.toString(), a tab (\t), then value.toString(), for example: word 14
        job.setOutputFormatClass(TextOutputFormat.class);

        // Specify the input paths in HDFS
        TextInputFormat.setInputPaths(job, new Path(args[0]));

        // Specify the output path in HDFS; if we run a job and give it an
        // output path that already exists, the job will fail
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // Give the job the name of the mapper class
        job.setMapperClass(WordCountMapper.class);

        // Give the job the name of the reducer class
        job.setReducerClass(WordCountReducer.class);

        // Give the job the name of the partitioner class
        job.setPartitionerClass(WordCountPartitioner.class);

        // Give the job the number of reducers
        // The first one will treat the words in [A,L]
        // The second one will treat the others
        job.setNumReduceTasks(2);

        // Set the key output type
        job.setOutputKeyClass(Text.class);

        // Set the value output type
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCountJob(), new String[] {
                "hdfs://localhost:9000/training/lab0/inputs/*",
                "hdfs://localhost:9000/training/lab0/output/" });
        System.exit(exitCode);
    }
}
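- As noted in the comments above, the job fails if the output path already exists. If you want to re-run the job, you can remove the previous output first (assuming the output path used in this tutorial):
hdfs dfs -rm -r /training/lab0/output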
- Let’s understand how the mapper works.
- In our case the role of the mapper is to write 1 as the value for each word (as the key).
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {

        // Split the line we received in the value by white space
        String[] words = value.toString().split(" ");

        for (int i = 0; i < words.length; i++) {
            // Write each word as the key with a 1 as the value
            context.write(new Text(words[i].toLowerCase()), ONE);
        }
    }
}
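- As an illustration (using a made-up input line "Apple banana apple"), the mapper would emit the pairs (apple, 1), (banana, 1) and (apple, 1).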
- Now let’s have a look at the partitioner. It should extend Partitioner<MapperOutputKeyType, MapperOutputValueType>.
- In our case its role is to forward each key from the mapper to a specific reducer.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Get the first char of the word
        char decider = key.toString().toUpperCase().charAt(0);
        char A = 'A';
        char L = 'L';
        // If the first char is in [A,L] then go to the first reducer
        if ((A <= decider) && (decider <= L)) {
            return 0;
        // Else go to the second reducer
        } else {
            return 1;
        }
    }
}
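- For example, the key hadoop starts with 'H', which is in [A,L], so getPartition returns 0 and the key goes to the first reducer; the key word starts with 'W', so getPartition returns 1 and it goes to the second reducer.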
- Now let’s have a look at the reducer. The key and value input types of the reducer must match the key and value output types of the mapper.
- In our case the role of the reducer is to sum the values for each word (key).
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        // Sum the values
        for (IntWritable value : values) {
            sum = sum + value.get();
        }
        // Write the word followed by the sum
        context.write(key, new IntWritable(sum));
    }
}
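- For example, for the key apple with the grouped values [1, 1], the reducer writes apple followed by a tab and 2.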
- Export the project as a runnable jar and specify WordCountJob as the main class, then open a terminal and run the job by invoking:
hadoop jar nameOfTheJar.jar
- For example, if you give the jar the name lab0.jar, then the command line will be:
hadoop jar lab0.jar
- You can have a look at the result by invoking:
hdfs dfs -ls /training/lab0/output
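- Each reducer writes its own part file (with two reducers, typically named part-r-00000 and part-r-00001). To print their contents you can use, for example:
hdfs dfs -cat /training/lab0/output/part-r-00000
hdfs dfs -cat /training/lab0/output/part-r-00001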