Tutorial 3 : Hadoop Map Reduce Multiple Output

05 Dec

by Nizar Ellouze

in Hadoop labs

The inverted index problem is one of the earliest and most common uses of MapReduce. The job in this tutorial takes a set of <key, value> pairs and inverts the index, so that each value becomes a key.

First of all, make sure that Hadoop is installed correctly on your machine. Check this link if you need to know how to install it.

Then start the Hadoop daemons by invoking these scripts:

start-dfs.sh

start-yarn.sh

One last step before starting: you need to copy the input files into the HDFS instance running on your local machine, and create the target directories in HDFS before copying.

So download the two input files (they are small files just for testing) : download link

After that, create the directories in HDFS by invoking: hdfs dfs -mkdir -p /training/lab3/inputs/

Copy the files to hdfs by executing : hdfs dfs -copyFromLocal <localPathOfFiles> /training/lab3/inputs/

For example, if you downloaded the files into Downloads/lab3/inputs/, then the command line should be: hdfs dfs -copyFromLocal ~/Downloads/lab3/inputs/* /training/lab3/inputs/

Let's start. First, create a job class that extends Configured (so that it picks up the configuration from the installation files such as core-site.xml) and implements Tool (so that you can invoke the job from the command line with the hadoop jar command). This class gives the job its settings: the input format, the output format, the mapper, the reducer, the key and value types produced by the mapper and the reducer, and so on.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KeywordsJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: KeywordsJob <input path> <output path>");
            return -1;
        }

        // getConf() is inherited from Configured, which is why the class extends Configured
        Configuration conf = getConf();

        // Each input line holds a key and a value separated by a tab.
        // "\t" is already Hadoop's default separator, so this line is optional;
        // it is set here to make the expected input layout explicit.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "TP3");
        // Tell Hadoop which jar to ship by pointing it at the main class
        job.setJarByClass(KeywordsJob.class);

        // KeyValueTextInputFormat suits our data: the website URL is the key
        // and the comma-separated list of words found on the website is the value.
        // With this input format, the mapper's input key and input value are both Text.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Write the results as plain text lines
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input path(s) in HDFS
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output directory in HDFS. It must not exist yet:
        // the job fails if it is given an output directory that already exists.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register the mapper class
        job.setMapperClass(KeywordsMapper.class);
        // Register the reducer class
        job.setReducerClass(KeywordsReducer.class);
        // The reducer can also serve as the combiner here, because joining
        // already-joined lists with commas gives the same final result
        job.setCombinerClass(KeywordsReducer.class);

        // The job's output key and value are Text. The map output types are not
        // set separately, so they default to these job output types as well.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Run the job and wait for it to finish
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        String inputPath = "hdfs://localhost:9000/training/lab3/inputs/*";
        String outputPath = "hdfs://localhost:9000/training/lab3/output";

        int exitCode = ToolRunner.run(new KeywordsJob(), new String[] { inputPath, outputPath });
        System.exit(exitCode);
    }
}
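
To make the separator setting more concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of how a line is cut into a key and a value at the first tab, which is how KeyValueTextInputFormat is documented to behave. The class name KeyValueSplitSketch, the method splitLine and the sample line are made up for illustration; this is not Hadoop's own code.

```java
class KeyValueSplitSketch {

    // Splits a line at the FIRST occurrence of the separator.
    // If the separator is absent, the whole line becomes the key and the value is empty.
    public static String[] splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        // Hypothetical input line: a URL, a tab, then comma-separated keywords
        String[] kv = splitLine("http://example.com\thadoop,hdfs,yarn", '\t');
        System.out.println("key   = " + kv[0]);
        System.out.println("value = " + kv[1]);
    }
}
```

Only the first tab counts, so any further tab stays inside the value.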

Now let's have a look at the mapper. Its role is to invert the key and the value: it receives a URL as the key and a comma-separated list of keywords as the value, and for each keyword it writes the keyword as the key and the URL as the value.

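    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KeywordsMapper extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text url, Text keywords, Context context)
            throws IOException, InterruptedException {

            // Emit one (keyword, url) pair for each keyword in the comma-separated list
            for (String keyword : keywords.toString().split(",")) {
                context.write(new Text(keyword), url);
            }
        }
    }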
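
If you want to try the inversion logic without a cluster, the map step can be sketched in plain Java. This is only an illustration of the idea, not Hadoop code: the class name MapStepSketch, the pair representation (a two-element String array) and the sample URL are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class MapStepSketch {

    // Mirrors the mapper's logic: one (keyword, url) pair per keyword.
    public static List<String[]> map(String url, String keywords) {
        List<String[]> pairs = new ArrayList<>();
        for (String keyword : keywords.split(",")) {
            pairs.add(new String[] { keyword, url });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical record: the URL "site1.com" with two keywords
        for (String[] pair : map("site1.com", "hadoop,hdfs")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Each keyword becomes a key of its own, and the URL it came from travels along as the value.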

After that, Hadoop performs the shuffle: it groups all (key, value) pairs that share the same key into (key, [value 1, value 2, …, value n]) and passes each group to a reducer.
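
The grouping done by the shuffle can be pictured with a small plain-Java sketch. It is a simplification under stated assumptions: the class name ShuffleSketch and the sample pairs are invented, and everything happens in one in-memory map, whereas Hadoop sorts and transfers the data between machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {

    // Groups values by key. A TreeMap keeps the keys sorted,
    // similar to how a reducer receives its keys in sorted order.
    public static Map<String, List<String>> group(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] { "hadoop", "site1.com" });
        pairs.add(new String[] { "hdfs", "site1.com" });
        pairs.add(new String[] { "hadoop", "site2.com" });
        System.out.println(group(pairs));
    }
}
```

In a real job the order of the values inside one group is not guaranteed, so do not rely on it.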

Each reducer (there may be several) then takes a key, concatenates its values separated by commas, and writes the result to HDFS.

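    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KeywordsReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

            // Join all URLs for this keyword, each followed by a comma
            StringBuilder sb = new StringBuilder();
            for (Text value : values) {
                sb.append(value.toString()).append(',');
            }
            // Drop the trailing comma
            if (sb.length() > 0) {
                sb.setLength(sb.length() - 1);
            }
            context.write(key, new Text(sb.toString()));
        }
    }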
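
The job class registers the reducer as a combiner too. A short plain-Java sketch shows why that is safe for comma-joining: joining partial results gives the same string as joining everything at once. The class name ReduceStepSketch and the sample URLs are assumptions for illustration, and the sketch ignores the ordering differences a real cluster may introduce.

```java
import java.util.Arrays;
import java.util.List;

class ReduceStepSketch {

    // Mirrors the reducer's logic: join all values with commas.
    public static String reduce(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        // Without a combiner: the reducer sees every URL directly.
        String direct = reduce(Arrays.asList("site1.com", "site2.com", "site3.com"));

        // With a combiner: each map task pre-joins its own URLs first,
        // then the reducer joins the partial results.
        String partA = reduce(Arrays.asList("site1.com", "site2.com"));
        String partB = reduce(Arrays.asList("site3.com"));
        String combined = reduce(Arrays.asList(partA, partB));

        System.out.println(direct.equals(combined)); // prints "true"
    }
}
```

A combiner is only an optimization: Hadoop may run it zero, one or several times, so the final result must not depend on it.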

Once the code is written, export it as a runnable jar and specify KeywordsJob as the main class. Then open a terminal and run the job by invoking:

hadoop jar <nameOfTheJar.jar>. For example, if you named the jar lab3.jar, then the command line will be: hadoop jar lab3.jar

Have a look at the result by invoking:

hdfs dfs -cat /training/lab3/output/part-r-00000

Now let's improve the code a little by changing the name of the output file. With a named output called "anyName", the reducer output goes to files such as anyName-r-00000. Add the line below to the job class:

    MultipleOutputs.addNamedOutput(job, "anyName", TextOutputFormat.class, Text.class, Text.class);

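And in the reducer, instead of calling context.write(key, new Text(sb.toString()));, write through a MultipleOutputs<Text, Text> field named multipleOutputs. Create it in the reducer's setup() method with new MultipleOutputs<>(context) and close it in cleanup(), otherwise the output may never be flushed. If KeywordsReducer is still registered as the combiner, remove the setCombinerClass call, because a combiner has to emit its results with context.write so that they reach the reducer.

    multipleOutputs.write("anyName", key, new Text(sb.toString()));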

If you run this job, you will still get an empty file named part-r-00000. Can you get rid of it? Yes: replace the job.setOutputFormatClass(TextOutputFormat.class) call in the job class with:

    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

When you use LazyOutputFormat as the output format, Hadoop does not create an output file until you write something to it.
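
The idea of lazy creation can be sketched in plain Java with an ordinary local file. This is not how LazyOutputFormat is implemented internally; the class name LazyFileSketch and the temporary directory are assumptions made only to illustrate "create the file on the first write".

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LazyFileSketch implements AutoCloseable {

    private final Path path;
    private Writer writer; // stays null until the first write

    public LazyFileSketch(Path path) {
        this.path = path;
    }

    // Opens (and thereby creates) the file only when the first record arrives.
    public void write(String record) throws IOException {
        if (writer == null) {
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(record);
        writer.write('\n');
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-sketch");
        Path unused = dir.resolve("part-r-00000");
        try (LazyFileSketch out = new LazyFileSketch(unused)) {
            // nothing is written to this output
        }
        System.out.println(Files.exists(unused)); // prints "false": the file was never created
    }
}
```

An eager writer would open the file in its constructor and leave an empty file behind, which is the empty part-r-00000 seen above.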

Now suppose that you would like to write the output in two different formats: one that another MapReduce job will use as its input (SequenceFileOutputFormat), and one as plain text (TextOutputFormat). How do you do it?

- In the job :

    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, Text.class, Text.class);

- and in the reducer:

    multipleOutputs.write("text", key, new Text(sb.toString()));
    multipleOutputs.write("seq", key, new Text(sb.toString()));

Tags:

hadoop,map reduce,multiple output

Tutorial 3 : Hadoop Map Reduce Multiple Output