Tutorial 3: Hadoop MapReduce Multiple Outputs

29 Apr

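The inverted index problem is one of the oldest and most common uses of MapReduce.

An index inverter job (KeywordsJob in this tutorial) takes a set of <key, value> pairs and inverts it, so that each value becomes a key. Here the input keys are website URLs and the values are comma-separated keyword lists, so the output maps each keyword to the URLs where it appears.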
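
Before bringing Hadoop in, it may help to see the inversion on a tiny in-memory example. The sketch below is plain Java with no Hadoop dependency; the class name InvertedIndexSketch and the sample URLs are made up for illustration and are not part of the tutorial's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the tutorial's Hadoop code.
final class InvertedIndexSketch {

    // Turns url -> "kw1,kw2" entries into keyword -> [url, ...] entries.
    static Map<String, List<String>> invert(Map<String, String> keywordsByUrl) {
        // TreeMap keeps the keywords sorted, much like reducer input keys are sorted.
        Map<String, List<String>> urlsByKeyword = new TreeMap<>();
        for (Map.Entry<String, String> entry : keywordsByUrl.entrySet()) {
            for (String keyword : entry.getValue().split(",")) {
                urlsByKeyword
                    .computeIfAbsent(keyword, k -> new ArrayList<>())
                    .add(entry.getKey());
            }
        }
        return urlsByKeyword;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("a.com", "java,hadoop"); // made-up sample data
        input.put("b.com", "hadoop");
        System.out.println(invert(input)); // prints {hadoop=[a.com, b.com], java=[a.com]}
    }
}
```

The MapReduce job in the rest of this tutorial does the same thing, except that the inner loop becomes the mapper and the grouping is done by Hadoop's shuffle.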

First, make sure Hadoop is correctly installed on your machine.

Then start the Hadoop daemons by running these scripts:

start-dfs.sh
start-yarn.sh

One last step before we start: create the input directory in HDFS and copy the input files into it.

Download these two input files (they are small files, just for testing).

Then create the directory in HDFS by running:

hdfs dfs -mkdir -p /training/lab3/inputs/

After that, copy the files to HDFS by running:

hdfs dfs -copyFromLocal <localPathOfFiles> /training/lab3/inputs/

For example, if you downloaded the files to ~/Downloads/lab3/inputs, the command is:

hdfs dfs -copyFromLocal ~/Downloads/lab3/inputs/* /training/lab3/inputs/

Now that everything is set up, let's start coding.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KeywordsJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: KeywordsJob <input path> <output path>");
            return -1;
        }

        // getConf() is inherited from Configured, which is why this class extends it.
        Configuration conf = getConf();

        // Each input line holds a key and a value separated by a tab.
        // Tab is already Hadoop's default separator; it is set here only to be explicit.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "TP3");
        // Tell Hadoop which jar to ship by pointing it at the driver class.
        job.setJarByClass(KeywordsJob.class);

        // KeyValueTextInputFormat fits this data: the website URL is the key and the
        // comma-separated list of words found on that site is the value.
        // With this input format, the mapper's input key and value are both Text.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Write the results as plain text lines.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input path(s) in HDFS.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output path in HDFS. It must not exist yet: the job fails if it does.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(KeywordsMapper.class);
        job.setReducerClass(KeywordsReducer.class);
        // The reducer can double as a combiner: its input and output types match,
        // and joining already-joined partial results gives the same list.
        job.setCombinerClass(KeywordsReducer.class);

        // The output key class defaults to LongWritable, so it must be set to Text.
        // The output value class already defaults to Text; it is set for clarity.
        // The map output types default to these job output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Run the job and wait for it to finish.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        String inputPath = "hdfs://localhost:9000/training/lab3/inputs/*";
        String outputPath = "hdfs://localhost:9000/training/lab3/output";

        int exitCode = ToolRunner.run(new KeywordsJob(), new String[] { inputPath, outputPath });
        System.exit(exitCode);
    }
}
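
To make the role of the separator concrete, here is a plain-Java sketch of how a line is cut into a key and a value when KeyValueTextInputFormat is used. It only mimics the behaviour as I understand it (split at the first separator; a line without a separator becomes a key with an empty value); the class KeyValueLineSketch is hypothetical and is not Hadoop's own record reader.

```java
// Hypothetical helper that mimics, in plain Java, how a line becomes a key and a value.
final class KeyValueLineSketch {

    // Returns {key, value}; the split happens at the FIRST separator only.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator found: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

For a line such as a.com, a tab, then java,hadoop, this yields the key a.com and the value java,hadoop, which is exactly what the mapper receives as its two Text arguments.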

Now let's look at the mapper, whose role is to invert the key and the value. The mapper receives a URL as the key and a comma-separated list of keywords as the value. For each keyword, it writes the keyword as the output key and the URL as the output value.

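    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KeywordsMapper extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text key, Text values, Context context)
            throws IOException, InterruptedException {

            // key is the URL, values is its comma-separated keyword list.
            // Emit one (keyword, URL) pair per keyword.
            for (String value : values.toString().split(",")) {
                context.write(new Text(value), key);
            }
        }
    }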
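
The mapper emits each token exactly as it appears between commas, so " hadoop" (with a leading space) and "hadoop" would end up as two different keys. If the input files may contain spaces or empty entries, which is an assumption about the data and not something the tutorial states, the tokens could be cleaned first. A plain-Java sketch with a hypothetical helper name:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing one way to clean the keyword list before emitting pairs.
final class KeywordSplitSketch {

    // Splits on commas, trims each token and skips empty ones.
    static List<String> keywords(String commaSeparated) {
        List<String> result = new ArrayList<>();
        for (String token : commaSeparated.split(",")) {
            String keyword = token.trim();
            if (!keyword.isEmpty()) {
                result.add(keyword);
            }
        }
        return result;
    }
}
```

For example, keywords("java, hadoop,,big data ") returns [java, hadoop, big data]. In the mapper, the loop would then iterate over this list instead of the raw split result.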

After that, Hadoop performs the shuffle: it groups all (key, value) pairs that share the same key into (key, value 1, value 2, … value n) and passes each group to a reducer.

Each reducer (there can be several) then concatenates the values of a key, separated by commas, and writes the key and the concatenated values to HDFS.

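    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KeywordsReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

            // Join every URL received for this keyword with commas.
            StringBuilder sb = new StringBuilder();
            for (Text value : values) {
                sb.append(value).append(',');
            }
            // Drop the trailing comma (reduce is only called with at least one value).
            sb.setLength(sb.length() - 1);
            context.write(key, new Text(sb.toString()));
        }
    }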
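
The driver also registers this reducer as the combiner. That is only safe if reducing partial results and then reducing them again gives the same answer as reducing everything at once. The plain-Java sketch below checks that property for a comma join, assuming the reducer joins its values with commas; CombinerSketch and the sample URLs are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical check that a comma join can be applied in two stages.
final class CombinerSketch {

    // Same joining logic as the reducer, on plain strings.
    static String join(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        // Reducer alone: sees all three URLs.
        String direct = join(Arrays.asList("a.com", "b.com", "c.com"));

        // Combiner first: each map task joins its own URLs...
        String partial1 = join(Arrays.asList("a.com", "b.com"));
        String partial2 = join(Arrays.asList("c.com"));
        // ...then the reducer joins the partial results.
        String combined = join(Arrays.asList(partial1, partial2));

        System.out.println(direct.equals(combined)); // prints true
    }
}
```

Note that Hadoop does not guarantee the order in which values reach the reducer, so the URLs may come out in a different order from run to run; only the set of URLs per keyword is stable.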

Once the code is written, export the project as a runnable jar with KeywordsJob as the main class. Then open a terminal and run the job with: hadoop jar <nameOfTheJar.jar>. For example, if you named the jar lab3.jar, the command is: hadoop jar lab3.jar

View the result by running: hdfs dfs -cat /training/lab3/output/part-r-00000

Now let's refine the code a little by changing the name of the output file. Add the line below to the job class:

    MultipleOutputs.addNamedOutput(job, "anyName", TextOutputFormat.class, Text.class, Text.class);

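In the reducer, multipleOutputs is a MultipleOutputs<Text, Text> field that is created in setup() and closed in cleanup(). Because the driver also registers KeywordsReducer as the combiner, remove the setCombinerClass call (or keep a separate class that still uses context.write for it) once the reducer writes through MultipleOutputs. Then, instead of calling context.write(key, new Text(sb.toString()));

use: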

multipleOutputs.write("anyName", key, new Text(sb.toString()));
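
Since the line above relies on a multipleOutputs field that the tutorial does not show, here is a sketch of what the whole reducer might look like. It assumes the named output "anyName" was registered in the job class as shown earlier, and it needs the Hadoop client libraries on the classpath; treat it as one possible layout, not the author's exact code.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch of a reducer that writes through MultipleOutputs.
// Not suitable as a combiner: it no longer emits anything through context.write.
public class KeywordsReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        // One MultipleOutputs instance per reduce task.
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join the URLs with commas, adding the separator before every value but the first.
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(value);
        }
        // "anyName" must match the name passed to MultipleOutputs.addNamedOutput.
        multipleOutputs.write("anyName", key, new Text(sb.toString()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes the named outputs; without it the files may be incomplete.
        multipleOutputs.close();
    }
}
```

With this in place, the results should land in files named after the output, such as anyName-r-00000, inside the job's output directory.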

If you run this job, you will also get an empty file named part-r-00000. Can you get rid of it? Yes, use:

    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

When LazyOutputFormat is used as the output format, Hadoop does not create an output file unless something is written to it.
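
The idea behind this lazy behaviour can be shown without Hadoop. The sketch below is a hypothetical plain-Java writer that only creates its file when the first record arrives; it illustrates the concept and is not how LazyOutputFormat is implemented.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class illustrating the idea only; it is not Hadoop's LazyOutputFormat.
final class LazyFileWriterSketch implements AutoCloseable {

    private final Path path;
    private Writer writer; // stays null until the first write

    LazyFileWriterSketch(Path path) {
        this.path = path;
    }

    void write(String line) throws IOException {
        if (writer == null) {
            // The file is created here, on the first record, not in the constructor.
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(line);
        writer.write('\n');
    }

    @Override
    public void close() throws IOException {
        // Nothing to close (and no file on disk) if write was never called.
        if (writer != null) {
            writer.close();
        }
    }
}
```

In the job above, the default output never receives a record because everything goes to the named output, which is why no part-r-00000 file appears once the output format is lazy.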

Now suppose we want to write the output in two different formats: one that will be consumed by another MapReduce job (SequenceFileOutputFormat) and one as plain text (TextOutputFormat).

How do we do that?

In the job class:

    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, Text.class, Text.class);

And in the reducer:

    multipleOutputs.write("text", key, new Text(sb.toString()));
    multipleOutputs.write("seq", key, new Text(sb.toString()));

Author: nizell
