
I am new to MapReduce applications. I'm simply trying to find each word's length in my data set, categorize the words as tiny, little, med, or huge according to their length, and in the end see the total number of tiny, little, med, or huge words in my data set, in Java. But I have a problem implementing the reducer: when I execute the jar file on the Hadoop cluster, it doesn't return any result. If somebody could give me a hand, I would be grateful. Here is the reducer code that I am trying to execute, which I guess has lots of errors.

public class WordSizeReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    IntWritable tin, smal, mediu, bi;
    int t, s, m, b;
    int count;
    Text tiny, small, medium, big;

    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{

        for (IntWritable val : values) {
            if (val.get() == 1) {
                tin.set(t);
                t++;
            }
            else if (2 <= val.get() && val.get() <= 4) {
                smal.set(s);
                s++;
            }
            else if (5 <= val.get() && val.get() <= 9) {
                mediu.set(m);
                m++;
            }
            else if (10 <= val.get()) {
                bi.set(b);
                b++;
            }
        }
        context.write(tiny, tin);
        context.write(small, smal);
        context.write(medium, mediu);
        context.write(big, bi); 
    }
}

public class WordSizeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);     
        }
    }
}

2 Answers


tiny, small, medium and big are never initialised, so they will be null.

This means that all your context.write() calls are using a null key.

Clearly, this is no good, since you won't be able to distinguish between the counts for the different word sizes.

Even worse, tin, smal, mediu, bi are never initialised, which will cause NullPointerExceptions when you try to call set() on them (you initialise result correctly, but then never use it).

(Also, you don't need to set the IntWritables repeatedly within your loop over the values; just update t, s, m, b, then set the IntWritables once at the end, before the context.write() calls.)
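For instance, the fields could be initialised once, up front (a minimal sketch; the label strings are illustrative choices, not necessarily the names you want):

    // Output keys: one label per word-size bucket
    private final Text tiny = new Text("tiny");
    private final Text small = new Text("small");
    private final Text medium = new Text("medium");
    private final Text big = new Text("big");
    // Reusable output values; set them once per reduce() call, after the loop
    private final IntWritable tin = new IntWritable();
    private final IntWritable smal = new IntWritable();
    private final IntWritable mediu = new IntWritable();
    private final IntWritable bi = new IntWritable();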

Update, now that the mapper code has been added:

For each word in the input, you are writing key-value pairs (length, 1).

The reducer will collect all the values with the same key, so it will be called with, for example:

(2, [1,1,1,1,1,1,1,1])
(3, [1,1,1])

So your reducer will only ever see the value '1', which it is incorrectly treating as a word length. Actually, the key is the word length.
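That is, the reduce logic should branch on the key and sum the values. A rough sketch of the idea (local variable names are hypothetical):

    int length = key.get();           // the word length is the key
    int countForLength = 0;
    for (IntWritable val : values) {
        countForLength += val.get();  // each value is a 1 emitted by the mapper
    }
    // then add countForLength to the tiny/small/medium/big bucket based on length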

Update, now that the stack trace has been added:

The error message explains what is wrong - Hadoop cannot find your job classes, so they are not being executed at all. The error says:

java.lang.ClassNotFoundException: WordSize.WordsizeMapper

but your class is called WordSizeMapper (or maybe WordSize.WordSizeMapper if you have an outer class) - note the different capitalisation of "size"/"Size"! You need to check how you are invoking Hadoop.
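For example, assuming the classes live in a package named WordSize and the jar is named wordsize.jar (both placeholders here), the job would be submitted with the fully qualified, correctly capitalised main class:

    hadoop jar wordsize.jar WordSize.WordSizeTest <in> <out>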

DNA
  • Oops, you're right. I've changed the points you mentioned, but I think I have a problem with the logic; I couldn't figure out how to fix the code. – Sarlken Konig Jan 31 '15 at 23:37
  • Each of `tiny`, `small`, `medium` and `big` need to be set to different text, e.g. `Text tiny = new Text("tiny");` (Also, it's best to avoid updating the code in your question, so that answers still make sense when people read them later!) – DNA Jan 31 '15 at 23:42
  • You might also want to add some logging so that you can see what is going on. – DNA Jan 31 '15 at 23:45
  • I've restored the code to its first version. I also changed the text initialisation, but it doesn't make any difference on the cluster. I have also added the mapper class to the question. – Sarlken Konig Feb 01 '15 at 00:00

Still no luck. I have also checked my code and made some fixes, but the result is the same: in the Hadoop terminal window, I cannot get any result. The latest version of the code is below:

public class WordSizeTest {
    public static void main(String[] args) throws Exception{
        if(args.length != 2)
        {
            System.err.println("Usage: Word Size <in> <out>");
            System.exit(2);
        } 
        Job job = new Job();    
        job.setJarByClass(WordSizeTest.class); 
        job.setJobName("Word Size");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordSizeMapper.class); 
        job.setReducerClass(WordSizeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class); 
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
public class WordSizeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    final static IntWritable one = new IntWritable(1);
    IntWritable wordLength = new IntWritable();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);     
        }
    }
}
public class WordSizeReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable>{
    IntWritable tin = new IntWritable();
    IntWritable smal = new IntWritable();
    IntWritable mediu = new IntWritable();
    IntWritable bi = new IntWritable();
    int t, s, m, b;
    Text tiny = new Text("tiny");
    Text small = new Text("small");
    Text medium = new Text("medium");
    Text big = new Text("big");
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{        
        for (IntWritable val:values){
            if(key.get() == 1){
                t += val.get();                         
                }
            else if(2<=key.get() && key.get()<=4){
                s += val.get();             
                }
            else if(5<=key.get() && key.get()<=9){
                m += val.get();             
                }
            else if(10<=key.get()){
                b += val.get();             
                }

        }
        tin.set(t); 
        smal.set(s);
        mediu.set(m);
        bi.set(b);
        context.write(tiny, tin);
        context.write(small, smal);
        context.write(medium, mediu);
        context.write(big, bi); 
    }
}

The error in the terminal is like this:

15/02/01 12:09:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/01 12:09:25 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/01 12:09:25 INFO input.FileInputFormat: Total input paths to process : 925
15/02/01 12:09:25 WARN snappy.LoadSnappy: Snappy native library is available
15/02/01 12:09:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/02/01 12:09:25 INFO snappy.LoadSnappy: Snappy native library loaded
15/02/01 12:09:29 INFO mapred.JobClient: Running job: job_201501191143_0177
15/02/01 12:09:30 INFO mapred.JobClient:  map 0% reduce 0%
15/02/01 12:09:47 INFO mapred.JobClient: Task Id : attempt_201501191143_0177_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: WordSize.WordsizeMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:859)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(AccessController.java:310)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: WordSize.WordsizeMapper
    at java.lang.Class.forName(Class.java:174)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:812)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
    ... 8 more
15/02/01 12:09:49 INFO mapred.JobClient: Task Id : attempt_201501191143_0177_m_000000_0, Status : FAILED