
I'm working on a block of code that currently uses the following structure:

HashMap(Text, HashMap(Text, ArrayList(Ints)))

HashMapWritable<Text, ArrayListOfIntsWritable> fileMap =
    new HashMapWritable<Text, ArrayListOfIntsWritable>();
HashMapWritable<Text, HashMapWritable<Text, ArrayListOfIntsWritable>> wordMap =
    new HashMapWritable<Text, HashMapWritable<Text, ArrayListOfIntsWritable>>(); 

And I'm getting weird results but having trouble pinpointing why.

if (!wordMap.containsKey(newText)) {
  ArrayListOfIntsWritable wordPosition = new ArrayListOfIntsWritable();
  wordPosition.add(c);
  fileMap.put(INPUTFILE, wordPosition);
  wordMap.put(newText, fileMap);                    
} else {
  HashMapWritable<Text, ArrayListOfIntsWritable> updatePosStep1 =
      wordMap.get(newText);
  ArrayListOfIntsWritable updatePosStep2 = updatePosStep1.get(INPUTFILE);
  updatePosStep2.add(c);
}

I've also tried doing the update in a single statement:

wordMap.get(newText).get(INPUTFILE).add(c);

but that gave the same results.

This is all done in a loop, and the output below shows what's happening. The example is the case where 'newText' = 'episod': curPos is the current position in the loop (a basic for loop incrementing c), and the [int,int,...] lists are the values of c that have been stored so far.

 Word: episod curPos 14 positions: [14]
 Word: episod curPos 120 positions: [116, 118, 120]
 Word: episod curPos 191 positions: [186, 190, 191]
 Word: episod curPos 199 positions: [198, 199]

As you can see (hopefully it shows what I'm trying to get across), the values for the key episod get reset at some point before each line is printed. The same happens with all words, so when it's finished running, all words have the same few sets of integers.

Is it something obvious I'm doing wrong?

Philipp Reichart
rogy
  • What classes are ArrayListOfIntsWritable and HashMapWritable? – Kayaman Feb 02 '14 at 12:16
  • Writable extension of a Java HashMap. This generic class supports the use of any type as either key or value. For a feature vector, HMapKIW, HMapKFW, and a family of related classes provides a more efficient implementation. There are a number of key differences between this class and Hadoop's MapWritable: MapWritable is more flexible in that it supports heterogeneous elements. In this class, all keys must be of the same type and all values must be of the same type. This assumption allows a simpler serialization protocol and thus is more efficient. – rogy Feb 02 '14 at 12:21
  • Writable extension of the ArrayListOfInts class. This class provides an efficient data structure to store a list of ints for MapReduce jobs. (https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/util/array/ArrayListOfInts.java) – rogy Feb 02 '14 at 12:23
  • Is `Text` a mutable class (so the `hashCode()` might produce different results and therefore store a new key/value pair into the map)? If so, you should replace it with an immutable one as [mutable keys shouldn't be used for keys](http://stackoverflow.com/questions/7842049/are-mutable-hashmap-keys-a-dangerous-practice). Probably all you have to do is to override `hashCode()`, `equals(Object)` and maybe `toString()` of `Text` – Roman Vottner Feb 02 '14 at 12:40
  • That was my first thought, Roman, but it doesn't reinitialise the word if it sees it again. I did some logging to double-check, and once a word has been instantiated, it moves on to the 'else' branch, so it's something to do with the nested .add (I think) – rogy Feb 02 '14 at 12:44
  • So the values are actually coming from a map-reduce? Don't the nodes create their own instance of a `Text` object which they return with the position of the occurrence of this text within a segment/page? The example, at least to me, looks like either the value is really overwritten in case `Text` produces the same hash-value for the same words (then you should check your hash-map implementation) or new values are added to the map as they are different objects with different hash-values. – Roman Vottner Feb 02 '14 at 12:51
  • The hash values from the `Text` objects are the same so 'the same' word isn't added to the map repeatedly, I'm not entirely sure what you mean in the other case? – rogy Feb 02 '14 at 13:18

1 Answer

It's confusing that you use the externally created object 'fileMap' in the 'true' branch, while in the 'false' branch you fetch 'updatePosStep1' from the map. My guess is that you're sharing the same instance of 'fileMap' across words, or recreating/resetting 'fileMap' on some condition before the code shown.

So you might accidentally end up with extra values in 'wordPosition', and 'wordPosition' is recreated for every new 'newText'. Creating a fresh 'fileMap' inside the 'true' branch avoids both:

if (!wordMap.containsKey(newText)) {
  // First occurrence of this word: give it its own inner map and position list.
  final HashMapWritable<Text, ArrayListOfIntsWritable> fileMap =
      new HashMapWritable<Text, ArrayListOfIntsWritable>();
  final ArrayListOfIntsWritable wordPosition = new ArrayListOfIntsWritable();
  wordPosition.add(c);
  fileMap.put(INPUTFILE, wordPosition);
  wordMap.put(newText, fileMap);
} else {
  // Word already seen: append to the list stored for this word only.
  final HashMapWritable<Text, ArrayListOfIntsWritable> fileMap =
      wordMap.get(newText);
  final ArrayListOfIntsWritable wordPosition = fileMap.get(INPUTFILE);
  wordPosition.add(c);
}
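
To make the aliasing visible outside Hadoop, here is a minimal sketch that reproduces the symptom with plain `java.util` collections standing in for `HashMapWritable` and `ArrayListOfIntsWritable`. The class name, method names, file name and sample words are all hypothetical, and the stand-in types are an assumption: the Cloud9 classes may differ in detail, but the shared-reference behaviour is ordinary Java. The second method shows one possible alternative using `Map.computeIfAbsent`, which is not the code from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedInnerMapDemo {
    // Hypothetical file key, standing in for INPUTFILE in the question.
    static final String INPUT_FILE = "doc1.txt";

    // Mirrors the question: ONE inner map is created outside the loop and
    // stored under every word, so all words point at the same object.
    public static Map<String, Map<String, List<Integer>>> indexShared(String[] words) {
        Map<String, List<Integer>> fileMap = new HashMap<>();
        Map<String, Map<String, List<Integer>>> wordMap = new HashMap<>();
        for (int c = 0; c < words.length; c++) {
            String word = words[c];
            if (!wordMap.containsKey(word)) {
                List<Integer> positions = new ArrayList<>();
                positions.add(c);
                // Overwrites the list that every earlier word also sees.
                fileMap.put(INPUT_FILE, positions);
                wordMap.put(word, fileMap);
            } else {
                wordMap.get(word).get(INPUT_FILE).add(c);
            }
        }
        return wordMap;
    }

    // One possible fix: each word lazily gets its own inner map and list.
    public static Map<String, Map<String, List<Integer>>> indexPerWord(String[] words) {
        Map<String, Map<String, List<Integer>>> wordMap = new HashMap<>();
        for (int c = 0; c < words.length; c++) {
            wordMap.computeIfAbsent(words[c], k -> new HashMap<>())
                   .computeIfAbsent(INPUT_FILE, k -> new ArrayList<>())
                   .add(c);
        }
        return wordMap;
    }

    public static void main(String[] args) {
        String[] words = {"episod", "pilot", "episod", "season", "episod"};
        // Shared inner map: the list is reset whenever a new word appears.
        System.out.println(indexShared(words).get("episod").get(INPUT_FILE));  // [3, 4]
        // Per-word inner map: every position of "episod" is kept.
        System.out.println(indexPerWord(words).get("episod").get(INPUT_FILE)); // [0, 2, 4]
    }
}
```

In the shared version, each first occurrence of any word replaces the list inside the single inner map, which is why the stored positions appear to restart shortly before each printed line in the question's output.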
vlasov
  • Oh, you lovely person. You are absolutely right, it makes sense now: the reuse of fileMap threw the whole lot off. You have saved my day from being a total waste – rogy Feb 02 '14 at 20:11