20

I am working with a TreeMap of Strings, TreeMap<String, String>, and using it to implement a Dictionary of words.

I then have a collection of files, and would like to create a representation of each file in the vector space (space of words) defined by the dictionary.

Each file should have a vector representing it with following properties:

  • vector should have same size as dictionary
  • for each word contained in the file the vector should have a 1 in the position corresponding to the word position in dictionary
  • for each word not contained in the file the vector should have a -1 in the position corresponding to the word position in dictionary

So my idea is to use a Vector<Boolean> to implement these vectors. (This way of representing documents in a collection is called Boolean Model - http://www.site.uottawa.ca/~diana/csi4107/L3.pdf)
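
The three properties above can be sketched roughly as follows. This is only an illustration, not the asker's code: the class name BooleanModelSketch, the method toVector and the sample words are assumptions. It uses an int[] because a Vector<Boolean> can hold only true/false, not 1 and -1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class BooleanModelSketch {
    // Builds the +1/-1 vector for one file; position i corresponds to the
    // i-th key of the dictionary in sorted order.
    static int[] toVector(TreeMap<String, String> dictionary, List<String> fileWords) {
        Set<String> present = new HashSet<String>(fileWords);
        int[] vector = new int[dictionary.size()];
        int i = 0;
        for (String word : dictionary.keySet()) { // keys are visited in sorted order
            vector[i++] = present.contains(word) ? 1 : -1;
        }
        return vector;
    }

    public static void main(String[] args) {
        TreeMap<String, String> dictionary = new TreeMap<String, String>();
        for (String w : new String[] {"dog", "fox", "lazy", "quick"}) {
            dictionary.put(w, w);
        }
        int[] v = toVector(dictionary, Arrays.asList("quick", "fox"));
        System.out.println(Arrays.toString(v)); // [-1, 1, -1, 1]
    }
}
```

Because the vector has one slot per dictionary word anyway, this version walks the whole dictionary once per file and needs no position lookup at all; a lookup only pays off if the vector is stored sparsely.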

The problem I am facing in the procedure to create this vector is that I need a way to find position of a word in the dictionary, something like this:

String key;
int i = get_position_of_key_in_Treemap(key); // purely invented method

1) Is there any method like this I can use on a TreeMap? If not, could you provide some code to help me implement it myself?

2) Is there an iterator on TreeMap (it's alphabetically ordered on keys) from which I can get the position?

3) Alternatively, should I use another class to implement the dictionary (if you think that I can't do what I need with a TreeMap)? If yes, which?

Thanks in advance.

ADDED PART:

The solution proposed by dasblinkenlight looks fine, but its cost is linear in the size of the dictionary (because the keys are copied into an array), and doing that for each file is not acceptable.

Any other ideas for my questions?

Bhesh Gurung
Matteo

8 Answers

21

Once you have constructed your tree map, copy its sorted keys into an array, and use Arrays.binarySearch to look up the index in O(logN) time. If you need the value, do a lookup on the original map too.

Edit: this is how you copy keys into an array

String[] mapKeys = new String[treeMap.size()];
int pos = 0;
for (String key : treeMap.keySet()) {
    mapKeys[pos++] = key;
}
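
The lookup step described above might look like the sketch below. The class name SortedKeyIndex and its indexOf method are assumptions made for illustration; the sketch also assumes the TreeMap uses natural String ordering (with a custom Comparator, the same comparator would have to be passed to Arrays.binarySearch).

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical wrapper, for illustration only.
class SortedKeyIndex {
    private final String[] keys; // sorted snapshot of the dictionary keys

    SortedKeyIndex(TreeMap<String, ?> dictionary) {
        // keySet() of a TreeMap iterates in ascending order, so the array is sorted.
        this.keys = dictionary.keySet().toArray(new String[0]);
    }

    // Returns the position of word in the dictionary, or -1 if it is absent.
    int indexOf(String word) {
        int pos = Arrays.binarySearch(keys, word);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        TreeMap<String, String> tm = new TreeMap<String, String>();
        tm.put("quick", "one");
        tm.put("brown", "two");
        tm.put("fox", "three");
        SortedKeyIndex index = new SortedKeyIndex(tm); // built once, O(N)
        System.out.println(index.indexOf("fox"));   // 1
        System.out.println(index.indexOf("zebra")); // -1
    }
}
```

The snapshot is taken once after the dictionary is complete and then reused for every file, so the O(N) copy is not repeated.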
Sergey Kalinichenko
  • `copy its sorted keys into an array` how do you do that? – Matteo Dec 14 '11 at 10:11
  • @Matteo I added an example of how it can be done to the answer. – Sergey Kalinichenko Dec 14 '11 at 10:17
  • I saw your procedure, but it has cost N (copying keys into an array), and it's not thinkable to do it for each file. Any other idea? Is there any method like this I can use on a TreeMap? Is there an iterator on TreeMap (it's alphabetically ordered on keys) of which I can get position? Should I use another class to implement dictionary? – Matteo Dec 14 '11 at 11:35
  • @Matteo You do not need to do it for each file: you do it once for your dictionary `TreeMap`, and keep that array between reading the files. P.S. I'm sorry, I did not discover your post until today, because you did not put @dasblinkenlight in front of it. – Sergey Kalinichenko Dec 21 '11 at 17:21
  • This is probably the best answer. TreeMap doesn't have an index, it *is* a Map, after all. :) You could even make your own map class that provides this behavior. Also, Vector is very...1999 ;) – Joshua Davis Dec 23 '11 at 20:32
  • @JoshuaDavis Well, you can simulate the index with tree maps by counting nodes of `headMap` (see my second answer at the bottom of the page). It works, but its efficiency is heavily dependent on the implementation. – Sergey Kalinichenko Dec 23 '11 at 20:36
  • @dashblinkenlight gave you the bounty cause idea of defining the order not on `Map` but on the `Array` was very useful!! – Matteo Dec 24 '11 at 16:06
  • @Matteo Thank you very much! You may want to check out my other solution below (the one based on `headMap`), just for the completeness of the picture. To me, finding that other solution was a valuable learning experience. Happy Holidays! – Sergey Kalinichenko Dec 24 '11 at 16:30
5

An alternative solution would be to use TreeMap's headMap method. If the word exists in the TreeMap, then the size() of its head map is equal to the index of the word in the dictionary. It may be a bit wasteful compared to my other answer, though.

Here is how you code it in Java:

import java.util.*;

class Test {
    public static void main(String[] args) {
        TreeMap<String,String> tm = new TreeMap<String,String>();
        tm.put("quick", "one");
        tm.put("brown", "two");
        tm.put("fox", "three");
        tm.put("jumps", "four");
        tm.put("over", "five");
        tm.put("the", "six");
        tm.put("lazy", "seven");
        tm.put("dog", "eight");
        for (String s : new String[] {
            "quick", "brown", "fox", "jumps", "over",
            "the", "lazy", "dog", "before", "way_after"}
        ) {
            if (tm.containsKey(s)) {
                // Here is the operation you are looking for.
                // It does not work for items not in the dictionary.
                int pos = tm.headMap(s).size();
                System.out.println("Key '"+s+"' is at the position "+pos);
            } else {
                System.out.println("Key '"+s+"' is not found");
            }
        }
    }
}

Here is the output produced by the program:

Key 'quick' is at the position 6
Key 'brown' is at the position 0
Key 'fox' is at the position 2
Key 'jumps' is at the position 3
Key 'over' is at the position 5
Key 'the' is at the position 7
Key 'lazy' is at the position 4
Key 'dog' is at the position 1
Key 'before' is not found
Key 'way_after' is not found
Sergey Kalinichenko
3

https://github.com/geniot/indexed-tree-map

I had the same problem. So I took the source code of java.util.TreeMap and wrote IndexedTreeMap. It implements my own IndexedNavigableMap:

public interface IndexedNavigableMap<K, V> extends NavigableMap<K, V> {
   K exactKey(int index);
   Entry<K, V> exactEntry(int index);
   int keyIndex(K k);
}

The implementation is based on updating node weights in the red-black tree whenever it changes. The weight of a node is the number of nodes in the subtree beneath it, plus one for the node itself. For example, when a tree is rotated to the left:

private void rotateLeft(Entry<K, V> p) {
    if (p != null) {
        Entry<K, V> r = p.right;

        int delta = getWeight(r.left) - getWeight(p.right);
        p.right = r.left;
        p.updateWeight(delta);

        if (r.left != null) {
            r.left.parent = p;
        }

        r.parent = p.parent;

        if (p.parent == null) {
            root = r;
        } else if (p.parent.left == p) {
            delta = getWeight(r) - getWeight(p.parent.left);
            p.parent.left = r;
            p.parent.updateWeight(delta);
        } else {
            delta = getWeight(r) - getWeight(p.parent.right);
            p.parent.right = r;
            p.parent.updateWeight(delta);
        }

        delta = getWeight(p) - getWeight(r.left);
        r.left = p;
        r.updateWeight(delta);

        p.parent = r;
    }
}

updateWeight simply updates weights up to the root:

void updateWeight(int delta) {
    weight += delta;
    Entry<K, V> p = parent;
    while (p != null) {
        p.weight += delta;
        p = p.parent;
    }
}

And when we need to find the element by index here is the implementation that uses weights:

public K exactKey(int index) {
    if (index < 0 || index > size() - 1) {
        throw new ArrayIndexOutOfBoundsException();
    }
    return getExactKey(root, index);
}

private K getExactKey(Entry<K, V> e, int index) {
    if (e.left == null && index == 0) {
        return e.key;
    }
    if (e.left == null && e.right == null) {
        return e.key;
    }
    if (e.left != null && e.left.weight > index) {
        return getExactKey(e.left, index);
    }
    if (e.left != null && e.left.weight == index) {
        return e.key;
    }
    return getExactKey(e.right, index - (e.left == null ? 0 : e.left.weight) - 1);
}

Also comes in very handy finding the index of a key:

public int keyIndex(K key) {
    if (key == null) {
        throw new NullPointerException();
    }
    Entry<K, V> e = getEntry(key);
    if (e == null) {
        throw new NullPointerException();
    }
    if (e == root) {
        return getWeight(e) - getWeight(e.right) - 1;//index to return
    }
    int index = 0;
    int cmp;
    if (e.left != null) {
        index += getWeight(e.left);
    }
    Entry<K, V> p = e.parent;
    // split comparator and comparable paths
    Comparator<? super K> cpr = comparator;
    if (cpr != null) {
        while (p != null) {
            cmp = cpr.compare(key, p.key);
            if (cmp > 0) {
                index += getWeight(p.left) + 1;
            }
            p = p.parent;
        }
    } else {
        Comparable<? super K> k = (Comparable<? super K>) key;
        while (p != null) {
            if (k.compareTo(p.key) > 0) {
                index += getWeight(p.left) + 1;
            }
            p = p.parent;
        }
    }
    return index;
}
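
The weight idea can be seen in isolation in the stripped-down sketch below. It is not the library's code: the class RankTree is a hypothetical, unbalanced binary search tree over Strings, so it has no rotations and its operations cost O(height) rather than a guaranteed O(log N). It only shows how per-node subtree sizes yield both key-to-index and index-to-key lookups.

```java
// Hypothetical, unbalanced illustration of a size-augmented search tree.
class RankTree {
    private static final class Node {
        final String key;
        Node left, right;
        int size = 1; // number of nodes in the subtree rooted here

        Node(String key) { this.key = key; }
    }

    private Node root;

    private static int size(Node n) { return n == null ? 0 : n.size; }

    void add(String key) { root = add(root, key); }

    private static Node add(Node n, String key) {
        if (n == null) return new Node(key);
        int cmp = key.compareTo(n.key);
        if (cmp < 0) n.left = add(n.left, key);
        else if (cmp > 0) n.right = add(n.right, key);
        // Recompute from the children, so duplicate inserts leave sizes unchanged.
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // Index of key in sorted order, or -1 if the key is absent.
    int keyIndex(String key) {
        int index = 0;
        Node n = root;
        while (n != null) {
            int cmp = key.compareTo(n.key);
            if (cmp < 0) {
                n = n.left;
            } else if (cmp > 0) {
                index += size(n.left) + 1; // skip the left subtree and this node
                n = n.right;
            } else {
                return index + size(n.left);
            }
        }
        return -1;
    }

    // Key at the given position in sorted order.
    String exactKey(int index) {
        if (index < 0 || index >= size(root)) throw new IndexOutOfBoundsException();
        Node n = root;
        while (true) {
            int leftSize = size(n.left);
            if (index < leftSize) {
                n = n.left;
            } else if (index == leftSize) {
                return n.key;
            } else {
                index -= leftSize + 1;
                n = n.right;
            }
        }
    }
}
```

Unlike keyIndex above, this sketch returns -1 for a missing key instead of throwing; that is a choice made for the example, not a description of the library's behaviour.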

You can find the result of this work at https://github.com/geniot/indexed-tree-map

2

There's no such implementation in the JDK itself. Although TreeMap iterates in natural key ordering, its internal data structures are all based on trees and not arrays (remember that Maps do not order keys, by definition, even though that is a very common use case).

That said, you have to make a choice, as it is not possible to have O(1) computation time both for insertion into the Map and for the indexOf(key) calculation. This is because a key's position in lexicographical order is not stable in a mutable data structure (as opposed to insertion order, for instance). An example: once you insert the first key-value pair (entry) into the map, its position is the first one. However, depending on the second key inserted, that position might change, as the new key may be "greater" or "lower" than the one already in the Map. You can certainly implement this by maintaining and updating an indexed list of keys during the insertion operation, but then each insert costs O(n) to shift the array elements (or O(n log(n)) if you re-sort the array every time). That might be acceptable or not, depending on your data access patterns.
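
A minimal sketch of such a sorted key list, under the assumption that keys are Strings in natural order (the class name SortedKeyList is invented for this example). Each add pays O(n) in the worst case to shift elements in the backing ArrayList, while indexOf is a binary search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only.
class SortedKeyList {
    private final List<String> keys = new ArrayList<String>();

    // Inserts key at its sorted position; returns false if it is already present.
    boolean add(String key) {
        int pos = Collections.binarySearch(keys, key);
        if (pos >= 0) {
            return false;
        }
        // On a miss, binarySearch returns -(insertion point) - 1.
        keys.add(-pos - 1, key);
        return true;
    }

    // Current position of key, or -1 if absent. Positions shift as keys are added.
    int indexOf(String key) {
        int pos = Collections.binarySearch(keys, key);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        SortedKeyList list = new SortedKeyList();
        list.add("m");
        System.out.println(list.indexOf("m")); // 0
        list.add("a");
        System.out.println(list.indexOf("m")); // 1, the earlier key moved
    }
}
```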

ListOrderedMap and LinkedMap in Apache Commons both come close to what you need but rely on insertion order. You can check out their implementation and develop your own solution to the problem with little to moderate effort, I believe (that should be just a matter of replacing the ListOrderedMap's internal backing array with a sorted list - TreeList in Apache Commons, for instance).

You can also calculate the index yourself, by counting the number of elements that are lower than the given key (which should be faster than iterating through the list searching for your element, in the most frequent case - as you're not comparing anything).

lsoliveira
2

I agree with lsoliveira. Perhaps the best approach would be to use a different structure than TreeMap.

However, if you still want to go with computing the index of the keys, a solution would be to count how many keys are lower than the key you are looking for.

Here is a code snippet:

    java.util.SortedMap<String, String> treeMap = new java.util.TreeMap<String, String>();
    treeMap.put("d", "content 4");
    treeMap.put("b", "content 2");
    treeMap.put("c", "content 3");
    treeMap.put("a", "content 1");

    String key = "d"; // key to get the index for
    System.out.println( treeMap.keySet() );

    final String firstKey = treeMap.firstKey(); // assuming treeMap structure doesn't change in the mean time
    System.out.format( "Index of %s is %d %n", key, treeMap.subMap(firstKey, key).size() );
2

I'd like to thank all of you for the effort you put into answering my question. The answers were all very useful, and taking the best from each of them led me to the solution I actually implemented in my project.


What I believe to be the best answers to my individual questions are:

2) There is no iterator on TreeMap that exposes a position, as @lsoliveira says:

There's no such implementation in the JDK itself. 
Although TreeMap iterates in natural key ordering,
its internal data structures are all based on trees and not arrays
(remember that Maps do not order keys, by definition, 
in spite of that the very common use case).

and as I found in this SO answer, How to iterate over a TreeMap?, the way to iterate over the elements of a Map is to go through map.entrySet() and use the Iterator defined on that Set (or on another view that has iterators, such as keySet()).
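
For completeness, a position can still be recovered during iteration by keeping a counter alongside the iterator, as in this sketch (the class PositionWalk and method positionOf are invented names). It is linear in the size of the map, so it illustrates the point rather than solving the performance problem.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class PositionWalk {
    // Walks the entries in sorted key order and returns the position of key,
    // or -1 if the key is not in the map. Cost is O(N).
    static int positionOf(TreeMap<String, String> map, String key) {
        int pos = 0;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (entry.getKey().equals(key)) {
                return pos;
            }
            pos++;
        }
        return -1;
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<String, String>();
        map.put("d", "content 4");
        map.put("b", "content 2");
        map.put("a", "content 1");
        System.out.println(positionOf(map, "d")); // 2
    }
}
```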


3) It's possible to use a TreeMap to implement the Dictionary, but this guarantees only a complexity of O(log N) for finding the index of a contained word (the cost of a lookup in a tree data structure).

Using a HashMap with same procedure will instead have complexity O(1).


1) There exists no such method. Only solution is to implement it entirely.

As @Paul stated

Assumes that once getPosition() has been called, the dictionary is not changed.

The assumption of this solution is that once the Dictionary is created it will not be changed afterwards: in this way the position of a word will always be the same.

Given this assumption, I found a solution that builds the Dictionary with complexity O(N log N) (dominated by a single sort) and afterwards guarantees that the index of a contained word can be looked up in constant time O(1).

I defined Dictionary as a HashMap like this:

public HashMap<String, WordStruct> dictionary = new HashMap<String, WordStruct>();
  • key --> the String representing the word contained in Dictionary
  • value --> an Object of a created class WordStruct

where WordStruct class is defined like this:

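public class WordStruct {

    // Position of the word in the dictionary once it is alphabetically ordered.
    private int DictionaryPosition;

    public WordStruct() {
    }

    // Return type added; the original declaration did not compile without it.
    public void SetWordPosition(int pos) {
        this.DictionaryPosition = pos;
    }

    // Accessor added for illustration: the field is private, so callers need one.
    public int GetWordPosition() {
        return this.DictionaryPosition;
    }

}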

and allows me to keep memory of any kind of attribute I like to couple with the word entry of the Dictionary.

Now I fill dictionary iterating over all words contained in all files of my collection:

THE FOLLOWING IS PSEUDOCODE

for (int i = 0; i < number_of_files; i++) {

        get_file(i);

        while (file_contains_more_words) {

            // next_word is the next word read from the current file
            dictionary.put(next_word, new WordStruct());

        }

}

Once the HashMap is filled, in whatever order, I use the procedure indicated by @dasblinkenlight to order it once and for all, with complexity O(N log N) for the sort:

    Object[] dictionaryArray = dictionary.keySet().toArray();
    Arrays.sort(dictionaryArray);

    for(int i = 0; i < dictionaryArray.length; i++){

        String word = (String) dictionaryArray[i];
        dictionary.get(word).SetWordPosition(i);

    }

From now on, to get the position of a word in the alphabetical order of the dictionary, the only thing needed is to read its DictionaryPosition variable:

since the word is known, you just need to look it up, and this has constant cost in a HashMap.
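
Put together, the approach can be sketched end to end as below. This is a simplified variant, not the code from my project: it stores the position directly as an Integer instead of a WordStruct, and the names PositionIndex, build and vectorFor are made up for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical helper, for illustration only.
class PositionIndex {
    // Maps each distinct word to its rank in alphabetical order.
    static Map<String, Integer> build(Collection<String> words) {
        String[] sorted = new HashSet<String>(words).toArray(new String[0]);
        Arrays.sort(sorted); // the O(N log N) step, done once
        Map<String, Integer> positions = new HashMap<String, Integer>();
        for (int i = 0; i < sorted.length; i++) {
            positions.put(sorted[i], i);
        }
        return positions;
    }

    // Builds the +1/-1 vector for one file with one O(1) lookup per word.
    static int[] vectorFor(Map<String, Integer> positions, Collection<String> fileWords) {
        int[] vector = new int[positions.size()];
        Arrays.fill(vector, -1);
        for (String word : fileWords) {
            Integer pos = positions.get(word);
            if (pos != null) { // words outside the dictionary are ignored
                vector[pos] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> positions =
                build(Arrays.asList("the", "quick", "fox", "the"));
        int[] v = vectorFor(positions, Arrays.asList("the", "cat"));
        System.out.println(Arrays.toString(v)); // [-1, -1, 1]
    }
}
```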


Thanks again, and I wish you all a Merry Christmas!!

Community
Matteo
1

Have you thought to make the values in your TreeMap contain the position in your dictionary? I am using a BitSet here for my file details.

This doesn't work nearly as well as my other idea below.

Map<String,Integer> dictionary = new TreeMap<String,Integer> ();

private void test () {
  // Construct my dictionary.
  buildDictionary();
  // Make my file data.
  String [] file1 = new String[] {
    "1", "3", "5"
  };
  BitSet fileDetails = getFileDetails(file1, dictionary);
  printFileDetails("File1", fileDetails);
}

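private void printFileDetails(String fileName, BitSet details) {
  System.out.println("File: "+fileName);
  // Use the dictionary size rather than details.length(): BitSet.length()
  // stops at the highest set bit, which would drop trailing -1 entries.
  int size = dictionary.size();
  for ( int i = 0; i < size; i++ ) {
    System.out.print ( details.get(i) ? 1: -1 );
    if ( i < size - 1 ) {
      System.out.print ( "," );
    }
  }
}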

private BitSet getFileDetails(String [] file, Map<String, Integer> dictionary ) {
  BitSet details = new BitSet();
  for ( String word : file ) {
    // The value in the dictionary is the index of the word in the dictionary.
    details.set(dictionary.get(word));
  }
  return details;
}

String [] dictionaryWords = new String[] {
  "1", "2", "3", "4", "5"
};

private void buildDictionary () {
  for ( String word : dictionaryWords ) {
    // Initially make the value 0. We will change that later.
    dictionary.put(word, 0);
  }
  // Make the indexes.
  int wordNum = 0;
  for ( String word : dictionary.keySet() ) {
    dictionary.put(word, wordNum++);
  }
}

Here the building of the file details consists of a single lookup in the TreeMap for each word in the file.

If you were planning to use the value in the dictionary TreeMap for something else you could always compose it with an Integer.

Added

Thinking about it further, if the value field of the Map is earmarked for something you could always use special keys that calculate their own position in the Map and act just like Strings for comparison.

private void test () {
  // Dictionary
  Map<PosKey, String> dictionary = new TreeMap<PosKey, String> ();
  // Fill it with words.
  String[] dictWords = new String[] {
                       "0", "1", "2", "3", "4", "5"};
  for ( String word : dictWords ) {
    dictionary.put( new PosKey( dictionary, word ), word );
  }
  // File
  String[] fileWords = new String[] {
                       "0", "2", "3", "5"};
  int[] file = new int[dictionary.size()];
  // Initially all -1.
  for ( int i = 0; i < file.length; i++ ) {
    file[i] = -1;
  }
  // Temp file words set.
  Set<String> fileSet = new HashSet<String>( Arrays.asList( fileWords ) );
  for ( PosKey key : dictionary.keySet() ) {
    if ( fileSet.contains( key.getKey() ) ) {
      file[key.getPosition()] = 1;
    }
  }

  // Print out.
  System.out.println( Arrays.toString( file ) );
  // Prints: [1, -1, 1, 1, -1, 1]

}

class PosKey
    implements Comparable<PosKey> {
  final String key;
  // Initially -1
  int position = -1;
  // The map I am keying on.
  Map<PosKey, ?> map;

  public PosKey ( Map<PosKey, ?> map, String word ) {
    this.key = word;
    this.map = map;
  }

  public int getPosition () {
    if ( position == -1 ) {
      // First access to the key.
      int pos = 0;
      // Calculate all positions in one loop.
      for ( PosKey k : map.keySet() ) {
        k.position = pos++;
      }
    }
    return position;
  }

  public String getKey () {
    return key;
  }

  public int compareTo ( PosKey it ) {
    return key.compareTo( it.key );
  }

  // equals added so that it stays consistent with compareTo and hashCode.
  public boolean equals ( Object it ) {
    return it instanceof PosKey && key.equals( ( ( PosKey )it ).key );
  }

  public int hashCode () {
    return key.hashCode();
  }
}

NB: Assumes that once getPosition() has been called, the dictionary is not changed.

OldCurmudgeon
0

I would suggest that you write a SkipList to store your dictionary, since this will still offer O(log N) lookups, insertion and removal while also being able to provide an index (tree implementations generally cannot return an index, since the nodes don't know it and there would be a cost to keeping them updated). Unfortunately, the Java implementation, ConcurrentSkipListMap, does not provide an index, so you would need to implement your own version.

Getting the index of an item would be O(log N). If you wanted both the index and the value without doing 2 lookups, you would need to return a wrapper object holding both.

Trevor Freeman
  • 6,672
  • 2
  • 19
  • 39