Here is an example of what your 3 methods should be.
This is what is called an MCVE (Minimal, Complete, and Verifiable example).
The primary concept is that constructing a new combined string by calling toString()
is not the right way to implement equals()
and hashCode()
.
This is mostly for performance reasons. With 18 million objects added to the map, performance of equals()
and hashCode()
matters.
Another reason it's a bad idea, is that toString()
may not show all values of the object, or that two different objects may still have the same toString()
result, e.g. if tokens could contain spaces, then concatenating them would not be able to differentiate A B
, C
from A
, B C
.
For the example, a String[]
is used for the previousToken
field, and a varargs constructor is used to simplify testing. Null tokens not allowed. Yours is likely slightly different, but this shows the concepts.
public class NGram {
private String token;
private String[] previousToken; // could also be List<String>
public NGram(String ... allTokens) {
if (allTokens.length == 0)
throw new IllegalArgumentException("At least one token is required");
for (String s : allTokens)
if (s == null)
throw new NullPointerException();
this.token = allTokens[allTokens.length - 1];
this.previousToken = Arrays.copyOfRange(allTokens, 0, allTokens.length - 1);
}
@Override
public String toString() {
StringBuilder buf = new StringBuilder();
for (String s : this.previousToken)
buf.append(s).append(' ');
return buf.append(this.token).toString();
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null || getClass() != obj.getClass())
return false;
NGram that = (NGram) obj;
return this.token.equals(that.token) &&
Arrays.equals(this.previousToken, that.previousToken);
}
@Override
public int hashCode() {
return this.token.hashCode() + 31 * Arrays.hashCode(this.previousToken);
}
}
TEST
This shows a faster, more compact version of the code to add a new object to the map. It is faster because it only performs one lookup per add, and made more compact by using a ternary operator instead of an if
statement, combining the two put()
calls.
It's isolated in a helper method for easier testing.
public static void main(String[] args) {
Map<NGram, Integer> countMap = new HashMap<>();
add(countMap, new NGram("Foo"));
add(countMap, new NGram("Bar"));
add(countMap, new NGram("Foo", "Bar"));
add(countMap, new NGram("This", "is", "a", "test"));
add(countMap, new NGram("Bar"));
System.out.println(countMap);
}
private static void add(Map<NGram, Integer> countMap, NGram ng) {
Integer counter = countMap.get(ng);
countMap.put(ng, (counter != null ? counter + 1 : 1));
}
OUTPUT
{Bar=2, Foo=1, This is a test=1, Foo Bar=1}
As you can see, the duplicate Bar
value is counted twice.
If you end up with lots of duplicate objects, i.e. with high counter values, then using an immutable Integer
object as the counter is not great for performance. You can use an int[1]
, but a mutable helper class is better, since it can provide a good toString()
implementation.
private static void add(Map<NGram, Counter> countMap, NGram ng) {
Counter counter = countMap.get(ng);
if (counter == null)
countMap.put(ng, new Counter(1));
else
counter.increment();
}
private static final class Counter {
private int count;
public Counter(int count) {
this.count = count;
}
public void increment() {
this.count++;
}
public int get() {
return this.count;
}
@Override
public String toString() {
return Integer.toString(this.count);
}
}