Overriding hashCode() to be consistent with equals() when equals() uses a similarity metric

Question

Let's say I have a class Car with fields color and model. I need to store cars in a collection that contains no duplicates (no two identical cars). In the example below I am using a HashMap.

According to the Java documentation, if we have 2 Car objects car1 and car2 such that car1.equals(car2) == true, then it must also hold that car1.hashCode() == car2.hashCode(). So in this example, if I wanted to compare cars just by their color, I would use only the color field in equals() and hashCode(), as I did in my code, and it works perfectly fine.

import java.util.HashMap;
import java.util.Map;

public class Car {
String color;
String model;

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((color == null) ? 0 : color.hashCode());
    return result;
}

@Override
public boolean equals(Object obj) {
    if (this == obj)
        return true;
    if (obj == null)
        return false;
    if (getClass() != obj.getClass())
        return false;
    Car other = (Car) obj;
    if (color == null) {
        if (other.color != null)
            return false;
    } else if (!color.equals(other.color))
        return false;
    return true;
}

public Car(String color, String model) {
    super();
    this.color = color;
    this.model = model;
}

@Override
public String toString() {
    return color + "\t" + model;
}

public static void main(String[] args) {
    Map<Car, Car> cars = new HashMap<Car, Car>();
    Car a = new Car("red", "audi");
    Car b = new Car("red", "bmw");
    Car c = new Car("blue", "audi");
    cars.put(a, a);
    cars.put(b, b);
    cars.put(c, c);
    for(Car car : cars.keySet()) {
        System.out.println(cars.get(car));
    }

}

}

The output is:

red bmw

blue audi

as expected.

So far, so good. Now I am experimenting with other ways of comparing 2 cars. I have provided a function to measure the similarity between 2 cars. For the sake of argument, let's say I have a method double similarity(Car car1, Car car2) which returns a double value in the interval [0,1]. I consider 2 cars to be equal if their similarity is greater than 0.5. Then I override the equals method:

@Override
public boolean equals(Object obj) {
    Car other = (Car) obj;
    return similarity(this, other) > 0.5;
}

Now I don't know how to override hashCode() so that the hashCode/equals contract always holds, i.e. so that 2 equal objects always have equal hash codes.

I have been thinking of using TreeMap instead of HashMap, just to avoid overriding hashCode, because I have no idea how to do it properly. But I don't need any sorting, so TreeMap doesn't seem appropriate for this problem, and I think it would be more expensive in terms of complexity.

It would be very helpful if you could suggest either a way of overriding hashCode or a different structure that would be more appropriate for my problem.

Thank you in advance!

What defines `similarity() > 0.5`? Once we know that, then we can construct a new `hashCode()` — jlewkovich, Feb 09 '15 at 00:57
@J This was actually a simplified version, because the real project I work on is more complex. For this problem it may make no sense, but technically speaking, let's say the similarity function defines a string similarity between the colors. For example, if 2 cars have colors "blue" and "light blue" then it would return some value greater than 0.5, but if the colors were "blue" and "red" it would return 0. — giliev, Feb 09 '15 at 01:09
As your `equals` method will violate the general contract for `equals`, at heart this question is a duplicate of http://stackoverflow.com/questions/27581/what-issues-should-be-considered-when-overriding-equals-and-hashcode-in-java — Raedwald, Feb 12 '15 at 08:01

score 4 · Answer 1 · answered Feb 09 '15 at 01:08

Although sprinter has covered some of the issues with your strategy, there is a more contract-based issue with your method. According to the Javadoc,

[equals] is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true

However, x can be similar to y and y can be similar to z with x being too far from z to be similar, so your equals method doesn't work.
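To make the transitivity failure concrete, here is a minimal sketch. It assumes a hypothetical one-dimensional similarity (shades on a 0..1 scale, with similarity 1 - |x - y|) as a stand-in for the question's similarity(); the class and method names are illustrative only.

```java
// Hypothetical stand-in for the question's similarity(): shades live on a 0..1 scale
// and two shades are "similar" when 1 - |x - y| exceeds the 0.5 threshold.
class TransitivityDemo {
    static double similarity(double x, double y) {
        return 1.0 - Math.abs(x - y);
    }

    static boolean similar(double x, double y) {
        return similarity(x, y) > 0.5;
    }

    public static void main(String[] args) {
        double x = 0.0, y = 0.4, z = 0.8;
        System.out.println(similar(x, y)); // true:  similarity is about 0.6
        System.out.println(similar(y, z)); // true:  similarity is about 0.6
        System.out.println(similar(x, z)); // false: similarity is about 0.2
    }
}
```

Any threshold on a graded similarity can produce such a chain, which is why it cannot serve as an equivalence relation.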

answered Feb 09 '15 at 01:08

k_g


Even though the contractual aspects of equals are important, in this case it is best to steer the OP away from hacking the equals and hashcode methods as described in the post. It is not good design. – M.K. Feb 09 '15 at 08:30

score 4 · Answer 2 · answered Feb 09 '15 at 01:17

You should not tamper with the equals and hashcode methods this way. The Collection data structures depend on these methods and using them in the non-standard way will give unexpected behaviour.

I suggest you create a Comparator implementation which will compare two cars or implement the Comparable interface where you can use your similarity method underneath.

answered Feb 09 '15 at 01:17

M.K.


Thanks for the suggestion! The transitiveness of equals (mentioned in the other answers) will still be a problem, but I think my solution won't be affected very much, so I guess I will try out this Comparator based solution. After all it won't affect any other piece of my code. – giliev Feb 09 '15 at 12:12
If you are not sure how to implement the equals and hashcode methods, most IDEs can auto-generate these methods for your class. Eclipse and Intellij can both auto-generate them for you. – M.K. Feb 09 '15 at 12:19
Transitivity is an issue when you are dealing with extensible classes. That is when you are dealing with an inheritance hierarchy. If you are not, then the usual implementation of equals() and hashcode() will be fine for you. Read this article for more info: http://www.artima.com/lejava/articles/equality.html – M.K. Feb 09 '15 at 12:33

score 3 · Answer 3 · answered Feb 09 '15 at 00:59

There are a couple of points to make here.

The first is that this is an unusual usage of equals. In general equals is interpreted to mean that these are two instances of the same object; one can replace another without impact.

The second point is that a.equals(b) implies that a.hashCode() == b.hashCode() but not the reverse. In fact it is perfectly legal (though pointless) to have all objects return the same hash code. So in your case as long as all sufficiently similar cars return the same hash code the various collections will operate correctly.

I suspect it's more likely that you should have a separate class to represent your 'similar' concept. You can then test equality on that class, or map from each 'similar' value to a list of cars. That might be a better representation of the concept than overloading equals for cars.
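One way to read the "separate similar concept" idea is a key derived from each car, with a map from that key to the cars that share it. The sketch below assumes a hypothetical rule (the last word of the color names its family) and stores model names as plain strings to stay self-contained; neither choice comes from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ColorFamilyIndex {
    // Hypothetical normalisation: the last word of the color names its family,
    // so "light blue" and "dark blue" both map to "blue".
    static String familyOf(String color) {
        String trimmed = color.trim().toLowerCase(Locale.ROOT);
        return trimmed.substring(trimmed.lastIndexOf(' ') + 1);
    }

    // The key is an ordinary String, so equals/hashCode keep their usual contract.
    private final Map<String, List<String>> modelsByFamily = new LinkedHashMap<>();

    void add(String color, String model) {
        modelsByFamily.computeIfAbsent(familyOf(color), k -> new ArrayList<>()).add(model);
    }

    List<String> similarTo(String color) {
        return modelsByFamily.getOrDefault(familyOf(color), Collections.emptyList());
    }

    public static void main(String[] args) {
        ColorFamilyIndex index = new ColorFamilyIndex();
        index.add("blue", "audi");
        index.add("light blue", "bmw");
        index.add("red", "fiat");
        System.out.println(index.similarTo("dark blue")); // [audi, bmw]
    }
}
```

This only works when similarity really partitions cars into groups; a graded, non-transitive metric cannot be reduced to a single key.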

Matt McHenry · Answer 4 · 2015-02-09T01:55:09.167

hashCode() is just a "short cut" for equals(). It's important to make sure the scheme you're working towards makes sense for equals. Consider cars a, b, and c, where similarity(a, b) == 0.6 and similarity(b, c) == 0.6.

But what if similarity(a, c) == 0.3? Then you're in a situation where a.equals(b) and b.equals(c), but mysteriously a.equals(c) is false.

This violates the general contract of Object.equals(). When this happens, parts of the standard library like HashMap and TreeMap will suddenly start to behave very strangely.

If you're interested in plugging in different sorting schemes, you're much better off working with different Comparator<Car>s that each implement your scheme. While the same restriction applies in the Comparator API [1], it lets you represent less than and greater than, which it sounds like you're really after and which can't be done via Object.equals().

[1] If compare(a,b) == compare(b,c) == 0, then compare(a,c) must be 0 as well.
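As a rough illustration of plugging in different schemes, the sketch below passes two comparators to a TreeSet, which treats elements that compare as 0 as duplicates. The nested Car class is a minimal stand-in for the question's class, and the comparator names are made up for this example.

```java
import java.util.Comparator;
import java.util.TreeSet;

class ComparatorDemo {
    // Minimal stand-in for the question's Car; equals/hashCode are left untouched.
    static final class Car {
        final String color;
        final String model;

        Car(String color, String model) {
            this.color = color;
            this.model = model;
        }
    }

    // Two interchangeable schemes; both are proper total orders.
    static final Comparator<Car> BY_COLOR = Comparator.comparing((Car c) -> c.color);
    static final Comparator<Car> BY_COLOR_THEN_MODEL = BY_COLOR.thenComparing(c -> c.model);

    // A TreeSet keeps only one element per group of elements that compare as 0.
    static int distinctCount(Comparator<Car> order, Car... cars) {
        TreeSet<Car> set = new TreeSet<>(order);
        for (Car car : cars) {
            set.add(car);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Car a = new Car("red", "audi");
        Car b = new Car("red", "bmw");
        Car c = new Car("blue", "audi");
        System.out.println(distinctCount(BY_COLOR, a, b, c));            // 2
        System.out.println(distinctCount(BY_COLOR_THEN_MODEL, a, b, c)); // 3
    }
}
```

The notion of "duplicate" now lives in the collection's comparator rather than in Car itself, so other code that relies on Car.equals() is unaffected.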

Interesting. If I have 'a similar to b', 'b similar to c' and 'a not similar to c', then depending on the order in which the insertion is done, I may end up with a and c in the set (if I insert in the order a, b, c) or only b (if I insert b first and then a and c). I got your point. However, I think the Comparator won't help me much with this issue, except that I would avoid overriding hashCode(). After all, I should probably do some testing on my problem to see if this issue will affect my solution. Thanks! — giliev, Feb 09 '15 at 12:01
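The order dependence described in this comment can be reproduced with a small sketch. It assumes a hypothetical Shade type whose equals() is a distance threshold and whose hashCode() is constant, so that every element lands in the same bucket and HashSet always consults equals().

```java
import java.util.HashSet;
import java.util.Set;

class OrderDependenceDemo {
    // Hypothetical value type whose equals() is a similarity threshold,
    // i.e. the pattern under discussion, not a recommended design.
    static final class Shade {
        final double value;

        Shade(double value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Shade && Math.abs(value - ((Shade) o).value) < 0.5;
        }

        @Override
        public int hashCode() {
            return 0; // constant: legal, but every element shares one bucket
        }
    }

    static int sizeAfterInserting(double... values) {
        Set<Shade> set = new HashSet<>();
        for (double v : values) {
            set.add(new Shade(v));
        }
        return set.size();
    }

    public static void main(String[] args) {
        // 0.0 ~ 0.4 and 0.4 ~ 0.8, but 0.0 and 0.8 are not similar.
        System.out.println(sizeAfterInserting(0.0, 0.4, 0.8)); // 2
        System.out.println(sizeAfterInserting(0.4, 0.0, 0.8)); // 1
    }
}
```

The same three elements give sets of different sizes depending only on insertion order, which is the "strange behaviour" a non-transitive equals() produces.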

score 2 · Answer 5 · edited Jun 20 '20 at 09:12

As stated by others, your latter implementation of .equals() violates its contract. You simply cannot implement it that way. And if you stop to think about it, it makes sense, since your implementation of .equals() is not meant to return true when two objects are actually equal, but when they are similar enough. But similar enough is not the same as equal, neither in Java nor anywhere else.

Check .equals() javadocs and you'll see that any object that implements it must adhere to its contract:

The equals method implements an equivalence relation on non-null object references:

It is reflexive: for any non-null reference value x, x.equals(x) should return true.

It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.

It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.

It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.

For any non-null reference value x, x.equals(null) should return false.

Your implementation of .equals() does not fulfill this contract:

Depending on your implementation of double similarity(Car car1, Car car2), it might not be symmetric
It's clearly not transitive (well explained in previous answers)
It might not be consistent:

Consider an example slightly different than the one you gave in a comment:

'cobalt' would be equal to 'blue' while 'red' would be different to 'blue'

If you used some external source to calculate the similarity, such as a dictionary, and if one day 'cobalt' wasn't found as an entry, you might return a similarity near to 0.0, so the cars wouldn't be equal. However, the following day you realize that 'cobalt' is a special kind of 'blue', so you add it to the dictionary and this time, when you compare the same two cars, similarity is very high (or near 1.0), so they're equal. This would be an inconsistency. I don't know how your similarity function works, but if it depends on anything different than the data contained in the two objects you're comparing, you might be violating .equals() consistency constraint as well.

Regarding using a TreeMap<Car, Whatever>, I don't see how it could be of any help. From TreeMap javadocs:

...the Map interface is defined in terms of the equals operation, but a sorted map performs all key comparisons using its compareTo (or compare) method, so two keys that are deemed equal by this method are, from the standpoint of the sorted map, equal.

In other words, in a TreeMap<Car, Whatever> map, map.containsKey(car1) would return true iff car1.compareTo(car2) returned exactly 0 for some car2 that belongs to map. However, if the comparison didn't return 0, map.containsKey(car1) would return false, even though car1 and car2 were very similar in terms of your similarity function. This is because .compareTo() is meant to be used for ordering, and not for similarity.
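The following sketch shows the quoted behaviour with plain strings rather than cars: a TreeMap built with String.CASE_INSENSITIVE_ORDER finds a key whenever the comparator returns 0, regardless of equals(). The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class SortedMapEqualityDemo {
    // TreeMap decides key identity with the comparator, not with equals().
    static boolean treeMapContains(String stored, String probe) {
        Map<String, Integer> map = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        map.put(stored, 1);
        return map.containsKey(probe);
    }

    // HashMap decides key identity with hashCode() and equals().
    static boolean hashMapContains(String stored, String probe) {
        Map<String, Integer> map = new HashMap<>();
        map.put(stored, 1);
        return map.containsKey(probe);
    }

    public static void main(String[] args) {
        System.out.println(treeMapContains("blue", "BLUE"));       // true:  compare() == 0
        System.out.println(hashMapContains("blue", "BLUE"));       // false: equals() is false
        System.out.println(treeMapContains("blue", "light blue")); // false: compare() != 0
    }
}
```

A lookup succeeds only on an exact comparator match, so "nearly equal" keys such as "blue" and "light blue" are still missed.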

So the key point here is that you can't use a Map alone to suit your use case, because it's just the wrong structure. Actually, you can't use any Java structure alone that relies on .hashCode() and .equals(), because you could never find an object that matches your key.

Now, if you do want to find the car which is most similar to a given car by means of your similarity() function, I suggest you use Guava's HashBasedTable structure to build a table of similarity coefficients (or whatever other fancy name you like) between every car of your set.

This approach would need Car to implement .hashCode() and .equals() as usual (i.e. not checking just by color, and certainly without invoking your similarity() function). For instance, you could compare by a new plate number attribute of Car.

The idea is to have a table which stores the similarities between each car, with its diagonal clean, since we already know that a car is similar to itself (actually, it's equal to itself). For example, for the following cars:

Car a = new Car("red", "audi", "plate1");
Car b = new Car("red", "bmw", "plate2");
Car c = new Car("light red", "audi", "plate3");

the table would look like this:

      a       b       c

a   ----    0.60    0.95

b   0.60    ----    0.45

c   0.95    0.45    ----

For the similarity values, I'm assuming that cars of the same brand and same color family are more similar than cars of same color but different brand, and that cars of different brands and not same color are even less similar.

You might have noticed that the table is symmetric. We could have stored only half the cells if space optimization was needed. However, according to the docs, HashBasedTable is optimized to be accessed by row key, so let's keep it simple and leave further optimizations as an exercise.

The algorithm to find the car which is most similar to a given car could be sketched as follows:

Retrieve the given car's row
Return the car which is most similar to the given car within the returned row, i.e. the car of the row with the highest similarity coefficient

Here's some code showing the general ideas:

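import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Maps;
import com.google.common.collect.Table;

import java.util.Map;

public class SimilarityTest {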

    Table<Car, Car, Double> table;

    void initialize(Car... cars) {
        int size = cars.length - 1; // implicit null check
        this.table = HashBasedTable.create(size, size);
        for (Car rowCar : cars) {
            for (Car columnCar : cars) {
                if (!rowCar.equals(columnCar)) { // add only different cars
                    double similarity = this.similarity(rowCar, columnCar);
                    this.table.put(rowCar, columnCar, similarity);
                }
            }
        }
    }

    double similarity(Car car1, Car car2) {
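        // Place your similarity calculation here; it must return a value in [0, 1]
        throw new UnsupportedOperationException("similarity not implemented yet");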
    }

    Car mostSimilar(Car car) {
        Map<Car, Double> row = this.table.row(car);
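        // Typed entry so getValue()/getKey() compile; NEGATIVE_INFINITY is below any
        // coefficient, whereas Double.MIN_VALUE is the smallest positive double
        Map.Entry<Car, Double> mostSimilar = Maps.immutableEntry(car, Double.NEGATIVE_INFINITY);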
        for (Map.Entry<Car, Double> entry : row.entrySet()) {
            double mostSimilarCoefficient = mostSimilar.getValue();
            double currentCoefficient = entry.getValue();
            if (currentCoefficient > mostSimilarCoefficient) {
                mostSimilar = entry;
            }
        }
        return mostSimilar.getKey();
    }

    public static void main(String... args) {
        SimilarityTest test = new SimilarityTest();

        Car a = new Car("red", "audi", "plate1");
        Car b = new Car("red", "bmw", "plate2");
        Car c = new Car("light red", "audi", "plate3");

        test.initialize(a, b, c);

        Car mostSimilarToA = test.mostSimilar(a);
        System.out.println(mostSimilarToA); // should be c

        Car mostSimilarToB = test.mostSimilar(b);
        System.out.println(mostSimilarToB); // should be a

        Car mostSimilarToC = test.mostSimilar(c);
        System.out.println(mostSimilarToC); // should be a
    }
}

Regarding complexity... Initializing the table takes O(n²), while searching for the most similar car takes O(n). I'm pretty sure this can be improved, e.g. why put cars in the table that are known not to be similar to each other? (We could store only cars whose similarity coefficient is higher than a given threshold.) Or, instead of finding the car with the highest similarity coefficient, we could stop the search when we find a car whose similarity coefficient is higher than another given threshold, etc.
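A minimal sketch of the first improvement (storing only pairs above a threshold), written against plain java.util maps instead of Guava to stay self-contained. The class name, the generic item type, and demoSimilarity (which simply hard-codes the coefficients from the table above) are assumptions for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

class ThresholdedSimilarityIndex<T> {
    private final Map<T, Map<T, Double>> rows = new HashMap<>();

    ThresholdedSimilarityIndex(Iterable<T> items, ToDoubleBiFunction<T, T> similarity, double threshold) {
        for (T row : items) {
            for (T column : items) {
                if (row.equals(column)) {
                    continue; // keep the diagonal empty
                }
                double score = similarity.applyAsDouble(row, column);
                if (score > threshold) { // skip pairs known not to be similar
                    rows.computeIfAbsent(row, k -> new HashMap<>()).put(column, score);
                }
            }
        }
    }

    /** Returns the most similar stored item, or null if no pair passed the threshold. */
    T mostSimilar(T item) {
        T best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        Map<T, Double> row = rows.getOrDefault(item, Collections.emptyMap());
        for (Map.Entry<T, Double> entry : row.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    // Hypothetical similarity that hard-codes the example table: ab 0.60, ac 0.95, bc 0.45.
    static double demoSimilarity(String x, String y) {
        String pair = x.compareTo(y) < 0 ? x + y : y + x;
        switch (pair) {
            case "ab": return 0.60;
            case "ac": return 0.95;
            case "bc": return 0.45;
            default:   return 0.0;
        }
    }
}
```

Building the index is still quadratic in the number of items; the threshold only reduces what is stored and scanned afterwards.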

score 0 · Answer 6 · answered Feb 09 '15 at 02:11

Based on my understanding of your similarity() method, I think it may be best to keep your hashCode() function roughly the same, but instead of using color.hashCode(), create a helper method that will generate a "similar color", and use that hashCode:

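// Returns the base color name, so the return type must be String (not int),
// and strings are compared with equals() rather than ==
public String getSimilarColor(String color) {
    if (color.equals("blue") || color.equals("light blue") || color.equals("dark blue") /* add more blue colors */) {
        return "blue";
    } else if (color.equals("red") || color.equals("light red") || color.equals("dark red") /* add more red colors */) {
        return "red";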
    }
    /*
    else if(yellow...)
    else if(etc...)
    */
    else {
        return color;
    }
}

And then use it in your hashCode method:

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((color == null) ? 0 : getSimilarColor(color).hashCode());
    return result;
}

This helper method may also be useful in similarity(). If you're not comfortable hardcoding similar colors into your method, you could use some other means to generate them, like pattern matching.
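As a rough sketch of the pattern-matching variant, the helper below strips a leading modifier with a regular expression instead of listing every color. The modifier list (light, dark, pale, bright) and the class name are assumptions, not part of the question.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ColorNormalizer {
    // Hypothetical list of modifiers; extend it to match the real data.
    private static final Pattern MODIFIER = Pattern.compile("^(light|dark|pale|bright)\\s+");

    // Maps "Light Blue" and "dark blue" to "blue"; unknown colors pass through unchanged.
    static String baseColor(String color) {
        String normalized = color.trim().toLowerCase(Locale.ROOT);
        return MODIFIER.matcher(normalized).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(baseColor("Light Blue")); // blue
        System.out.println(baseColor("dark red"));   // red
        System.out.println(baseColor("green"));      // green
    }
}
```

For the hashCode/equals contract to hold, equals() would have to compare the same normalized value, which turns similarity into plain equality on the base color.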

Thanks for the advice! However, my list won't be finite, so I should try to find some more generic way to test for equality. — giliev, Feb 09 '15 at 10:17
