
I've been trying to create a nested or recursive effect with SequenceMatcher.

The final goal is to compare two sequences, both of which may contain instances of different types.

For example, the sequences could be:

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]

Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.

I'd like SequenceMatcher to be applied twice for string instances, so that "Foo" and "Fo" are considered equal, as are "Bar" and "Bak", and the longest common sub-sequence has length 3: [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.

What I tried was writing a wrapper for the built-in str class:

from difflib import SequenceMatcher
class myString:
    def __init__(self, string):
        self.string = string
    def __hash__(self):
        return hash(self.string)
    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=other.string).ratio() > 0.5

Edit: perhaps a more elegant way is:

class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5

By doing this, the following is made possible:

>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75

So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?

SequenceMatcher's documentation states the following:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

And by definition hashable elements are required to fulfill the following requirement:

Hashable objects which compare equal must have the same hash value.

In addition, do I need to override __cmp__ as well?

I'd love to hear about other solutions that come to mind.

Thanks.

geckon
YaronK

1 Answer


Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.

Overriding __hash__ so that "Foo" and "Fo" hash to the same value will cause collisions in dictionaries (hash tables) and the like. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, having __hash__ return the hash of the first two characters, hash(self[:2]), and comparing those same two characters in __eq__ might be the way to go.
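
A minimal sketch of the first-two-characters idea, assuming Python 3 and that a two-character prefix really is the only thing that should decide equality. The class name PrefixStr is made up for illustration. Equality and hash are derived from the same prefix, so objects that compare equal also hash equal, and the last few lines show the dictionary collision mentioned above.

```python
from difflib import SequenceMatcher


class PrefixStr(str):
    """Hypothetical str subclass: equality and hash both use the first two characters."""

    def __eq__(self, other):
        # Only compare against other strings; let Python handle everything else.
        if not isinstance(other, str):
            return NotImplemented
        # Slicing a str subclass yields a plain str, so this does not recurse.
        return self[:2] == other[:2]

    def __ne__(self, other):
        # str defines its own __ne__, so keep it consistent with __eq__.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Same prefix as __eq__, so equal objects always share a hash.
        return hash(self[:2])


l1 = [1, PrefixStr("Foo"), PrefixStr("Bar"), 3]
l2 = [1, PrefixStr("Fo"), PrefixStr("Bak"), 2]
ratio = SequenceMatcher(a=l1, b=l2).ratio()
print(ratio)  # 0.75: three of the four positions match

# The side effect: "Foo" and "Fo" are now the same dictionary key.
d = {PrefixStr("Foo"): "first"}
d[PrefixStr("Fo")] = "second"
print(len(d))  # 1
```

Whether that collision is acceptable depends on where else these objects end up; wrapping the strings only for the duration of the comparison keeps the effect contained.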

All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).

a p
  • Can you elaborate more on the one-off diff tool? Isn't it overkill to re-implement the matching algorithm? – geckon May 19 '15 at 14:48
  • Well, it depends on the situation and constraints. If you're looking for strict equality except in the case of strings (or all sub-iterables?), and the order is always the same, and you know where the matching should start... Or 2 of those 3, or anything else. A generalized algo isn't always the best because tailoring it to your needs can be harder than just making something new that does what you want. – a p May 19 '15 at 17:59
  • Well I have two sequences (let's say lists) of objects and I want to find the difference between them. But for comparison I don't want to use any method of the objects' class (not even `__hash__`, `__eq__`, etc.) but I want to provide a function that will be called with each object as a parameter and the returned value will be used for comparison. This "key generating function" will be written by me and can use methods of the objects' class, standard functions etc. and will return let's say a string. But it can be more general (return type-wise) as well. – geckon May 19 '15 at 18:04
  • Maybe you could give me some example inputs and outputs? Not sure how well your use case matches up with YaronK's original stated aim. – a p May 19 '15 at 18:49
  • My use case is a generalized version of YaronK's question. If I try to map it, then the objects are strings and the "key generating function" returns the first two characters of each given object (string) to be compared. – geckon May 19 '15 at 22:15
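
One way the "key generating function" from the comments could be sketched, assuming Python 3: wrap every element in a small object that compares and hashes by the key alone, then hand the wrapped lists to SequenceMatcher unchanged. The names Keyed, first_two and keyed_matcher are hypothetical, and the keys are assumed to be hashable. This is only an illustration of the idea, not a full diff tool.

```python
from difflib import SequenceMatcher


class Keyed:
    """Hypothetical wrapper: compares and hashes by key(obj), never by obj's own methods."""

    __slots__ = ("obj", "key")

    def __init__(self, obj, key):
        self.obj = obj
        self.key = key(obj)  # computed once; assumed to be hashable

    def __eq__(self, other):
        if not isinstance(other, Keyed):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        # Derived from the same key as __eq__, so equal wrappers share a hash.
        return hash(self.key)


def first_two(item):
    # Assumed key: strings are reduced to their first two characters,
    # anything else is used as its own key.
    return item[:2] if isinstance(item, str) else item


def keyed_matcher(a, b, key):
    # The original sequences and their elements are left untouched.
    return SequenceMatcher(None, [Keyed(x, key) for x in a], [Keyed(x, key) for x in b])


l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
matcher = keyed_matcher(l1, l2, first_two)
ratio = matcher.ratio()
blocks = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks()]
print(ratio)   # 0.75
print(blocks)  # [(0, 0, 3), (4, 4, 0)]
```

Because the key feeds both __eq__ and __hash__, the hashable requirement quoted in the question holds by construction, and a different key function can be passed in without touching the objects' classes.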