I've been trying to create a nested or recursive effect with SequenceMatcher.
The final goal is comparing two sequences, both may contain instances of different types.
For example, the sequences could be:
l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.
I'd like SequnceMatcher to be applied twice for string instances, so that "Foo"
and "Fo"
will be considered equal, as well as "Bar"
and "Bak"
, and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]
. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.
What I tried doing is write a wrapper for the built-in str class:
from difflib import SequenceMatcher
class myString:
def __init__(self, string):
self.string = string
def __hash__(self):
return hash(self.string)
def __eq__(self, other):
return SequenceMatcher(a=self.string, b=self.string).ratio() > 0.5
Edit: perhaps a more elegant way is:
class myString(str):
def __eq__(self, other):
return SequenceMatcher(a=self, b=other).ratio() > 0.5
By doing this, the following is made possible:
>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75
So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?
SequenceMatcher's documentation states the following:
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
And by definition hashable elements are required to fulfill the following requirement:
Hashable objects which compare equal must have the same hash value.
In addition, do I need to override cmp as well?
I'd love to hear about other solutions that come to mind.
Thanks.