
I am using Python's difflib module. Whether or not I set the isjunk argument, the calculated ratio is the same. Shouldn't differences in spaces be ignored when isjunk is lambda x: x == " "?

In [193]: difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").ratio()
Out[193]: 0.8888888888888888

In [194]: difflib.SequenceMatcher(a="a b c", b="a bc").ratio()
Out[194]: 0.8888888888888888
Anthon
RNA
  • Might be wrong, but would `a` and `b` both essentially become `"abc"` if their spaces are being ignored by `difflib`? – mdscruggs May 08 '13 at 03:02
  • yes and it will return `1.0` – Ryan Saxe May 08 '13 at 03:05
  • 2.7 docstring for SequenceMatcher: ".ratio() returns a float in [0, 1], measuring the "similarity" of the sequences. As a rule of thumb, a .ratio() value over 0.6 means the sequences are close matches" – mdscruggs May 08 '13 at 03:07

3 Answers


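isjunk works a little differently than you might think. It does not strip junk characters from the input. Junk characters in the second sequence are never used to start a match; they are only absorbed into a match when identical junk sits directly next to it. Either way, they still count toward the total length that the ratio is computed against. For example, consider the following: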

>>> SequenceMatcher(lambda x: x in "abcd", " abcd", "abcd abcd").ratio()
0.7142857142857143

The first four characters of the second string ("abcd") are all junk, so no match can start there. The match starts at the space in each string instead and is then extended over the "abcd" that follows it. The above SequenceMatcher therefore finds 10 matching characters (5 in each string) and 4 non-matching characters (the junk first four characters of the second string). This gives you a ratio of 10/14 (0.7142857142857143).
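To make that arithmetic visible, here is a small sketch that asks the matcher for its blocks and recomputes the ratio by hand as 2 * matched / total. The variable names (sm, blocks, matched, total) are only illustrative and are not part of difflib.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"
sm = SequenceMatcher(lambda ch: ch in "abcd", a, b)

# Each block is a (a_index, b_index, size) triple; the list always
# ends with a zero-length terminator block.
blocks = sm.get_matching_blocks()

# Matched characters per string; the terminator contributes 0.
matched = sum(block.size for block in blocks)
total = len(a) + len(b)

print(blocks)                        # one real block: a=0, b=4, size=5
print(matched, total)                # 5 14
print(2.0 * matched / total)         # same value as sm.ratio()
```

The single real block starts at index 4 of the second string, which is the space, and covers the "abcd" after it.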

In your case, the first string "a b c" matches the second string at indices 0, 1, and 2 ("a b"). Index 3 of the first string (" ") has no counterpart in the second string, so it is simply left unmatched; marking it as junk does not remove it from the length of the string. Index 4 ("c") then matches index 3 of the second string. That is 4 matched characters in each string, so 8 of the 9 characters in total take part in a match, giving a ratio of 8/9 (0.8888888888888888).

You might want to try this instead:

>>> import difflib
>>> a, b = "a b c", "a bc"
>>> c = a.replace(' ', '')
>>> d = b.replace(' ', '')
>>> difflib.SequenceMatcher(a=c, b=d).ratio()
1.0
πόδας ὠκύς

You can see what it considers to be matching blocks:

>>> difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=4, b=3, size=1), Match(a=5, b=4, size=0)]

The first two tell you that it matches "a b" to "a b" and "c" to "c". (The last one is the zero-length terminator that always ends the list.)

The question is why "a b" can be matched. I found the answer in the source code. The algorithm first finds a set of matching blocks by repeatedly calling find_longest_match. What's notable about find_longest_match is that it allows junk characters at the ends of a block:

If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block.  Then that block is extended as
far as possible by matching (only) junk elements on both sides.  So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.

This means that it first matches "a " (the trailing space is junk, but it is picked up because it is adjacent to the matching "a") and then, from the remainder of the strings, "b".

Then, the interesting part: the code does one last check to see if any of the blocks are adjacent, and merges them if they are. See this comment in the code:

    # It's possible that we have adjacent equal blocks in the
    # matching_blocks list now.  Starting with 2.5, this code was added
    # to collapse them.

So basically it matches "a " and "b" as separate blocks, then merges the two adjacent blocks into "a b" and reports that as a single match, even though the space character is junk.
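The individual steps can be reproduced by calling find_longest_match directly on the sub-ranges that get_matching_blocks would visit. This is only a sketch: the ranges are worked out by hand for these two strings, and the names first, second, third and merged are illustrative.

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(isjunk=lambda ch: ch == " ", a="a b c", b="a bc")

# Whole strings: the non-junk match "a" is found first, then extended
# over the identical junk space next to it.
first = sm.find_longest_match(0, 5, 0, 4)
print(first)    # Match(a=0, b=0, size=2)

# Remainder to the right of the first block: "b c" versus "bc".
second = sm.find_longest_match(2, 5, 2, 4)
print(second)   # Match(a=2, b=2, size=1)

# Remainder to the right of the second block: " c" versus "c".
third = sm.find_longest_match(3, 5, 3, 4)
print(third)    # Match(a=4, b=3, size=1)

# get_matching_blocks() collapses the adjacent first and second blocks.
merged = sm.get_matching_blocks()
print(merged)   # starts with Match(a=0, b=0, size=3)
```

The first two blocks end and begin at the same positions in both strings (index 2), which is why they are collapsed into a single block of size 3.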

chappy

Both invocations return the same matching blocks: three entries, the last of which is the zero-length terminator. You can check this with:

print(difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").get_matching_blocks())
print(difflib.SequenceMatcher(a="a b c", b="a bc").get_matching_blocks())

(They are the same because of the way the algorithm merges adjacent matches.)

Since the ratio depends only on the total length of these matches and the lengths of the original strings (junk included), you get the same ratio.
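As a rough check of that dependency, the ratio can be recomputed from the blocks alone. The helper ratio_from_blocks below is hypothetical (it is not a difflib function); it assumes the documented definition of ratio() as 2.0 * M / T, where M is the number of matched elements and T is the combined length of both sequences.

```python
from difflib import SequenceMatcher


def ratio_from_blocks(sm):
    """Recompute the ratio from the matching blocks (illustrative helper)."""
    matched = sum(block.size for block in sm.get_matching_blocks())
    total = len(sm.a) + len(sm.b)   # junk characters are still counted here
    return 2.0 * matched / total if total else 1.0


with_junk = SequenceMatcher(isjunk=lambda ch: ch == " ", a="a b c", b="a bc")
without_junk = SequenceMatcher(a="a b c", b="a bc")

# Identical blocks and identical lengths, so the ratio cannot differ.
print(with_junk.get_matching_blocks() == without_junk.get_matching_blocks())  # True
print(ratio_from_blocks(with_junk), ratio_from_blocks(without_junk))
```

Both calls give 8/9, the same value the question reports for ratio().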

Anthon