Python find similar sequences in string

Question

I want code that returns the sum of the lengths of all similar sequences in two strings. I wrote the following code, but it only counts one of them:

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
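def similar(a, b):
    # Matching blocks between the lower-cased strings; the last one is a zero-size sentinel.
    c = SequenceMatcher(None, a.lower(), b.lower()).get_matching_blocks()
    # Add up the sizes of all blocks longer than one character.
    return sum(block.size for block in c if block.size > 1)
print(similar(a, b))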

and the output is:

6

I expect it to be 11 (5 for 'apple' plus 6 for 'banana').

score 0 · Answer 1 · answered Oct 04 '17 at 21:12

If we change your code to print each matching block, we can see where the 6 comes from:

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a, b):
    c = SequenceMatcher(None, a.lower(), b.lower()).get_matching_blocks()
    # Each block is a (a, b, size) named tuple, so it can fill the three %d slots directly.
    for block in c:
        print("a[%d] and b[%d] match for %d elements" % block)
similar(a, b)

a[6] and b[0] match for 6 elements

a[12] and b[12] match for 0 elements

Antimony · Accepted Answer · 2017-10-05T00:34:41.413

0

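get_matching_blocks() returns the matching blocks in the order they occur in both strings: it finds the longest contiguous match first, then looks for further matches only to the left and to the right of it. Here the longest match is 'banana' (length 6) in both strings. Since 'apple' comes before 'banana' in a but after it in b, it cannot be reported as a second block, so the only other entry is the zero-size sentinel. Hence the result is 6.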
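As an illustration of how get_matching_blocks() treats order, here is a minimal sketch. The helper name block_texts and the hyphenated test string are made up for this example; only SequenceMatcher and get_matching_blocks() come from difflib.

```python
from difflib import SequenceMatcher

def block_texts(a, b):
    """Return the non-empty matching substrings that difflib reports for a and b."""
    matcher = SequenceMatcher(None, a, b)
    # The final block is always a zero-size sentinel, so it is filtered out here.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]

# Same word order in both strings: both words are reported.
print(block_texts('apple banana', 'apple-banana'))   # ['apple', 'banana']

# Swapped word order: only the longest match survives.
print(block_texts('apple banana', 'banana apple'))   # ['banana']
```

When the words appear in the same order, both show up as blocks. When they are swapped, only 'banana' is reported, which is the behaviour seen in the question.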

Try this instead:

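from difflib import SequenceMatcher

def similar(a, b):
    c = 'something'  # Any value whose length is not 1, so the loop runs the first time
    total = 0        # Named total so the built-in sum() is not shadowed

    while len(c) != 1:
        c = SequenceMatcher(lambda x: x == ' ', a.lower(), b.lower()).get_matching_blocks()

        # Find the largest block and add its size to the running total.
        sizes = [block.size for block in c]
        i = sizes.index(max(sizes))
        total += sizes[i]

        # Cut the matched part out of both strings before matching again.
        a = a[:c[i].a] + a[c[i].a + c[i].size:]
        b = b[:c[i].b] + b[c[i].b + c[i].size:]

    return total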

This "subtracts" the matched part from both strings and matches them again until len(c) is 1, which happens when no matches are left and only the zero-size sentinel block remains.

However, this script doesn't ignore spaces. To do that, I used the suggestion from another SO answer: preprocess the strings before you pass them to the function, like so:

a = 'Apple Banana'.replace(' ', '')
b = 'Banana Apple'.replace(' ', '')

You can include this part inside the function too.
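The subtract-and-rematch idea can also be sketched with find_longest_match(), which returns only the single longest block on each pass. This is a variant under stated assumptions, not the code from the answer: the name similar_total and the min_size parameter are hypothetical, and the comparison is assumed to be case-insensitive and to ignore spaces.

```python
from difflib import SequenceMatcher

def similar_total(a, b, min_size=2):
    """Sum the sizes of common substrings, removing each one after it is counted."""
    # Assumption: case and spaces do not matter for the comparison.
    a = a.lower().replace(' ', '')
    b = b.lower().replace(' ', '')
    total = 0
    while a and b:
        # autojunk=False switches off difflib's heuristic for long inputs.
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        # Explicit bounds keep this working on Python versions without defaults.
        match = matcher.find_longest_match(0, len(a), 0, len(b))
        if match.size < min_size:
            break  # nothing long enough is left to count
        total += match.size
        # Cut the matched part out of both strings before the next pass.
        a = a[:match.a] + a[match.a + match.size:]
        b = b[:match.b] + b[match.b + match.size:]
    return total

print(similar_total('Apple Banana', 'Banana Apple'))  # 11
print(similar_total('Apple Banana Orange', 'Orange Banana Apple'))
```

One caveat: because spaces are stripped and matched parts are cut out, neighbouring fragments are joined, so a counted match may straddle what were originally two words. If that matters, matching word by word may be a better fit.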

edited Oct 05 '17 at 00:34

answered Oct 04 '17 at 21:19

Antimony


Tried to set `a='Apple Banana Orange'` & `b='Orange Banana Apple'`, then get a result `13`? – thewaywewere Oct 04 '17 at 22:03
I have updated my answer to handle more general cases, including the one that you mentioned. Thanks for pointing it out! – Antimony Oct 04 '17 at 22:48
if you move sizes = [i.size for i in c] i = sizes.index(max(sizes)) inside while loop it works better – Mostafa Ghafoori Oct 05 '17 at 00:04
Whoops! That was a mistake I made when copying the lines to SO. Thanks for the heads up. I've updated my answer. – Antimony Oct 05 '17 at 00:34
you should use find_longest_match and subtract that segment instead – Veltzer Doron Sep 30 '20 at 09:52

score 0 · Answer 3 · answered Oct 05 '17 at 00:03

I made a small change to your code and it is working like a charm, thanks @Antimony

def similar(a, b):
    # Strip spaces first so they are never counted as part of a match.
    a = a.replace(' ', '')
    b = b.replace(' ', '')

    c = 'something'  # Any value whose length is not 1, so the loop runs the first time
    total = 0        # Named total so the built-in sum() is not shadowed
    while len(c) != 1:
        c = SequenceMatcher(lambda x: x == ' ', a.lower(), b.lower()).get_matching_blocks()
        sizes = [block.size for block in c]
        i = sizes.index(max(sizes))
        total += sizes[i]
        a = a[:c[i].a] + a[c[i].a + c[i].size:]
        b = b[:c[i].b] + b[c[i].b + c[i].size:]
    return total
