5

I am using difflib to identify all the matches of a short string in a longer sequence. However it seems that when there are multiple matches, difflib only returns one:

> sm = difflib.SequenceMatcher(None, a='ACT', b='ACTGACT')
> sm.get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=3, b=7, size=0)]

The output I expected was:

[Match(a=0, b=0, size=3), Match(a=0, b=4, size=3), Match(a=3, b=7, size=0)]

In fact the string ACTGACT contains two matches of ACT, at positions 0 and 4, both of size 3 (plus another match of size 0 at the end of the strings).

How can I get multiple matches? I was expecting difflib to return both positions.

JasonMArcher
  • 12,386
  • 20
  • 54
  • 51
dalloliogm
  • 7,737
  • 5
  • 40
  • 55
  • Why are you using`SequenceMatcher` - I'm not sure I'm following your logic of why you're expecting 3 matches? – Jon Clements Feb 27 '15 at 11:24
  • I was expecting two matches on the sequence (at positions 0 and 4), plus one of size 0 (see the difflib documentation - get_matching_blocks always return an extra match of size 0 at the end of the sequence) – dalloliogm Mar 02 '15 at 13:46
  • 2
    The purpose of the difflib as I understand it is to for instance compare two blocks of text side by side and be able to point out what was common and what is not (kind of how the revision works here, the difference between the initial version against the revised version with all the additions and deletions highlighted, the common things being kept unmarked). As such, if ACT was found to match once in both strings, it won't try to match another unless there's another in both strings again. – Jerry Mar 04 '15 at 12:50

2 Answers2

2

Why would you use difflib for that? You should be able to just use standard regular expressions.

import re
pattern = "ACT"
text = "ACTGACT"
matches = [m.span() for m in re.finditer(pattern, text)]

which will give you:

[(0, 3), (4, 7)]

Or does this for some reason not include the information that you are interested in? It of course does not return the last empty match that difflib returns but you could easily just create that.

k-nut
  • 3,221
  • 2
  • 15
  • 28
  • Do you mean: `matches = [m.span() for m in re.finditer(pattern, text)]` ? (That was going to be what I was going to answer, but the OP hasn't clarified exactly why they're using `SequenceMatcher` and why the empty match is important) – Jon Clements Feb 27 '15 at 13:04
  • @JonClements yes you are right. Sorry about the variable mixup. Also using `span()` seems like a good idea. No need for the full match object.. – k-nut Feb 27 '15 at 13:07
  • Thank you for answering, however the one above is just an example. I need to search for a short sequence in big one and I wanted to see if I can use difflib instead of defining patterns like A.*C.*G.*. – dalloliogm Feb 28 '15 at 15:28
  • Can you then provide an example that shows your situation more precisely? – k-nut Feb 28 '15 at 15:30
2

As Jerry pointed out, and k-nut correctly answers, you are using the wrong algorithm for your problem. k-nut's answer honestly isn't all that bad, but it's not exactly the most efficient way to solve this class of problems. I am a bioinformatician and given your question and the example case, it seems very much like you are trying to solve "our" classical DNA sequence alignment/search problem (see the scientific literature by scientific super-stars like Altschul or "Gene" Myers on the issue if you are interested in the nitty-gritty details and want to read one of the most cited papers of all times).

Finding short segments in a database of long segments efficiently is exactly what Altschul's now famous BLAST algorithm solves heuristically and/or can be done using Smith-Waterman for exact lookups. The most efficient way to do this in Python is probably by using BioPython, and in particular, you might want to look at the section describing how to set up a local NCBI BLAST+ instance. If you are not "married" to Python, today there are even faster implementations of BLAST, like FSA-BLAST.

On the other hand, if you need exact matches (as opposed to the heuristics made by BLAST), which might be the case if you don't mind long query times and have a small reference sequence (B in your examples), you could go with the official Smith-Waterman (SW) alignment. If not, and you still need exact matches, first filter matches with BLAST and then reduce your set with SW alignments of the candidates.

You could implement SW in pure Python, even just use any existing pure-Python implementation, but I would only recommended that path for purely educational purposes (check out swalign on GitHub, for example). If you nonetheless want a reasonably strong Python-based implementation check out scikit-bio for SW alignments, although scikit-bio is still in alpha-status. But first read up on the SW WikiPedia page already linked above, and depending on the hardware you have, you might instead use a GPU- or at least SIMD-optimized implementation in CUDA or C++. If you want a nice version with a Python wrapper, check out the SSWlib.

fnl
  • 3,736
  • 4
  • 21
  • 29