0

So I have a recursive code that gives the best alignment for 2 DNA strands, but the problem is that it performs very slowly (I need it to be recursive). Then I read on an MIT website that the results are additive, which is great for me, but then I thought about it a little bit and I found out there is a problem:

website: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-096-algorithms-for-computational-biology-spring-2005/lecture-notes/lecture5_newest.pdf

The MIT website says that for a given spilt(i,j): first_strand(0, i) and second_strand(0,j) alignment
+ first_strand(i, len) and second_strand(j, len) alignment

equals

first_strand and second strand alignment

but:

GTC GTAA

G with GTA alignment is G-- and GTA TC with A alignment is TC and A- result = G--TC and GTAA-

real best result = GTC- GTAA

Can anyone explain what they mean on the MIT website? I'm probably getting it all wrong!

user2918984
  • 103
  • 3
  • 11
  • Might be useful to provide a link to the actual website you are referencing as the nature of the problem remains unclear. – treddy Nov 23 '13 at 09:58

1 Answers1

5

I assume you're talking about this link.

If so, read it very carefully hundreds of times ;-) It's "additive" given that you're only considering alignments where the split is fixed at a specific (i, j) pair.

In your supposed counterexample, you started by breaking the initial G off of GTC and the initial GTA off of GTAA. Then G-- is the shortest way to change GTC into G. Fine. Continuing with the same split, you still needed to align the remaining right-hand parts: TC with A. Also fine.

This is no claim that this is the best possible split. There's only the claim that it's the best possible alignment given that you're only considering that specific split.

It's one small step in the dynamic programming approach, which is the part you're missing. It remains to compute the best alignments across all possible splits.

Dynamic programming is tricky at first. You shouldn't expect to learn it from staring at telegraphic slides. Read a real textbook, or search the web for tutorials.

Speeding a recursive version

The comments indicate that the code for this "must" be recursive. Oh well ;-)

Caution: I just threw this together to illustrate a general procedure for speeding suitable recursive functions. It's barely been tested at all.

First an utterly naive recursive version:

def lev(a, b):
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(lev(a[:-1], b[:-1]) + (a[-1] != b[-1]),
               lev(a[:-1], b) + 1,
               lev(a, b[:-1]) + 1)

I'll be using "absd31-km" and "ldk3-1fjm" as arguments in all runs discussed here.

On my box, using Python 3, that simple function returns 7 after about 1.6 seconds. It's horribly slow.

The most obvious problem is the endlessly repeated string slicing. Each : in an index takes time proportional to the current length of the string being sliced. So the first refinement is to pass string indices instead. Since the code always slices off a prefix of a string, we only need to pass the "end of string" indices:

def lev2(a, b):
    def inner(j1, j2):
        if j1 < 0:
            return j2 + 1
        if j2 < 0:
            return j1 + 1
        return min(inner(j1-1, j2-1) + (a[j1] != b[j2]),
                   inner(j1-1, j2) + 1,
                   inner(j1, j2-1) + 1)
    return inner(len(a)-1, len(b)-1)

Much better! This version returns 7 in "only" about 1.44 seconds. Still horridly slow, but better than the original. It's advantage would increase on longer strings, but who cares ;-)

We're almost done! The important thing to notice now is that the function passes the same arguments many times over the course of a run. We capture those in "a memo" to avoid all the redundant computation:

def lev3(a, b):
    memo = {}
    def inner(j1, j2):
        if j1 < 0:
            return j2 + 1
        if j2 < 0:
            return j1 + 1
        args = j1, j2
        if args in memo:
            return memo[args]
        result = min(inner(j1-1, j2-1) + (a[j1] != b[j2]),
                     inner(j1-1, j2) + 1,
                     inner(j1, j2-1) + 1)
        memo[args] = result
        return result
    return inner(len(a)-1, len(b)-1)

That version returns 7 in about 0.00026 seconds, over 5000 times faster than lev2 did it.

Now if you've studied the matrix-based algorithms, and squint a little, you'll see that lev3() effectively builds a 2-dimensional matrix mapping index pairs to results in its memo dictionary. They're really the same thing, except that the recursive version builds the matrix in a more convoluted way. On the other hand, the recursive version may be easier to understand and to reason about. Note that the slides you found called the memoization aporoach "top down" and the nested-loop matrix approach "bottom up". Those are nicely descriptive.

You haven't said anything about how your recursive function works, but if it suffers any similar kinds of recursive excess, you should be able to get similar speedups using similar techniques :-)

Tim Peters
  • 55,793
  • 10
  • 105
  • 118
  • _"You shouldn't expect to learn it from staring at telegraphic slides."_ I really like this sentence very much – eyquem Nov 23 '13 at 11:15
  • so ill have to check all of the best alignments for all the given splits and than add what? i'm kinda confused. As i said i have a functions that gives the best score by trying all of the options(recursively) im trying to optimize it, so i thought splitting would work, but i guess its harder than i thought – user2918984 Nov 23 '13 at 16:59
  • It's only additive given that the split *is fixed*. And that's really more trivial than profound. Dynamic programming gives an approach to organizing the "all possible splits" computations in a systematic way that avoids redundant computations. Yes, it's harder than you thought - but still doable :-) – Tim Peters Nov 23 '13 at 17:10
  • So let say i have a function called get_best_ali which returns the best alignment of 2 DNA strands, the problem is with long strands,of course it takes forever, now if ill split the strand in a fixed (i,j) and ill split again and again(obviously with different i,j) and when the strand is small enough ill call my function on the 2 short strands, than ill be able to add all the results?? im still confused – user2918984 Nov 23 '13 at 18:33
  • No, the results are not additive _across_ splits. Please read a textbook or study detailed tutorials. *Across* splits `min()` is used to find the best path. This is simply too complex to explain in SO comments. Have you tried searching the web at all? Try, e.g., "levenshtein distance tutorial". There are many detailed explanations available. The slides you found will just keep on confusing you because they're mostly summarizing the results, not really explaining how the results were obtained. – Tim Peters Nov 23 '13 at 18:54
  • Note: *after* studying tutorials, [click this to see implementations in many programming languages](http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance). They won't make much sense until you understand the principles underlying them. – Tim Peters Nov 23 '13 at 19:18
  • first of all, thank you for being helpful. second, i have to write a recursive function that gives best score of 2 DNA strands, all the codes and stuff are great, and i read a lot, really, before asking here, but all of them are not recursive. so i guess what im trying to ask is if there is a good kind efficient way to do align to strands with recursion? thank you again – user2918984 Nov 23 '13 at 20:00
  • No *serious* programs use recursion for this (except in functional languages), because it's needlessly slower. Have you tried memo-izing your function? [See here](http://stackoverflow.com/questions/1988804/what-is-memoization-and-how-can-i-use-it-in-python/1988826#1988826). That's the standard way to save mountains of wasted redundant work in recursive functions. Beyond that, nobody here can *guess* what your function looks like now, right? Too much blind guessing here ;-) – Tim Peters Nov 23 '13 at 23:11
  • Very helpful, now ill just have to modify it somehow so it will return a tuple that contains (best_score, first dna, second dna). I'm guessing ill have to build the matrix and than traceback? One thing i need to do for sure is read a lot more about dynamic programming. Thank you again – user2918984 Nov 24 '13 at 11:47