
I have come up with an algorithm for the Longest Common Substring problem. It is usually solved with dynamic programming, using a 2-D array of size m×n, where m and n are the lengths of the two strings under consideration.

I will construct the following matrix for the two strings.

M[i][j] = 1 if s1[i]==s2[j] else 0.

For example, if the strings are: abcxy and pqaabx

The matrix looks as follows (columns correspond to abcxy, rows to pqaabx):

    a b c x y
 p  0 0 0 0 0
 q  0 0 0 0 0
 a  1 0 0 0 0
 a  1 0 0 0 0
 b  0 1 0 0 0
 x  0 0 0 1 0

Now, I search every top-left to bottom-right diagonal for its longest run of consecutive 1s.

The maximum run length over all diagonals is the answer.

I can perform the above operation without building the array explicitly. The time complexity is still O(M*N), but no additional memory is needed.

Can anyone point out where I am going wrong?
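To make the idea concrete, here is a minimal sketch of the diagonal scan without the array. The function name and the way diagonals are enumerated (by the offset `d = j - i`) are my own choices for illustration, not something stated in the question.

```python
def longest_common_substring_length(s1: str, s2: str) -> int:
    """Length of the longest common substring of s1 and s2.

    Walks every diagonal of the implicit 0/1 match matrix and tracks the
    longest run of matches. O(m*n) time, O(1) extra space.
    """
    m, n = len(s1), len(s2)
    best = 0
    # A diagonal is identified by d = j - i (i indexes s1, j indexes s2).
    for d in range(-(m - 1), n):
        i = max(0, -d)   # first cell of this diagonal
        j = i + d
        run = 0          # current run of consecutive matches
        while i < m and j < n:
            if s1[i] == s2[j]:
                run += 1
                best = max(best, run)
            else:
                run = 0
            i += 1
            j += 1
    return best
```

For the strings in the example, `longest_common_substring_length("abcxy", "pqaabx")` gives 2, corresponding to "ab".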

nitish712
  • Looks good to me - is there any reason you think this is not correct? – Peter de Rivaz Feb 27 '14 at 13:54
  • @PeterdeRivaz if this is correct, then why does [wikipedia](http://en.wikipedia.org/wiki/Longest_common_substring_problem) use an algorithm that needs additional memory? Also, I didn't find any solution with `O(MN)` complexity and no additional memory. – nitish712 Feb 27 '14 at 14:31

2 Answers


Your method is correct. As a proof, suppose the longest common substring of S1 and S2 is S1[i..j] = S2[p..q]. This implies S1[i+k] = S2[p+k] for every k from 0 to j-i.

These matching cells all lie on the diagonal starting from (i,p), so they form a run of consecutive 1s on that diagonal.

The dynamic programming solution does the same thing, but instead of computing the table first and then walking the diagonals, it computes each entry from its diagonal parent plus whether or not the two characters match.

EDITED

Regarding your comment about the Wikipedia solution using additional memory: the full table is there only for clarity. In principle you need only two rows of the matrix, plus one variable holding the current maximum. This works because, for any entry (i,j) in the matrix,

M(i,j) = 1 + M(i-1, j-1) if s1[i] == s2[j], else 0

as you can see the current row elements depend only on the elements of the immediately upper row.
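A rough sketch of the two-row version, in case it helps. The function name and the 1-based table with a padding row and column of zeros are assumptions I made to keep the boundary handling simple.

```python
def lcs_length_two_rows(s1: str, s2: str) -> int:
    """Longest common substring length, keeping only two rows of the DP table."""
    n = len(s2)
    prev = [0] * (n + 1)   # row i-1 of the table (index 0 is padding)
    best = 0               # current maximum, kept in one variable
    for i in range(1, len(s1) + 1):
        curr = [0] * (n + 1)   # row i; entries stay 0 on a mismatch
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the run ending at the diagonal parent.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

This uses O(n) extra space rather than O(m*n); passing the shorter string as `s2` keeps the rows as small as possible.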

sukunrt

Your algorithm is correct, but the standard DP approach eliminates your second phase, and makes the solution simpler.

Instead of marking boolean values and then scanning the diagonals for the longest runs, you can compute the diagonal run lengths as you build the matrix, so only one pass is required.
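As an illustration of the single-pass idea, here is a sketch that fills the full table once and also recovers the substring itself. The function name and the choice to track the end position in `s1` are hypothetical details added for the example.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Return one longest common substring, filling the DP table in one pass."""
    m, n = len(s1), len(s2)
    # table[i][j] = length of the common run ending at s1[i-1] and s2[j-1].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best_len, best_end = 0, 0   # best run length and its end index in s1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return s1[best_end - best_len:best_end]
```

With the strings from the question, `longest_common_substring("abcxy", "pqaabx")` returns "ab".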

In terms of time and space complexity, both solutions are O(NxM). Your solution can save some memory if you use a bit matrix representation, while the other solution is probably slightly faster.

Eyal Schneider
  • I actually don't want to use the matrix. My point is that the same operation can also be done without a matrix, using just two `loops`. – nitish712 Feb 27 '14 at 15:35
  • @nitish712: actually you are right, you can do it using only constant additional space and O(NxM) time. Note, however, that the Wikipedia page doesn't suggest what data structure to use; it only describes the recursive process using a table. – Eyal Schneider Feb 27 '14 at 15:41