Longest Common Substring without cutting a word

Question

I am very new in programming and I am trying to solve one of longest common sequence/substring problems in Java. So the algorithm question I am working on is to find longest common substring without cutting words.

For instance: given string1 = He had 3 pennies and 5 quarters and string2 = q3nniesp should return pennies.

Other example: string1 = They named the new place face cafe and string2 = e face and output will be e face cafe.

I am trying to figure this algorithm out but I can't decide if I need to convert these to char array or evaluate it as string. The way both strings can have spaces are confusing me.

I followed some of the existing stackoverflow questions and trying to modify this code from https://www.geeksforgeeks.org/:

static String findLongestSubsequence(String str1, String str2) {

        char[] A = str1.toCharArray();
        char[] B = str2.toCharArray();
        if (A == null || B == null) return null;

        final int n = A.length;
        final int m = B.length;

        if (n == 0 || m == 0) return null;

        int[][] dp = new int[n+1][m+1];

        // Suppose A = a1a2..an-1an and B = b1b2..bn-1bn
        for (int i = 1; i <= n; i++ ) {
            for (int j = 1; j <= m; j++) {

                // If ends match the LCS(a1a2..an-1an, b1b2..bn-1bn) = LCS(a1a2..an-1, b1b2..bn-1) + 1
                if (A[i-1] == B[j-1]) dp[i][j] = dp[i-1][j-1] + 1;

                    // If the ends do not match the LCS of a1a2..an-1an and b1b2..bn-1bn is
                    // max( LCS(a1a2..an-1, b1b2..bn-1bn), LCS(a1a2..an-1an, b1b2..bn-1) )
                else dp[i][j] = Math.max(dp[i-1][j], dp[i][j-1]);

            }
        }

        int lcsLen = dp[n][m];
        char[] lcs = new char[lcsLen];
        int index = 0;

        // Backtrack to find a LCS. We search for the cells
        // where we included an element which are those with
        // dp[i][j] != dp[i-1][j] and dp[i][j] != dp[i][j-1])
        int i = n, j = m;
        while (i >= 1 && j >= 1) {

            int v = dp[i][j];

            // The order of these may output different LCSs
            while(i > 1 && dp[i-1][j] == v) i--;
            while(j > 1 && dp[i][j-1] == v) j--;

            // Make sure there is a match before adding
            if (v > 0) lcs[lcsLen - index++ - 1] = A[i-1]; // or B[j-1];

            i--; j--;

        }

        return new String(lcs, 0, lcsLen);
    }

But I keep getting wrong outputs. For instance first output gives output = 3nnies, I am really stuck at this point, can anybody gives a hand or a little scratch of algorithm? Thank y'all.

Substring : A substring is a contiguous sequence of characters within a string, where oder matters. So, string1 = He had 3 pennies and 5 quarters string2 = q3nniesp It should return nnies ? — , Apr 11 '19 at 08:23
It should return pennies, It seems like it doesn't have to be contiguous... — interstellar, Apr 11 '19 at 09:17

Rann Lifshitz · Accepted Answer · 2019-04-11T11:34:20.253

I tried out your original algorithm, unfortunately, it was not in the right direction.

I am under the assumptions that the following guidelines apply:

A matching substring contains characters from the given substring which may be out of order.
A character from the given substring may appear multiple times in the matching substring.

I, therefore, took the liberty of using a brute force algorithm while using java streams:

// complexity of O(N*M), where N is the length of the original string and M is the length of the substring
static String longestCommonSequence(String string, String sub) {
    List<Character> primaryMatch = new ArrayList<>();
    List<Character> secondaryMatch = new ArrayList<>();
    // N iterations loop on original string
    for(char c : string.toCharArray()) {
      // M iterations loop on the substring
      if(sub.indexOf(c) != -1) {
        primaryMatch.add(c);
      }
      else {
        if(!primaryMatch.isEmpty()) {
          // replace secondaryMatch content with the current longet match
          if(secondaryMatch.size() <= primaryMatch.size()) {
            secondaryMatch.clear();
            secondaryMatch.addAll(primaryMatch);
          }
          primaryMatch.clear();
        }
      }
    }
    if(primaryMatch.size() < secondaryMatch.size()) {
      return secondaryMatch.stream().map(String::valueOf).collect(Collectors.joining());
    }
    return primaryMatch.stream().map(String::valueOf).collect(Collectors.joining());
}

The outputs for the inputs you provided are:

string1 = He had 3 pennies and 5 quarters and string2 = q3nniesp ---> pennies
string1 = They named the new place face cafe and string2 = e face ---> ace face cafe

Note the difference in the second output - based on the output behavior you described the result of the above algorithm is the correct one, since ace face cafe is a longer than e face cafe, as multiple usages of characters from the given substring are allowed.

Note a subtle issue in the algorithm: if(secondaryMatch.size() <= primaryMatch.size())

The current implementation will, in case of several matching substrings of the same length, return the last match (based on the order of characters in the original string). If you wish to return the first match, replace the <= with <.

If the assumptions I described are not allowed - please comment on this answer and I will update it according to your specifications.

this worked very well, thank you for your time and attention. — interstellar, Apr 16 '19 at 06:36

Longest Common Substring without cutting a word

1 Answers1