Finding the First Common Substring of a set of strings

Question

I am looking for an implementation of a First Common Substring

Mike is not your average guy. I think you are great.
Jim is not your friend. I think you are great.
Being different is not your fault. I think you are great.

Using a Longest Common Substring implementation (and ignoring punctuation), you would get "I think you are great", but I am looking for the first occurring common substring, in this example:

is not your

Perhaps an implementation that generates and ordered list of all common substrings that I can just take the first from.

Edit

The tokens being compared would be complete words. Looking for a greedy match of the first longest sequence of whole words. (Assuming a suffix tree was used in the approach, each node of the tree would be a word)

I would expect, by your definition, that the first occurring common substring would be "i". — Timothy Shields, Jan 24 '14 at 20:03
The first common substring is `M` (case insensitive) or otherwise `i`. — Henk Holterman, Jan 24 '14 at 20:04
First Common Substring - but how you then define a common substring? are these 3 common words? because if you are not searching for the longest one then it could be even 1 common letter — Paweł Bejger, Jan 24 '14 at 20:04
Yes, the tokens being compared would be complete words. Looking for a _greedy_ match of the first longest sequence of whole words. — theChrisMarsh, Jan 24 '14 at 20:11
@HenkHolterman the letter `M` does not exist in the 3rd item in the list. How could it be common to all of them? — theChrisMarsh, Jan 24 '14 at 20:17
I'm really baffled that this is getting so many VTC for *too broad*. It's pretty well focused, I'd say. — Yuck, Jan 24 '14 at 20:34
I think the definition has 2 parts - first, get the common substrings between all lines and second, find the start index of each. From there you have to find the minimum average start index because presumably there could be multiple matching long strings. This problem is going to be very, very expensive to solve and probably a managed language is not the best choice... — Yuck, Jan 24 '14 at 20:49
@Yuck I like the approach you start to define here, and I agree as a matter of polynomial time it will be expensive, but I am not poor :). — theChrisMarsh, Jan 24 '14 at 21:03
How do you determine which is *first* longest? First in the first string? Lowest combined index in all strings? — Erik Philips, Jan 24 '14 at 21:52

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

There are quite a few steps to do this.

Remove Punctuation
Break down Sentences into list of Words
Create string of all combinations of contiguous words (min:1, max:wordCount)
Join the three lists on new list of string (subsentences)
Sort Accordingly.

Code:

static void Main(string[] args)
{
    var sentence1 = "Mike is not your average guy. I think you are great.";
    var sentence2 = "Jim is not your friend. I think you are great.";
    var sentence3 = "Being different is not your fault. I think you are great.";

    //remove all punctuation 
    // http://stackoverflow.com/questions/421616
    sentence1 = new string(
      sentence1.Where(c => !char.IsPunctuation(c)).ToArray());
    sentence2 = new string(
      sentence2.Where(c => !char.IsPunctuation(c)).ToArray());
    sentence3 = new string(
      sentence3.Where(c => !char.IsPunctuation(c)).ToArray());

    //seperate into words
    var words1 = sentence1.Split(new char[] { ' ' },
      StringSplitOptions.RemoveEmptyEntries).ToList();
    var words2 = sentence2.Split(new char[] { ' ' },          
      StringSplitOptions.RemoveEmptyEntries).ToList();
    var words3 = sentence3.Split(new char[] { ' ' }, 
      StringSplitOptions.RemoveEmptyEntries).ToList();

    //create substring list
    var subSentence1 = CreateSubstrings(words1);
    var subSentence2 = CreateSubstrings(words2);
    var subSentence3 = CreateSubstrings(words3);

    //join then like a Sql Table
    var subSentences = subSentence1
        .Join(subSentence2,
            sub1 => sub1.Value,
            sub2 => sub2.Value,
            (sub1, sub2) => new { Sub1 = sub1, 
                                  Sub2 = sub2 })
        .Join(subSentence3,
            sub1 => sub1.Sub1.Value,
            sub2 => sub2.Value,
            (sub1, sub2) => new { Sub1 = sub1.Sub1, 
                                  Sub2 = sub1.Sub2, 
                                  Sub3 = sub2 })
        ;

    //Sorted by Lowest Index, then by Maximum Words
    subSentences = subSentences.OrderBy(s => s.Sub1.Rank)
      .ThenByDescending(s => s.Sub1.Length)
      .ToList();

    //Sort by Maximum Words, then Lowest Index
    /*subSentences = subSentences.OrderByDescending(s => s.Sub1.Length)
        .ThenBy(s => s.Sub1.Rank)
        .ToList();//*/

    foreach (var subSentence in subSentences)
    {
        Console.WriteLine(subSentence.Sub1.Length.ToString() + " " 
          + subSentence.Sub1.Value);
        Console.WriteLine(subSentence.Sub2.Length.ToString() + " " 
          + subSentence.Sub2.Value);
        Console.WriteLine(subSentence.Sub3.Length.ToString() + " " 
          + subSentence.Sub3.Value);
        Console.WriteLine("======================================");
    }

    Console.ReadKey();

}

//this could probably be done better -Erik
internal static List<SubSentence> CreateSubstrings(List<string> words)
{
    var result = new List<SubSentence>();
    for (int wordIndex = 0; wordIndex < words.Count; wordIndex++)
    {
        var sentence = new StringBuilder();
        int currentWord = wordIndex;
        while (currentWord < words.Count - 1)
        {
            sentence.Append(words.ElementAt(currentWord));
            result.Add(new SubSentence() { Rank = wordIndex, 
              Value = sentence.ToString(), 
              Length = currentWord - wordIndex + 1 });
            sentence.Append(' ');
            currentWord++;
        }
        sentence.Append(words.Last());
        result.Add(new SubSentence() { Rank = wordIndex, 
          Value = sentence.ToString(), 
          Length = words.Count - wordIndex });
    }
    return result;
}

internal class SubSentence
{
    public int Rank { get; set; }
    public string Value { get; set; }
    public int Length { get; set; }
}

Result:

3 is not your

3 is not your

3 is not your

======================================

2 is not

2 is not

2 is not

======================================

1 is

1 is

1 is

======================================

2 not your

2 not your

2 not your

======================================

1 not

1 not

1 not

======================================

1 your

1 your

1 your

======================================

5 I think you are great

5 I think you are great

5 I think you are great

======================================

4 I think you are

4 I think you are

4 I think you are

======================================

3 I think you

3 I think you

3 I think you

======================================

2 I think

2 I think

2 I think

======================================

1 I

1 I

1 I

======================================

4 think you are great

4 think you are great

4 think you are great

======================================

3 think you are

3 think you are

3 think you are

======================================

2 think you

2 think you

2 think you

======================================

1 think

1 think

1 think

======================================

3 you are great

3 you are great

3 you are great

======================================

2 you are

2 you are

2 you are

======================================

1 you

1 you

1 you

======================================

2 are great

2 are great

2 are great

======================================

1 are

1 are

1 are

======================================

1 great

1 great

1 great

======================================

score 0 · Answer 2 · answered Jan 25 '14 at 01:56

Here's a little something that will do what you want. You would actually adjust to pre-build your list of strings, pass that in and it will find for you... in this example, the phrase will be based of the string with the shortest string as a baseline.

public void SomeOtherFunc()
{
    List<string> MyTest = new List<string>();
    MyTest.Add( "Mike is not your average guy. I think you are great." );
    MyTest.Add( "Jim is not your friend. I think you are great." );
    MyTest.Add( "Being different is not your fault. I think you are great." );

    string thePhrase = testPhrase( MyTest );
    MessageBox.Show( thePhrase );
}



public string testPhrase(List<string> test)
{

    // start with the first string and find the shortest.
    // if we can't find a short string in a long, we'll never find a long string in short
    // Ex "To testing a string that is longer than some other string" 
    // vs "Im testing a string that is short"
    // Work with the shortest string.
    string shortest = test[0];
    string lastGoodPhrase = "";
    string curTest;
    int firstMatch = 0;
    int lastMatch = 0;
    int allFound;
    foreach (string s in test)
        if (s.Length < shortest.Length)
            shortest = s;

    // Now, we need to break the shortest string into each "word"
    string[] words = shortest.Split( ' ' );

    // Now, start with the first word until it is found in ALL phrases
    for (int i = 0; i < words.Length; i++)
    {
        // to prevent finding "this" vs "is"
        lastGoodPhrase = " " + words[i] + " ";

        allFound = 0;
        foreach (string s in test)
        {
            // always force leading space for string
            if ((" "+s).Contains(lastGoodPhrase))
               allFound++;
            else
               // if not found in ANY string, its not found in all, get out
               break;
        }

        if (allFound == test.Count)
        {
            // we've identified the first matched field, get out for next phase test
            firstMatch = i;
            // also set the last common word to the same until we can test next...
            lastMatch = i;
            break;
        }
    }

    // if no match, get out
    if (firstMatch == 0)
        return "";

    // we DO have at least a first match, now keep looking into each subsequent
    // word UNTIL we no longer have a match.
    for( int i = 1; i < words.Length - firstMatch; i++ )
    {
        // From where the first entry was, build out the ENTIRE PHRASE
        // until the end of the original sting of words and keep building 1 word back
        curTest = " ";
        for (int j = firstMatch; j <= firstMatch + i; j++)
            curTest += words[j] + " ";

        // see if all this is found in ALL strings
        foreach (string s in test)
            // we know we STARTED with a valid found phrase.
            // as soon as a string NO LONGER MATCHES the new phrase,
            // return the last VALID phrase
            if (!(" " + s).Contains(curTest))
                return lastGoodPhrase;

        // if this is still a good phrase, set IT as the newest
        lastGoodPhrase = curTest;
    }

    return lastGoodPhrase;
}

Finding the First Common Substring of a set of strings

2 Answers2