Using a trie for string segmentation - time complexity?

Question

Problem to be solved:

Given a non-empty string s and a string array wordArr containing a list of non-empty words, determine if s can be segmented into a space-separated sequence of one or more dictionary words. You may assume the dictionary does not contain duplicate words.

For example, given s = "leetcode", wordArr = ["leet", "code"].

Return true because "leetcode" can be segmented as "leet code".

In the above problem, would it work to build a trie that contains each string in wordArr? Then, for each char in the given string s, work down the trie. If a trie branch terminates, then this substring is complete, so pass the remaining string up to the root and do the exact same thing recursively.

This should be O(N) time and O(N) space, correct? I ask because the problem I'm working on says the optimal solution takes O(N^2) time, and I'm not sure what's wrong with my approach.

For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.

Is this correct?

There could be backtracking involved, if `wordArr` does not contain disjoint words. Suppose `wordArr = ["lee", "leet", "code"]`. You would match `lee` first, then waste a lot of time trying to find a match for `tcode`. — chepner, Feb 28 '17 at 17:59

score 1 · Answer 1 · answered Feb 28 '17 at 18:01

Your example would indeed suggest a linear time complexity, but look at this example:

 s = "hello" 
 wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]

Now, first "hell" is tried, but in the next recursion cycle no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended). So you try "he", and at the next level you find "ll", but that fails too, as there is still no "o". Backtracking is needed again: using "l" instead of "ll" leaves "lo", which is in the dictionary, so a solution is now available: "he l lo". ("h e l lo" is another valid segmentation.)

So no, this does not have O(n) time complexity.
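The backtracking described above can be sketched in Java as follows. This is a minimal illustration, not code from the question: the class `BacktrackingSegmenter`, its `Node` type, and the `calls` counter are hypothetical names, and the walk tries shorter words before longer ones.

```java
import java.util.HashMap;
import java.util.Map;

class BacktrackingSegmenter {
    // Minimal trie node: children keyed by character, plus a word-end flag.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;
    }

    private final Node root = new Node();
    long calls = 0; // how many times segment() was entered

    BacktrackingSegmenter(String... words) {
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.wordEnd = true;
        }
    }

    // Returns true if s.substring(start) can be split into dictionary words.
    boolean segment(String s, int start) {
        calls++;
        if (start == s.length()) return true;
        Node cur = root;
        for (int i = start; i < s.length(); i++) {
            cur = cur.children.get(s.charAt(i));
            if (cur == null) return false;          // trie branch ended: backtrack
            if (cur.wordEnd && segment(s, i + 1)) { // word found: restart at the root
                return true;
            }
        }
        return false;
    }
}
```

With the dictionary {"a", "aa"} and an input made of many "a" characters followed by a single "b", every call fails and spawns up to two further calls, so `calls` grows exponentially with the input length. Nothing is remembered between attempts, which is exactly what the trie alone does not fix.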

I wrote this on cxw's answer, but what would be the time complexity of the backtracking portion? — segue_segway, Feb 28 '17 at 18:25

score 0 · Accepted Answer · edited May 23 '17 at 12:32

I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.

This is similar to a regex-match problem: the example you give is like testing an input word against

^(he|ll|ee|zz|o)+$

(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
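To try the regex analogy directly, here is a small Java sketch that builds such a pattern from a dictionary. The class and method names are hypothetical. Note that `java.util.regex` is itself a backtracking matcher, so this illustrates the analogy rather than offering a linear-time solution.

```java
import java.util.regex.Pattern;

class RegexSegmenter {
    // Builds ^(?:w1|w2|...)+$ from the dictionary, quoting each word so that
    // any regex metacharacters inside a word are matched literally.
    static Pattern build(String... words) {
        StringBuilder sb = new StringBuilder("^(?:");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(sb.append(")+$").toString());
    }

    // True if s consists only of dictionary words placed back to back.
    static boolean canSegment(String s, String... words) {
        return build(words).matcher(s).matches();
    }
}
```

For example, `RegexSegmenter.canSegment("hello", "he", "ll", "ee", "zz", "o")` corresponds to the pattern shown above.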

I did find this answer which says:

Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).

So maybe it is O(n^2) with reduced construction effort.

What would the time complexity be with backtracking in a trie? I'm having trouble analyzing that. — segue_segway, Feb 28 '17 at 18:25
@Sunny you are beyond the limits of my theoretical CS knowledge, I'm sorry to say! :) I would guess that you would have something like *O(n log m)* for string length *n* and trie depth *m*, but that's just a guess. I am getting that from a trie search (log time, if the trie is moderately balanced or if you can convert it to a BST) at each character position. I would say check out the NFA algorithm referred to in the linked other answer. Good luck! — cxw, Feb 28 '17 at 18:32

Dolev · Answer 3 · 2017-03-02T11:15:05.037

Let's start by converting the trie to an NFA. We make the root an accept node and add an edge on the empty string (an ε-transition) from every word end in the trie back to the root.

Time complexity: at each step in the trie we can move along at most two edges, the one that represents the current char of the input string and the one back to the root. So T(n) = 2×T(n-1) + c, which gives us O(2^n).
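Unrolling that recurrence shows where the exponential comes from. This is a sketch that assumes a constant base case T(0):

```latex
\begin{aligned}
T(n) &= 2\,T(n-1) + c \\
     &= 4\,T(n-2) + 2c + c \\
     &= 2^{k}\,T(n-k) + c\,(2^{k} - 1) \\
     &= 2^{n}\,T(0) + c\,(2^{n} - 1) = O(2^{n})
\end{aligned}
```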

So indeed it is not O(n), but you can do better using dynamic programming.

We will use a top-down approach.
Before we solve it for any string, check whether we have already solved it.
We can use another HashMap to store the results for already solved strings.
Whenever a recursive call fails, store that string in the HashMap as well.

The idea is to compute every suffix of the word only once. There are only n suffixes, so this ends up at O(n^2).

Code from algorithms.tutorialhorizon.com:

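import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class Segmenter {
  Map<String, String> memoized = new HashMap<>();
  Set<String> dict; // the dictionary words; set this before calling

  // Returns input split into dictionary words, or null if that is impossible.
  String SegmentString(String input) {
    if (dict.contains(input)) return input;
    if (memoized.containsKey(input)) {
      return memoized.get(input);
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
      String prefix = input.substring(0, i);
      if (dict.contains(prefix)) {
        String suffix = input.substring(i, len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null) {
          memoized.put(input, prefix + " " + segSuffix);
          return prefix + " " + segSuffix;
        }
      }
    }
    memoized.put(input, null); // remember failures so each suffix is tried once
    return null;
  }
}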
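The same one-suffix-at-a-time idea can also be written bottom-up over prefixes. The following is a sketch of that variant, not the code from the linked site. The names `BottomUpSegmenter` and `reachable` are hypothetical, and it only answers true or false rather than returning the segmentation.

```java
import java.util.Set;

class BottomUpSegmenter {
    // reachable[i] is true when s.substring(0, i) can be split into dictionary words.
    static boolean canSegment(String s, Set<String> dict) {
        int n = s.length();
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true; // the empty prefix needs no words
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // Extend a segmentable prefix by one dictionary word.
                if (reachable[start] && dict.contains(s.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }
}
```

The two nested loops make O(n^2) dictionary lookups. Each lookup also has to build and hash a substring, so the count of character operations can be higher than that.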

And you can do better!

// Assumes a Trie<String> type with contains(word) and GetAll(input), where
// GetAll returns every dictionary word that is a prefix of input.
Map<String, String> memoized = new HashMap<>();
Trie<String> dict;

String SegmentString(String input) 
{
    if (dict.contains(input)) 
        return input;
    if (memoized.containsKey(input)) 
        return memoized.get(input);

    int len = input.length();
    for (String word : dict.GetAll(input)) 
    {
        String prefix = input.substring(0, word.length());
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null) 
        {
            memoized.put(input, prefix + " " + segSuffix);
            return prefix + " " + segSuffix;
        }
    }
    memoized.put(input, null); // remember failures so each suffix is tried once
    return null;
}

By using the trie so that recursive calls are made only where the trie reaches a word end, you get O(z×n), where z is the length (depth) of the trie.
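For a version that compiles on its own, here is a minimal sketch with a hand-rolled trie. Everything in it is an assumption made for illustration: the `TrieSegmenter` class, its `Node` type, and `wordEnds`, which stands in for the `GetAll` call above. It memoizes by start index instead of by suffix string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrieSegmenter {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;
    }

    private final Node root = new Node();
    // Keyed by start index; a stored null means "this suffix cannot be segmented".
    private final Map<Integer, String> memo = new HashMap<>();

    TrieSegmenter(String... words) {
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.wordEnd = true;
        }
    }

    // End indices of every dictionary word that starts at position start.
    // One walk down the trie, so at most (trie depth) steps.
    private List<Integer> wordEnds(String s, int start) {
        List<Integer> ends = new ArrayList<>();
        Node cur = root;
        for (int i = start; i < s.length(); i++) {
            cur = cur.children.get(s.charAt(i));
            if (cur == null) break;
            if (cur.wordEnd) ends.add(i + 1);
        }
        return ends;
    }

    // Returns a space-separated segmentation of s, or null if none exists.
    String segment(String s) {
        memo.clear();
        return segmentFrom(s, 0);
    }

    private String segmentFrom(String s, int start) {
        if (start == s.length()) return "";
        if (memo.containsKey(start)) return memo.get(start);
        String result = null;
        for (int end : wordEnds(s, start)) {
            String rest = segmentFrom(s, end);
            if (rest != null) {
                String word = s.substring(start, end);
                result = rest.isEmpty() ? word : word + " " + rest;
                break;
            }
        }
        memo.put(start, result); // failures are stored too
        return result;
    }
}
```

With the dictionary from Answer 1, `new TrieSegmenter("hell", "he", "e", "ll", "lo", "l", "h").segment("hello")` returns "h e l lo", because this sketch tries shorter words first. Each start index is solved once and each solve walks at most z trie levels, which is where the z×n count of trie steps comes from. Building the result strings adds copying cost on top of that.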
