
A suffix array will index all the suffixes for a given list of strings, but what if you're trying to index all the possible unique substrings? I'm a bit new at this, so here's an example of what I mean:

Given the string

abcd

A suffix array indexes (at least to my understanding)

(abcd,bcd,cd,d)

I would like to index (all the substrings)

(abcd,bcd,cd,d,abc,bc,c,ab,b,a)

Is a suffix array what I'm looking for? If so, what do I do to get all the substrings indexed? If not, where should I be looking? Also what would I google for to contrast "all substrings" vs "suffix substrings"?

Ray Toal
Arjun
  • See this: http://stackoverflow.com/questions/2560262/generate-all-unique-substrings-for-given-string – Yang G Feb 22 '12 at 06:05

3 Answers


The suffix array does what you need already, because every substring is a prefix of one of the suffixes. Specifically, given your suffix array

abcd bcd cd d

and suppose you are looking for the substring "bc". You can find it by looking for all suffixes that start with "bc" (there is only one in this case, "bcd"). Since a suffix array is lexicographically sorted, finding all suffixes that share a certain prefix corresponds to a binary search across the suffix array, and the result will be one contiguous range of entries of the suffix array.
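As a minimal sketch of that binary search (the function names `build_suffix_array` and `find_range` are made up for this illustration, and the array is built naively by sorting the suffixes, which is fine for small inputs only):

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration; real implementations use
    faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text, sa, pattern):
    """Return (lo, hi) so that sa[lo:hi] are the suffixes starting with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "abcd"
sa = build_suffix_array(text)            # [0, 1, 2, 3]
lo, hi = find_range(text, sa, "bc")      # (1, 2)
print([text[i:] for i in sa[lo:hi]])     # ['bcd']
```

An empty range (`lo == hi`) means the pattern does not occur in the text.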

However, there are optimised search methods using the suffix array combined with auxiliary data structures, such as the LCP (longest-common prefix) array, or wavelet trees. See Navarro's 2007 survey for a description of such methods (DOI 10.1145/1216370.1216372).

To take into account the comments made below, I suggest combining each suffix with the number of substrings it represents. In a simple example like the above this would be

4 abcd
3 bcd
2 cd
1 d

because, for example, the first suffix "abcd" represents the 4 substrings "a", "ab", "abc", "abcd". However, in a more complex example, say for the string "abcabxdabe", the first two entries of the suffix array would be

10 abcabxdabe
1 abe

because the second entry represents substrings "a", "ab" and "abe", but "a" and "ab" are also represented by the first entry.

How do you calculate the number of substrings an entry represents? Take the length of the suffix minus the length of the longest prefix it has in common with the previous suffix in the sorted array. E.g. in the "abe" example, that is 3 (its length) minus 2 (the length of "ab", the longest prefix it shares with the previous entry). So these numbers can be generated in one pass over the suffix array, and even faster if you have also generated the LCP (longest-common prefix) array.
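A rough sketch of that one-pass calculation, assuming the suffixes are simply stored as strings and sorted naively (the helper names `lcp_len` and `new_substring_counts` are invented for this example):

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def new_substring_counts(text):
    """Pair each sorted suffix with the number of substrings it newly represents."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    counts = []
    prev = ""  # the first suffix has no predecessor, so its LCP is 0
    for s in suffixes:
        counts.append((len(s) - lcp_len(prev, s), s))
        prev = s
    return counts


print(new_substring_counts("abcd"))
# [(4, 'abcd'), (3, 'bcd'), (2, 'cd'), (1, 'd')]
print(new_substring_counts("abcabxdabe")[:3])
# [(10, 'abcabxdabe'), (1, 'abe'), (5, 'abxdabe')]
```

The sum of the counts is the number of distinct substrings of the text.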

The next step would be to generate accumulated counts:

10 abcabxdabe
11 abe
16 abxdabe
...

and then to find an efficient way to make use of the accumulated counts. E.g. if you want to get the 13th substring lexicographically, you'd have to find the first entry that has an accumulated count greater than or equal to 13. That would be "16 abxdabe" above. Then remove the prefix it shares with the previous entry (yields "xdabe"), and then jump to the position after the 2nd character (because the previous entry has accumulated count 11, and 13-11==2), so you get "abxd" as the 13th substring lexicographically.
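One possible way to use the accumulated counts is a binary search over them. The sketch below assumes a naively sorted list of suffix strings and a 1-based `k`; the name `kth_substring` is hypothetical, and the lookup uses `bisect_left` from the standard library:

```python
from bisect import bisect_left
from itertools import accumulate


def kth_substring(text, k):
    """Return the k-th (1-based) distinct substring in lexicographic order."""
    suffixes = sorted(text[i:] for i in range(len(text)))

    # LCP of each suffix with its predecessor in sorted order.
    lcps = []
    prev = ""
    for s in suffixes:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        lcps.append(n)
        prev = s

    # Accumulated counts of newly represented substrings.
    totals = list(accumulate(len(s) - l for s, l in zip(suffixes, lcps)))
    if not 1 <= k <= (totals[-1] if totals else 0):
        raise IndexError("k out of range")

    # First entry whose accumulated count is >= k.
    idx = bisect_left(totals, k)
    before = totals[idx - 1] if idx > 0 else 0

    # Keep the shared prefix plus (k - before) further characters.
    return suffixes[idx][:lcps[idx] + (k - before)]


print(kth_substring("abcabxdabe", 13))  # abxd
```

Storing the suffixes as strings takes quadratic space; with a real suffix array you would keep start positions and an LCP array instead and slice the text only for the final result.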

jogojapan
  • Nice, I had thought of this, however what if I was looking to find the nth substring lexicographically. Wouldn't I have to traverse the array and add entries for the non-suffix substrings? Because if I retrieved the substring at index n, this would only be counting the suffixes. Do I make any sense? Sorry if I don't.. – Arjun Feb 22 '12 at 07:12
  • I see, and yes, that makes sense. I misunderstood what you meant by "indexing" originally. But I believe what you are asking for can also be done using a slightly expanded suffix array. Specifically, you combine each suffix in the array with a number indicating how many unique substrings it represents. The _substrings it represents_ are basically the prefixes it contains, minus the prefixes already represented by previous suffixes. I will describe the details of this by editing the answer. – jogojapan Feb 22 '12 at 07:32
  • Wow, thanks for that elegant solution. I am currently generating the LCP array, so this seems like it should work quite well. Thanks so much for your help, and I'll let you know if it works out! – Arjun Feb 24 '12 at 06:45
  • Good explanation. I really like answers of this type. – rpax Jun 11 '14 at 11:28

As has been answered already, substrings are prefixes of suffixes. Sometimes you may prefer to go the other way and work with suffixes of prefixes.

Beyond that, it's unclear what you're looking for with "unique substrings." I'd suggest you look up the words: type, token, maximal, supermaximal. You should have no trouble finding these in the suffix array literature.

Dale Gerdemann
  • It occurs to me that there's a slightly more fun way to say the same thing. Once you get your suffix arrays up and running, collect a corpus of papers about suffix arrays and run them through your program. You will see then what technical vocabulary is used in the field. And if you keep your eyes open, you'll probably get a few surprises. And, of course, if you write a paper yourself, then run that through the suffix array. And don't forget mathematical kinds of strings with special properties. Enjoy! Better Living with Suffix Arrays! – Dale Gerdemann Feb 22 '12 at 20:26
  • Your SA corpus must include Abouelhoda et al. And I would add the "Linearized Suffix trees" paper by Kim et al. The latter has a good "review of the literature" section which really helps to get through some of the more obscure parts of Abouelhoda. For suffix arrays from a "recreational mathematics" perspective, read Klaus Schürmann's book. And (extra special tip) check out Gusfield's videotaped lectures at UC Davis. – Dale Gerdemann Feb 22 '12 at 21:40
  • What I mean by unique substrings is this: Let's say I had an array of 2 strings: [abcd,adcb]. First I'd find the substrings of abcd (a,ab,abc,abcd,b,bc,bcd,c,cd,d), then I'd find the substrings of adcb (a,ad,adc,adcb,d,dc,dcb,c,cb,b). Then I'd take the union of these sets: (a,ab,abc,abcd,b,bc,bcd,c,cd,d,ad,adc,adcb,dc,dcb,cb). These would be the unique substrings of the string array. – Arjun Feb 24 '12 at 06:51
  • And thanks for the literature suggestions, I'll definitely take a look soon, sounds fascinating. – Arjun Feb 24 '12 at 06:52

You should use a variation of a 'Trie'. Essentially, if you have ABCD, create a tree that merges the paths root->A->B->C->D, root->B->C->D, root->C->D and root->D. Then, at every node, keep a list of the locations where the string spelled by the path root->.->.->node was observed.
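A small sketch of such a trie, assuming nested dictionaries as nodes and start offsets as the "locations" (`build_suffix_trie` and `occurrences` are names invented for this example). Note that it needs quadratic time and memory in the length of the text, so it only suits short strings:

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie, recording start positions."""
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            # The path from the root to this node spells a substring
            # that begins at offset `start`.
            node["positions"].append(start)
    return root


def occurrences(root, pattern):
    """Return the start positions of pattern, or [] if it does not occur."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


trie = build_suffix_trie("ABCD")
print(occurrences(trie, "BC"))  # [1]
```

Every node other than the root corresponds to exactly one distinct substring, so counting the nodes gives the number of unique substrings.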

ElKamina