Longest common substring via suffix array: do we really need unique sentinels?

Question

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.

Unless I am mistaken, the reason for this is so that, when we construct the LCP array (by counting how many leading characters adjacent suffixes have in common), we don't count the sentinel value in the case where two sentinels happen to be at the same index in both of the suffixes we are comparing.

This means we can write code like this:

for each index i in the shorter suffix
    if suffix_1[i] == suffix_2[i]
        increment count of common characters
    else
        stop

However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.

Would it not be simpler to implement if we used a single sentinel character and just counted the number of characters in common, stopping when we reach that sentinel, like this:

set sentinel = '#'
for each index i in the shorter suffix
    if suffix_1[i] == suffix_2[i] and suffix_1[i] != sentinel
        increment count of common characters
    else
        return

Or, am I missing something fundamental here?
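For concreteness, here is a rough Python sketch of the second idea. The function name `common_prefix_before_sentinel` and the choice of `'#'` are only illustrative, and it assumes the sentinel never occurs inside any of the input strings.

```python
SENTINEL = "#"  # assumption: this character does not occur in any input string


def common_prefix_before_sentinel(suffix_1, suffix_2, sentinel=SENTINEL):
    """Count leading characters shared by both suffixes, stopping at the sentinel."""
    count = 0
    # zip stops at the end of the shorter suffix.
    for c1, c2 in zip(suffix_1, suffix_2):
        # Stop at the first mismatch, or as soon as the shared character is the sentinel.
        if c1 != c2 or c1 == sentinel:
            break
        count += 1
    return count
```

With the suffixes `"ab#cd#"` and `"ab#"`, the two sentinels line up at index 2, and the count stops there instead of including them.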

Intuitively, your suggestion sounds valid, however I am not an expert on this ... — 500 - Internal Server Error, Aug 29 '19 at 14:16
I have exactly the same question. The source code may help: https://github.com/williamfiset/Algorithms/tree/master/src/main/java/com/williamfiset/algorithms/strings , but I don't code Java — Tianyi Shi, Nov 01 '20 at 13:22
I don't even understand why a sentinel is needed in the first place. If it were in a suffix TREE a sentinel is needed to do proper tree traversal. However I can't really see the usefulness of a sentinel in a suffix ARRAY of a single string. In addition, even when constructing the suffix array of multiple strings, we can know which original string a character belongs to by looking at its position by constructing a range array e.g. [[0,4], [4,6], [6,12]] for three strings of length 4, 2, and 6 (then, if we have a position given by SA, say, 5, we know this character belongs to the second string) — Tianyi Shi, Nov 01 '20 at 13:26

score 0 · Answer 1 · answered Nov 01 '20 at 19:01

Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14

When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2, and 5, the boundaries 4, 6, and 11 will be recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).

Then, to identify which original string every suffix belongs to, just do a binary search.
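A minimal sketch of the boundary lookup in Python, using the standard `bisect` module. The helper names `concat_with_boundaries` and `owner` are made up for illustration and are not taken from the linked issue.

```python
from bisect import bisect_right
from itertools import accumulate


def concat_with_boundaries(strings):
    """Concatenate the strings and record the exclusive end index of each one."""
    text = "".join(strings)
    boundaries = list(accumulate(len(s) for s in strings))
    return text, boundaries


def owner(boundaries, pos):
    """Return the 0-based index of the original string that contains position pos."""
    # bisect_right counts how many boundaries are <= pos, i.e. how many
    # strings end at or before this position.
    return bisect_right(boundaries, pos)


text, boundaries = concat_with_boundaries(["abcd", "ef", "ghijk"])
```

Here `boundaries` is `[4, 6, 11]`, and `owner(boundaries, 5)` is `1`, i.e. the second string.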

score 0 · Answer 2 · edited Nov 01 '20 at 19:44

The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."

The longer answer:

At a high level, the basic algorithm described in the video goes like this:

1. Construct a generalized suffix array for the strings T₁ and T₂.
2. Construct an LCP array for that resulting suffix array.
3. Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
4. Find the largest LCP between any two such suffixes; call it k.
5. Extract the first k characters from either of the two suffixes.
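To make those steps concrete, here is a deliberately naive Python sketch. It is not the linear-time construction the video alludes to: it builds the generalized suffix array by sorting every suffix of each string, tagged with the index of the string it came from, and it computes each LCP by direct comparison. The function name `longest_common_substring` is just an illustrative choice. Because each suffix is stored separately with its owner, this version needs no sentinels at all.

```python
def longest_common_substring(t1, t2):
    """Naive longest common substring via a sorted list of tagged suffixes."""
    # Step 1: every suffix of each string, tagged with its owner (0 or 1).
    # Sorting whole suffixes is slow (roughly O(n^2 log n)); a real SACA is much faster.
    suffixes = sorted(
        (s[i:], owner)
        for owner, s in enumerate((t1, t2))
        for i in range(len(s))
    )

    best = ""
    for (a, owner_a), (b, owner_b) in zip(suffixes, suffixes[1:]):
        # Step 3: only adjacent suffixes that come from different strings count.
        if owner_a == owner_b:
            continue
        # Step 2: LCP of this adjacent pair, computed directly.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        # Steps 4 and 5: remember the longest shared prefix seen so far.
        if k > len(best):
            best = a[:k]
    return best
```

For example, `longest_common_substring("xabcdy", "zabcdw")` gives `"abcd"`.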

So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.

(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)

Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and, being distinct, they can't be a part of the overlap between adjacent suffixes.

So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
