Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string. It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T:=abc, then abc,bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because any character is the beginning of exactly one suffix.

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x

Implementation

The suffix array is usually not explicitly stored in memory. Instead it is represented as a list of integers, each representing the starting position of a suffix.

abcabx 012345

Example: Given T as defined above, and assume a numbering of its positions from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text-compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementation
145 questions
46
votes
1 answer

Suffix Array Algorithm

After quite a bit of reading, I have figured out what a suffix array and LCP array represents. Suffix array: Represents the _lexicographic rank of each suffix of an array. LCP array : Contains the maximum length prefix match between two consecutive…
Spandan
  • 2,058
  • 4
  • 21
  • 35
41
votes
3 answers

What's the current state-of-the-art suffix array construction algorithm?

I'm looking for a fast suffix-array construction algorithm. I'm more interested in ease of implementation and raw speed than asymptotic complexity (I know that a suffix array can be constructed by means of a suffix tree in O(n) time, but that takes…
Cameron
  • 86,330
  • 19
  • 177
  • 216
21
votes
2 answers

How does LCP help in finding the number of occurrences of a pattern?

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string. Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find…
Cratylus
  • 49,824
  • 60
  • 195
  • 327
17
votes
2 answers

Suffix Arrays vs Suffix Trees

I just wanna know, when a suffix tree is superior to an enhanced suffix array. After reading Replacing suffix trees with enhanced suffix arrays i don't see a reason to use suffix trees anymore. Some methods can get complicated, but you can do…
14
votes
7 answers

Finding the longest repeated substring

What would be the best approach (performance-wise) in solving this problem? I was recommended to use suffix trees. Is this the best approach?
kukit
  • 307
  • 1
  • 3
  • 8
13
votes
8 answers

Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available. When overlapping is allowed, the answer is trivial (deepest parent node in suffix tree). For example for…
12
votes
2 answers

How to sort array suffixes in block sorting

I'm reading the block sort algorithm from the Burrows and Wheeler paper. This a step of the algorithm: Suppose S= abracadabra Initialize an array W of N words W[0, ... , N - 1], such that W[i] contains the characters S'[i, ... , i + k - 1] arranged…
erandros
  • 1,645
  • 3
  • 17
  • 42
11
votes
3 answers

Suffix array nlogn creation

I have been learning suffix arrays creation, & i understand that We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is…
user2565192
  • 744
  • 1
  • 7
  • 17
11
votes
1 answer

Errata in the original paper on suffix arrays?

I'm looking at the pseudo-code given in figure 3 of the original paper introducing suffix arrays "SUFFIX ARRAYS: A NEW METHOD FOR ON-LINE STRING SEARCHES". I can't figure out the logic for lines 4 and 5 (indexing from 0). The lines reads: else if…
nomad
  • 1,800
  • 1
  • 17
  • 33
11
votes
4 answers

Effcient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c When I implement it in Python using suffix-array: example = open("iliad10.txt").read() def comlen(p, q): i = 0 for x…
Hanfei Sun
  • 39,245
  • 33
  • 107
  • 208
11
votes
1 answer

Shortest uncommon substring: shortest substring of one string, that is not a substring of another string

We need to find shortest uncommon substring between two strings i.e. if we have two strings a and b so we need to find the length of shortest substring of a that is not a substring of b. How to solve this problem using suffix array ? To be solved…
Aman Singhal
  • 817
  • 1
  • 10
  • 23
10
votes
4 answers

strcmp for python or how to sort substrings efficiently (without copy) when building a suffix array

Here's a very simple way to build an suffix array from a string in python: def sort_offsets(a, b): return cmp(content[a:], content[b:]) content = "foobar baz foo" suffix_array.sort(cmp=sort_offsets) print suffix_array [6, 10, 4, 8, 3, 7, 11, 0,…
ephes
  • 1,411
  • 1
  • 12
  • 17
8
votes
3 answers

Constructing a Good Suffix Table - Understanding an example

I'm really trying to understand an example on how to construct a good suffix table for a given pattern. The problem is, I'm unable to wrap my head around it. I've looked at numerous examples, but do not know where the numbers come from. So here…
northerner
  • 651
  • 2
  • 16
  • 30
8
votes
9 answers

Efficient String/Pattern Matching in C++ (suffixarray, trie, suffixtree?)

I'm looking for an efficient data structure to do String/Pattern Matching on an really huge set of strings. I've found out about tries, suffix-trees and suffix-arrays. But I couldn't find an ready-to-use implementation in C/C++ so far (and…
Constantin
  • 7,926
  • 12
  • 71
  • 112
7
votes
1 answer

Using suffix array algorithm for Burrows Wheeler transform

I've sucessfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then inverse BWT transform and the output matches the input. Now I wanted to speed up creation of the BW index…
Bim
  • 866
  • 1
  • 9
  • 28
1
2 3
9 10