A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.
Formal definitions
String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.
Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).
Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.
Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because every position is the beginning of exactly one suffix, and suffixes starting at different positions have different lengths.
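The definition and the remark above can be checked with a few lines of Python. This is only a minimal sketch; the function name `suffixes` is chosen here for illustration and is not part of any library.

```python
def suffixes(text):
    """Return all non-empty suffixes of text, ordered by starting position."""
    # The suffix starting at position i is the slice from i to the end.
    return [text[i:] for i in range(len(text))]


if __name__ == "__main__":
    print(suffixes("abc"))
    # A string of length n has n suffixes, all of different lengths.
    print(len(suffixes("abcabx")))
```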
Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.
Example: Let T := abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d < ... < x < y < z. Then the suffix array of T is as follows.
abcabx
abx
bcabx
bx
cabx
x
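The sorted list above can be reproduced with a naive sketch in Python. It assumes that Python's built-in string comparison (by code point) matches the 'natural' ordering, which holds for lowercase ASCII letters. The function name `sorted_suffixes` is hypothetical, and materializing every suffix like this is only practical for short strings.

```python
def sorted_suffixes(text):
    """Return all suffixes of text in lexicographic order (naive approach)."""
    # Build every suffix explicitly, then sort them as ordinary strings.
    return sorted(text[i:] for i in range(len(text)))


if __name__ == "__main__":
    for suffix in sorted_suffixes("abcabx"):
        print(suffix)
```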
Implementation
The suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of a suffix in T.
Example: Given T as defined above, with its positions numbered from 0 to 5 (a=0, b=1, c=2, a=3, b=4, x=5), the suffix array is represented as the list [0,3,1,4,2,5].
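A minimal sketch of this integer representation in Python: instead of sorting the suffixes themselves, sort the starting positions, using the suffix at each position as the sort key. The name `build_suffix_array` is an assumption made for this example. This is the naive construction, not one of the efficient algorithms; it still creates a temporary copy of each suffix while sorting.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of text, in sorted suffix order."""
    # Sort positions 0..n-1; position i is ordered by the suffix text[i:].
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "abcabx"
    suffix_array = build_suffix_array(text)
    print(suffix_array)
    # Each suffix can be recovered from its starting position when needed.
    for start in suffix_array:
        print(start, text[start:])
```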
The suffix-array tag
Many of the questions tagged suffix-array are related to one of the topics below.
- How to construct suffix arrays efficiently
- How to store, and possibly compress, them efficiently
- How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text-compression
- How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
- What existing and/or ready-to-use implementations of any of the above are known
- Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
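As one illustration of the full-text search use mentioned in the list above, here is a hedged sketch in Python. Because the suffixes are sorted, all suffixes that begin with a given pattern form one contiguous block of the suffix array, and the boundaries of that block can be found by binary search. The names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction shown is the naive one.

```python
def build_suffix_array(text):
    """Naive construction: sort starting positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted starting positions of all occurrences of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in suffix_array[first:lo] starts with pattern.
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    text = "abcabx"
    suffix_array = build_suffix_array(text)
    print(find_occurrences(text, suffix_array, "ab"))
```

Each binary search performs O(log n) comparisons of at most m characters, where n is the length of the text and m the length of the pattern.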