Consider a string of length $n \geq 3$ over an alphabet $\{1,\dots, \sigma\}$. An edit operation is a single-symbol insertion, deletion, or substitution. The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other. Given a string $S$ of length $n$ with $S_i \in \{1,\dots, \sigma\}$, my question concerns the number of distinct strings that are at edit distance at most $3$ from $S$.

Let us write $g_{k, \sigma}(S)$ for the number of distinct strings over the alphabet $\{1,\dots, \sigma\}$ which are edit distance at most $k$ from $S$, i.e. $g_{k,\sigma}(S) = |\{S' : d(S', S) \leq k\}|$ where $d(-,-)$ is the edit distance.

Let $X_n$ be a random variable representing a random string over the alphabet $\{1,\dots, \sigma\}$ of length $n$, with the symbols chosen uniformly and independently.

This leads directly to my question:

With $X_n$ as defined above, what is:

$$\mathbb{E}(g_{3, \sigma}(X_n))\;?$$

For $\sigma=2$ we can get an explicit formula $(40+6n-4n^2)/2^n-83/2+(331/12)n-6n^2+(2/3)n^3$. So my question is, what does the dependency on the alphabet size $\sigma$ look like?
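For small $n$ and $\sigma$, the expectation can be computed exactly by brute force, which gives a way to test any candidate formula. The following is a minimal sketch: the names `single_edits`, `ball`, `expected_g`, and `binary_formula`, and the `same_length` flag, are illustrative and not part of the question. The flag restricts the count to strings of the same length as $S$, which appears to be the reading under which the binary formula above applies.

```python
from fractions import Fraction
from itertools import product


def single_edits(s, sigma):
    """All strings exactly one insertion, deletion, or substitution away from tuple s."""
    out = set()
    for i in range(len(s) + 1):
        for c in range(1, sigma + 1):
            out.add(s[:i] + (c,) + s[i:])          # insertion before position i
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                 # deletion of position i
        for c in range(1, sigma + 1):
            if c != s[i]:
                out.add(s[:i] + (c,) + s[i + 1:])  # substitution at position i
    return out


def ball(s, k, sigma):
    """Set of all strings at edit distance at most k from s (breadth-first search)."""
    seen = {s}
    frontier = {s}
    for _ in range(k):
        frontier = {t for u in frontier for t in single_edits(u, sigma)} - seen
        seen |= frontier
    return seen


def expected_g(n, k, sigma, same_length=False):
    """Exact average of g_{k,sigma} over all sigma**n strings of length n."""
    total = 0
    for s in product(range(1, sigma + 1), repeat=n):
        b = ball(s, k, sigma)
        total += sum(1 for t in b if len(t) == n) if same_length else len(b)
    return Fraction(total, sigma ** n)


def binary_formula(n):
    """The sigma = 2 formula quoted in the question, evaluated exactly."""
    return (Fraction(40 + 6 * n - 4 * n * n, 2 ** n) - Fraction(83, 2)
            + Fraction(331, 12) * n - 6 * n * n + Fraction(2, 3) * n ** 3)
```

With `same_length=True`, `expected_g(3, 3, 2)` and `expected_g(4, 3, 2)` agree with `binary_formula(3)` and `binary_formula(4)`. The running time grows quickly with $n$, $k$, and $\sigma$, so this is only suitable for spot checks.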

  • Can you be a bit more precise about what you want? Like, $\lim_{n \to \infty} \frac{1}{n^3}\mathbb{E}(g_{3,\sigma}(X_n))$? – mathworker21 Dec 23 '20 at 00:37
  • @mathworker21. Ideally a closed form formula for any $\sigma$ and $n$ but a limit as you describe would be great too. – graffe Dec 23 '20 at 06:15
  • Must the transformed string have the same length $n$ as the original (every delete operation paired with an insertion) or can their lengths differ by as many as 3? – Bill Vander Lugt Dec 27 '20 at 23:24
  • @BillVanderLugt. They can differ by 6 in fact. In one case three symbols could be deleted and in the other three symbols could be inserted. – graffe Dec 28 '20 at 06:47
  • In the codegolf question you link to it is required that the starting and ending strings have the same length. – Ross Millikan Dec 28 '20 at 21:38
  • @RossMillikan You are right. I don’t mind which version we consider. – graffe Dec 28 '20 at 21:44
  • One easy limit: For fixed $n$, as $\sigma\rightarrow \infty$ we will have something like $E(g_{3,\sigma}(X_n)) \sim \left(\binom{n}{3}+(n+1)\binom{n}{2}+\binom{n+2}{2}n+\binom{n+3}{3}\right)\sigma^3 + \mathcal{O}(\sigma^2)$. The intuition being that for large $\sigma$ there is a vanishing probability of having two identical letters, which simplifies the calculation. – Yly Dec 28 '20 at 23:09

1 Answer

Varying v. Unchanged String Length

If, as you initially indicated in response to my comment, the length of the transformed string can differ from the length of the original, then this problem becomes vastly more difficult because the set of distinct editing operations (operations that might potentially yield a distinct result) includes all 20 of the following:

  • length +3 = 3 insertions
  • length +2 = 2 insertions and 0 or 1 substitutions
  • length +1 = 1 insertion and 0, 1, or 2 substitutions; 2 insertions and 1 deletion
  • length unchanged = 0, 1, 2, or 3 substitutions; 1 deletion, 1 insertion, and 0 or 1 substitutions
  • length -1 = 1 deletion and 0, 1, or 2 substitutions; 2 deletions and 1 insertion
  • length -2 = 2 deletions and 0 or 1 substitutions
  • length -3 = 3 deletions

Whenever multiple insertions or multiple deletions are performed, moreover, counting becomes inordinately difficult. If, on the other hand, we require that the length remain unchanged, we have only 6 editing combinations to consider and the problem becomes more tractable because none of those 6 combinations involves multiple insertions or multiple deletions. Indeed, the counting for each of the six cases becomes relatively straightforward; the trickiest bit is discounting to avoid double-counting instances when two different editing operations will produce the same string--a problem solved in an answer to another question.

The Six Cases and the Danger of Overcounting
To get our bearings initially, we can generalize this logic:

  • The string must maintain $n$ symbols.
  • The expected number of groups of identical symbols is one more than the expected number of adjacent, unequal symbol pairs: $1+\frac{(n-1)(\sigma-1)}{\sigma}=\frac{n(\sigma-1)+1}{\sigma}$, which equals $\frac{n+1}{2}$ when $\sigma=2$
  • The expected number of adjacent, identical symbol pairs is $\frac{n-1}{\sigma}$
  • The number of ends is 2.

A fine-grained consideration of the five possible types of single edits thus yields:

  • The number of possible substitutions is $n(\sigma-1)$
  • The expected number of shrinkages of a group of identical symbols is $\frac{n(\sigma-1)+1}{\sigma}$
  • The expected number of expansions of a group of identical symbols with the same symbol is $\frac{n(\sigma-1)+1}{\sigma}$
  • The expected number of insertions into a group of identical symbols with the same symbol is $\frac{n-1}{\sigma}$
  • The number of possible insertions of a different character at the beginning or end is $2(\sigma-1)$
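The averages over groups and adjacent pairs can be checked by exhaustive enumeration for small $n$ and $\sigma$. The following is a minimal sketch; the function name `run_statistics` is illustrative and not from the linked answer.

```python
from fractions import Fraction
from itertools import product


def run_statistics(n, sigma):
    """Exact average number of runs (groups of identical symbols) and of
    adjacent identical pairs over all sigma**n strings of length n."""
    runs = pairs = 0
    for s in product(range(sigma), repeat=n):
        equal = sum(1 for a, b in zip(s, s[1:]) if a == b)
        pairs += equal
        # Each adjacent unequal pair starts a new run: runs = 1 + (n - 1 - equal).
        runs += n - equal
    total = sigma ** n
    return Fraction(runs, total), Fraction(pairs, total)
```

For example, `run_statistics(4, 2)` returns `(Fraction(5, 2), Fraction(3, 2))`: on average $5/2$ groups and $3/2$ adjacent identical pairs in a random binary string of length 4.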

We can now apply that basic logic to each of our six cases:

  1. no edits
    Performing no edits whatsoever yields only the original string, so 1 result for this case.

  2. one substitution
    There are $n$ different symbols and $\sigma-1$ ways each can be substituted into a different symbol, so $n(\sigma-1)$ results.

  3. two substitutions
    There are $\binom{n}{2}$ different pairs and $(\sigma-1)^2$ ways to modify each: $\binom{n}{2}(\sigma-1)^2$ results.

  4. three substitutions
    There are $\binom{n}{3}$ different trios and $(\sigma-1)^3$ ways to modify each: $\binom{n}{3}(\sigma-1)^3$.

  5. one deletion, one insertion, no substitutions
    For this case, we can generalize this solution for $\sigma=2$ to any $\sigma$, using the same logic to avoid double-counting those instances where two substitutions would yield the same result as one deletion and one insertion.

Let's count the cases where the insertion is to the left of the deletion and then multiply by 2. The combined effect of the insertion and the deletion is to shift all $k$ bits between them to the right while replacing the first one and removing the last one. This result can also be achieved by at most $k$ substitutions, so we need $k>2$. Inserting $x$ within a run of $x$s has the same effect as inserting at the end of the run. Thus we can count all insertions with different effects once by always inserting the bit complementary to the one to the right of the insertion. Similarly, a deletion within a run has the same effect as a deletion at the start of the run, so we should only count deletions that follow a change between 0 and 1. That gives us an initial count of:

$2\cdot\frac12\sum_{k=3}^n(n+1-k)=\sum_{k=1}^{n-2}k=\frac{(n-1)(n-2)}2\;$

Because the tricky logic to prevent double-counting carries directly over, the only modification required is to substitute a variable $\sigma$ for the fixed $\sigma=2$:

$2\cdot\frac{1}{\sigma}\sum_{k=3}^n(n+1-k)=2\cdot\frac{1}{\sigma}\sum_{k=1}^{n-2}k=\frac{(n-1)(n-2)}{\sigma}\;$

The overcount of results that have already been tallied as two substitutions can be calculated as follows when $\sigma=2$:

If there are no further changes in the shifted bits other than the one preceding the deletion, then only the bits next to the insertion and deletion change, and we can achieve that with 2 substitutions, so we have to subtract

$\sum_{k=3}^n\left(\frac12\right)^{k-2}(n+1-k)=\sum_{k=1}^{n-2}\left(\frac12\right)^{n-k-1}k=n-3+2^{-(n-2)}\;$
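The closed form of this sum is easy to confirm with exact rational arithmetic. The following is a small sketch; the function names are illustrative, and the `sigma` parameter is included only so the same weighted sum can be evaluated for other alphabet sizes.

```python
from fractions import Fraction


def two_substitution_overcount(n, sigma=2):
    """Sum over k = 3..n of (1/sigma)**(k-2) * (n + 1 - k), evaluated exactly."""
    return sum(Fraction(1, sigma) ** (k - 2) * (n + 1 - k) for k in range(3, n + 1))


def binary_closed_form(n):
    """Closed form n - 3 + 2**-(n-2) stated above for sigma = 2."""
    return n - 3 + Fraction(1, 2 ** (n - 2))


# The two expressions agree for every n checked.
for n in range(3, 12):
    assert two_substitution_overcount(n) == binary_closed_form(n)
```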

Again, our only modification is to substitute $\sigma$ for 2. The sum itself carries over directly, but its closed form changes because the geometric weights now have ratio $\frac1\sigma$:

$\sum_{k=3}^n\left(\frac1{\sigma}\right)^{k-2}(n+1-k)=\sum_{k=1}^{n-2}\left(\frac1{\sigma}\right)^{n-k-1}k=\frac{(n-2)\sigma-(n-1)+{\sigma}^{-(n-2)}}{(\sigma-1)^2}\;$

For $\sigma=2$ this reduces to $n-3+2^{-(n-2)}$, as above.

Also, if the entire range of shifted bits consists of alternating zeros and ones, then swapping the insertion and the deletion yields the same effect, so in this case we were double-counting and need to subtract

$\sum_{k=3}^n\left(\frac12\right)^{k-1}(n+1-k)\;$

Swapping in $\sigma$ a final time yields:

$\sum_{k=3}^n\left(\frac1{\sigma}\right)^{k-1}(n+1-k)\;$

These two overcounts (which, alas, cannot be combined as cleanly as when the symbols are binary) are then subtracted from the initial count of deletion/insertion operations to yield the overall results produced by this case, but not by case 3 above:

$\frac{(n-1)(n-2)}{\sigma} - \frac{(n-2)\sigma-(n-1)+{\sigma}^{-(n-2)}}{(\sigma-1)^2} - \sum_{k=3}^n\left(\frac1{\sigma}\right)^{k-1}(n+1-k)\;$

  6. one deletion, one insertion, one substitution
    That same calculation carries over to the final case. Here, however, each combination of one deletion and one insertion--likewise discounted to avoid double-counting the triple substitutions already tallied in case 4 above--is accompanied by a third edit: a substitution involving one of the $n-1$ original symbols remaining after the deletion. Since each of these $(n-1)$ symbols admits $(\sigma-1)$ novel substitutions, the total count for the sixth and final case becomes:

$\left(\frac{(n-1)(n-2)}{\sigma} - \frac{(n-2)\sigma-(n-1)+{\sigma}^{-(n-2)}}{(\sigma-1)^2} - \sum_{k=3}^n\left(\frac1{\sigma}\right)^{k-1}(n+1-k)\right)(n-1)(\sigma-1)\;$

Summing the (previously uncounted) results produced by each of these six cases should yield the expected count when the length of the string remains unchanged. It's ugly (perhaps unnecessarily), but I hope correct.
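The substitution-only counts in cases 1 to 4 do not depend on the string and can be confirmed directly: the number of strings differing from $S$ in exactly $j$ positions is $\binom{n}{j}(\sigma-1)^j$. The following is a minimal sketch; the function name `hamming_profile` is illustrative.

```python
from collections import Counter
from itertools import product
from math import comb


def hamming_profile(s, sigma):
    """Map j -> number of length-len(s) strings over {1..sigma}
    that differ from s in exactly j positions."""
    counts = Counter()
    for t in product(range(1, sigma + 1), repeat=len(s)):
        counts[sum(a != b for a, b in zip(s, t))] += 1
    return counts


# Compare with binom(n, j) * (sigma - 1)**j for an arbitrary example string.
s, sigma = (1, 2, 3, 1), 3
profile = hamming_profile(s, sigma)
for j in range(len(s) + 1):
    assert profile[j] == comb(len(s), j) * (sigma - 1) ** j
```

Only the cases that combine a deletion with an insertion depend on the structure of $S$, which is why the discounting above is needed there and not for the pure substitutions.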