Recall that the Thue–Morse sequence$^{[1]}$$\!^{[2]}$$\!^{[3]}$ is an infinite binary sequence that begins with $\,t_0 = 0,$ and whose each prefix $p_n$ of length $2^n$ is immediately followed by its bitwise complement (i.e. obtained by flipping $0\to1$ and $1\to0$): $$ \begin{array}{c|cc}&t_0&t_1&t_2&t_3&t_4&t_5&t_6&t_7&\!\!\!\dots\\\hline p_0&0\\ p_1&0&\color{red}1\\ p_2&0&1&\color{red}1&\color{red}0\\ p_3&0&1&1&0&\color{red}1&\color{red}0&\color{red}0&\color{red}1\\ \cdots&\cdots\!\! \end{array} $$ We are interested in contiguous substrings of these prefixes. For a string $\mathcal{S}$ of length $\ell$, the total number of its substrings, including the empty substring $\langle\unicode{x202f}\rangle$ and the string $\mathcal{S}$ itself, is $(\ell^2+\ell+2)/2.$ Hence, the total number of substrings in $p_n$ is $(4^n+2^n+2)/2.$ Clearly, not all of those substrings are distinct for $n>1$. For example, $p_2 = \langle0\,1\,1\,0\rangle$ has $11$ substrings in total, but only $9$ distinct substrings: $$ \begin{array}{l|cc}&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\\hline 1&\langle\!\!\!&&&&&\!\!\!\rangle\\\hdashline 2&\langle\!\!\!&0&&&&\!\!\!\rangle\\ &\langle\!\!\!&&&&\color{#c8c8c8}0&\!\!\!\rangle\\\hdashline 3&\langle\!\!\!&&\color{#808080}1&&&\!\!\!\rangle\\ &\langle\!\!\!&&&\color{#b8b8b8}1&&\!\!\!\rangle\\\hdashline 4&\langle\!\!\!&0&\color{#808080}1&&&\!\!\!\rangle\\ 5&\langle\!\!\!&&\color{#808080}1&\color{#b8b8b8}1&&\!\!\!\rangle\\ 6&\langle\!\!\!&&&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\ 7&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&&\!\!\!\rangle\\ 8&\langle\!\!\!&&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\ 9&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle \end{array} $$ Among these, $\langle0\rangle$ and $\langle1\rangle$ appear in $p_2$ twice, so the fraction of distinct substrings in $p_2$ is 
$\,\stackrel9{}\!\!\unicode{x2215}_{\!\unicode{x202f}11}\!.$
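
The counts for $p_2$ can be reproduced with a few lines of Python. This is only a minimal brute-force sketch; the helper names `thue_morse_prefix` and `substrings_with_multiplicity` are made up for illustration.

```python
from fractions import Fraction

def thue_morse_prefix(n):
    """Return p_n as a '0'/'1' string by repeatedly appending the complement."""
    flip = str.maketrans("01", "10")
    s = "0"
    for _ in range(n):
        s += s.translate(flip)
    return s

def substrings_with_multiplicity(s):
    """All contiguous substrings of s, one entry per occurrence, plus the empty one."""
    subs = [""]
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            subs.append(s[i:j])
    return subs

p2 = thue_morse_prefix(2)
subs = substrings_with_multiplicity(p2)
total = len(subs)             # (l^2 + l + 2) / 2 with l = 4
distinct = len(set(subs))     # duplicates collapse in the set
print(p2, total, distinct, Fraction(distinct, total))  # 0110 11 9 9/11
```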

Can we find a simple general formula for $\mathscr D_n$, the number of distinct substrings in $p_n$? Let's compute a few terms: $$2,\,4,\,9,\,28,\,101,\,393,\,1561,\,6233,\,24921,\,99673,\,398681,\,1594713,\,6378841,\,\dots$$ The first few terms can be computed by brute force, but using Coolwater's program from here we can compute hundreds of thousands more. It is not too difficult to discover that for $n>2$ all known terms match a simple formula: $$\mathscr D_n\stackrel{\color{#d0d0d0}?}=\frac{73\cdot4^n+704}{192}\color{#d0d0d0}{,\,\,\text{for}\,\,n>2}\tag{$\diamond$}$$ Somewhat oddly, the three initial terms $\mathscr D_0=2,\,\mathscr D_1=4,$ and $\mathscr D_2=9$ do not match the general formula $(\diamond)$, which gives non-integer rational values at these indices. I conjecture that $(\diamond)$ holds for all $n>2$.
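
The brute-force computation of the first terms can be sketched as follows. This is a naive check that only works for small $n$ (it stores every substring), not the faster program mentioned above; the function names are illustrative.

```python
from fractions import Fraction

def thue_morse_prefix(n):
    """Return p_n as a '0'/'1' string by repeatedly appending the complement."""
    flip = str.maketrans("01", "10")
    s = "0"
    for _ in range(n):
        s += s.translate(flip)
    return s

def distinct_substrings(s):
    """Number of distinct contiguous substrings of s, counting the empty one."""
    seen = {""}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            seen.add(s[i:j])
    return len(seen)

def conjectured(n):
    """Right-hand side of the conjectured formula, kept exact as a fraction."""
    return Fraction(73 * 4**n + 704, 192)

# Memory grows roughly like 8^n, so stay at small n.
for n in range(9):
    d = distinct_substrings(thue_morse_prefix(n))
    print(n, d, conjectured(n), d == conjectured(n))
```

Using `Fraction` keeps the comparison exact, so the mismatch at $n\le2$ shows up as a non-integer value rather than a rounding artifact.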

$$\bbox[LemonChiffon]{\begin{array}{c} \\ \hspace{1in}\text{Could you suggest a way to prove this conjecture?}\hspace{1in}\\ \vphantom. \end{array}}$$ If the conjecture turns out to be true, then we have a remarkable corollary that for $n\to\infty$ the fraction of distinct substrings in the prefixes $p_n$ tends to a quite surprising limit:

$$\mathscr L=\lim_{n\to\infty}\frac{73\cdot4^n+704}{192}{\large/}\frac{4^n+2^n+2}2=\frac{73}{96}.\tag{$\small\spadesuit$}$$
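
For reference, the value of the limit follows by dividing the numerator and the denominator by $4^n$ (this step assumes the conjectured formula $(\diamond)$):

$$\frac{73\cdot4^n+704}{192}\cdot\frac{2}{4^n+2^n+2}=\frac{73\cdot4^n+704}{96\,(4^n+2^n+2)}=\frac{73+704\cdot4^{-n}}{96\,(1+2^{-n}+2\cdot4^{-n})}\;\xrightarrow[n\to\infty]{}\;\frac{73}{96}.$$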

Vladimir Reshetnikov
  • In a curious way, I noticed that in Theorem 14, page 9, of [this document](https://www.mimuw.edu.pl/~rytter/MYPAPERS/thue.pdf) the same kind of constant arises... (but I am unable to say whether there is a connection with your question). – Jean Marie Apr 16 '20 at 05:34
  • Thanks, it certainly seems related. – Vladimir Reshetnikov Apr 16 '20 at 07:41
  • A related question: https://math.stackexchange.com/q/1821082/19661 – Vladimir Reshetnikov Apr 16 '20 at 17:44
  • As $n>2$, you might want to simplify: $\mathscr D_{n + 3} = \frac{73\cdot 4^n + 11}{3}$? – wazdra Apr 23 '20 at 12:06
  • isn't this just one plus the number in the paper linked by Jean Marie? it seems like the only difference between your count and the count in the paper is that you're including the empty string. am I missing something? – user125932 Apr 24 '20 at 18:04

2 Answers


It should be easy to derive the conjecture from the results of [1]. In particular, Brlek gives in Proposition 4.2 the precise value of the number $P(n,m)$ of factors of length $m$ of $p_n$ (up to the empty word, which is not included). But more interestingly, he gives a table of the small values of $P_n(m)$. Here is this table (I added the empty word in the first column): \begin{array}{c|cc} n \backslash m & 0& 1 & 2 & 3 & 4 & 5 &6 & 7 & 8 & 9 & 10 & 11 & 12 &13 &14 &15 &16 &17 &18 &19 &20 &21 \\ \hline 1&1&2&1\\ 2&1&2&\mathbf{3}&2&1\\ 3&1&2&4&\mathbf{6}&5&4&3&2&1\\ 4&1&2&4&6&10&\mathbf{12}&11&10&9&8&7&6&5&4&3&2&1\\ 5&1&2&4&6&10&12&16&20&22&\mathbf{24}&23&22&21&20&19&18&17&16&15&14&13&12& \dotsm\\ 6&1&2&4&6&10&12&16&20&22&24&28&32&36&40&42&44&46&\mathbf{48}&47&46&45&44& \dotsm \end{array}

As you can see, there are two types of coefficients in this table. Starting from the coefficients in bold, in position $(k, 2^{k-2} + 1)$ for $k \geqslant 2$ (that is $\mathbf{3}$, $\mathbf{6}$, $\mathbf{12}$, $\mathbf{24}$, $\mathbf{48}$, etc.), the coefficients decrease by $1$ along each row. Thus it is easy to sum these coefficients.

The other coefficients, apart from the first values of $m$, also follow a regular pattern. One has $P(n,m) = P(n-1,m)$ for $m \leqslant 2^{n-3}$. Then the coefficients between $P(n, 2^{n-3} + 1)$ and $P(n, 2^{n-3} + 2^{n-4} + 1)$ form an arithmetic progression with common difference $4$ (see $24, 28, 32, 36, 40$ in row 6), and the coefficients between $P(n, 2^{n-3} + 2^{n-4} + 1)$ and $P(n, 2^{n-2} + 1)$ form an arithmetic progression with common difference $2$ (see $40,42,44,46,48$ in row 6).

I am too lazy to carry out the complete computation but, with these observations in hand, it should not be too difficult to sum the coefficients in each row to get the value of $\mathscr D_n$.

[1] S. Brlek, Enumeration of factors in the Thue-Morse word, Discrete Applied Math. 24 (1989), 83-96.

J.-E. Pin

J.-E. Pin has described the following fact in detail according to Proposition 4.2 in Enumeration of factors in the Thue-Morse word by Srećko Brlek.

Formulas for $P(n,m)$. Let $P(n,m)$, also written $P_n(m)$ below, be the number of distinct substrings of length $m$ of $p_n$, $0\le m\le2^n$. We have $$\begin{align} &\begin{array}{c|cccccccc} P_n(m)& m=1 & m=2 & m=3 & m=4 & m=5 &m=6 &m=7 &m=8\\ \hline n=1&2&1\\ n=2&2&3&2&1\\ n=3&2&4&6&5&4&3&2&1\\ \end{array}\\ \text{If } n\ge4,\\ &P_n(m)=\begin{cases} P_{n-1}(m)\quad &\text{ for } m\le2^{n-3}+1,\\ 4(m-1)-2^{n-3}\quad &\text{ for } 2^{n-3}+1\le m\le 2^{n-3} + 2^{n-4}+1,\\ 2^{n-2}+2(m-1)\quad &\text{ for } 2^{n-3} + 2^{n-4}+1\le m\le 2^{n-2}+1,\\ 2^{n} -(m-1)\quad &\text{ for } 2^{n-2}+1\le m.\\ \end{cases} \end{align}$$ In addition, $P_n(0)=1$ for every $n$, since the empty substring is counted. Where two cases overlap at an endpoint, they give the same value.
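
The piecewise description above can be compared against a direct count for small $n$. The following is a minimal sketch, assuming the table and the four cases exactly as stated; `P_formula` and `P_brute` are names chosen here for illustration.

```python
def thue_morse_prefix(n):
    """Return p_n as a '0'/'1' string by repeatedly appending the complement."""
    flip = str.maketrans("01", "10")
    s = "0"
    for _ in range(n):
        s += s.translate(flip)
    return s

# Rows n = 1, 2, 3 of the table, indexed by m starting at m = 0 (empty word).
BASE = {1: [1, 2, 1], 2: [1, 2, 3, 2, 1], 3: [1, 2, 4, 6, 5, 4, 3, 2, 1]}

def P_formula(n, m):
    """P_n(m) from the table (n <= 3) or the four-case formula (n >= 4)."""
    if m == 0:
        return 1
    if n <= 3:
        return BASE[n][m]
    if m >= 2**(n - 2) + 1:
        return 2**n - (m - 1)
    if m >= 2**(n - 3) + 2**(n - 4) + 1:
        return 2**(n - 2) + 2 * (m - 1)
    if m >= 2**(n - 3) + 1:
        return 4 * (m - 1) - 2**(n - 3)
    return P_formula(n - 1, m)

def P_brute(n, m):
    """Count distinct length-m substrings of p_n directly."""
    s = thue_morse_prefix(n)
    return len({s[i:i + m] for i in range(len(s) - m + 1)})

for n in range(1, 7):
    row = [P_formula(n, m) for m in range(2**n + 1)]
    assert row == [P_brute(n, m) for m in range(2**n + 1)]
    print(n, sum(row))  # the row sum is D_n
```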

As defined in the question, $\mathscr D_{n} = \sum_{m=0}^{2^n}P(n,m)$.

Proposition: $\mathscr D_{n} = \dfrac{73\cdot 4^{n-3} + 11}{3}$ for $n\ge3$.
Proof: Let $\mathscr C_{n}=\sum_{m=0}^{2^{n-2}}P(n,m)$. Let us prove $\mathscr C_n=\dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6$ for $n\ge3$ by induction on $n$.

The base case, $\mathscr C_3=1+2+4=7$, can be verified directly.

Suppose the formula holds for some $n\ge3$. Then

$$\begin{align}\mathscr C_{n+1} &= \sum_{m=0}^{2^{n-2}}P(n+1,m)\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}P(n+1,m) \ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}}P(n+1,m) \\ &= \sum_{m=0}^{2^{n-2}}P(n,m)\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}\left(4(m-1)-2^{n-2}\right)\ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}} \left(2^{n-1}+2(m-1)\right) \\ &=\mathscr C_n+2^{n-3}(-2^{n-2}) +2^{n-3}\cdot2^{n-1}\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}4(m-1)\ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}} 2(m-1) \\ &= \mathscr C_n+2^{2n-5} +4\cdot2^{n-3}(2^{n-1}+2^{n-3}-1)/2+2\cdot2^{n-3}(2^{n-1}+2^{n-2}+2^{n-3}-1)/2\\ &= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6+19\cdot4^{n-3} -3\cdot2^{n-3}\\ &= \dfrac{38\cdot4^{n-2}-9\cdot2^{n-1}+22}6.\\ \end{align}$$

So we have proved the formula for $\mathscr C_n$. For the remaining lengths $m>2^{n-2}$ we have $P_n(m)=2^n-(m-1)$ (for $n=3$ this can be read off the table), hence $$\begin{align} \mathscr D_{n} &=\mathscr C_{n} +\sum_{m=2^{n-2}+1}^{2^{n}}P_{n}(m) \\ &= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6 + \sum_{m=2^{n-2}+1}^{2^n}\left(2^n-(m-1)\right)\\ &= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6 + (2^n-2^{n-2})(2^{n+1}-2^{n-2}-(2^n-1))/2\\ &= \frac{73\cdot 4^{n-3} + 11}{3}. \quad \blacksquare \end{align}$$
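
The algebra in the proof can be double-checked numerically. The sketch below does not recount substrings; it only evaluates the explicit sums from the induction step and from the tail and compares them with the closed forms, using exact fractions. The function names are illustrative.

```python
from fractions import Fraction

def C_closed(n):
    """Claimed closed form for C_n."""
    return Fraction(38 * 4**(n - 3) - 9 * 2**(n - 2) + 22, 6)

def D_closed(n):
    """Claimed closed form for D_n."""
    return Fraction(73 * 4**(n - 3) + 11, 3)

def C_increment(n):
    """C_{n+1} - C_n as the two explicit sums in the induction step."""
    a, b, c = 2**(n - 2), 2**(n - 2) + 2**(n - 3), 2**(n - 1)
    middle = sum(4 * (m - 1) - 2**(n - 2) for m in range(a + 1, b + 1))
    upper = sum(2**(n - 1) + 2 * (m - 1) for m in range(b + 1, c + 1))
    return middle + upper

def tail(n):
    """Sum of 2^n - (m - 1) over 2^(n-2) < m <= 2^n."""
    return sum(2**n - (m - 1) for m in range(2**(n - 2) + 1, 2**n + 1))

assert C_closed(3) == 7  # base case
for n in range(3, 15):
    assert C_closed(n + 1) - C_closed(n) == C_increment(n)
    assert C_closed(n) + tail(n) == D_closed(n)
    # Same value as the form used in the question.
    assert D_closed(n) == Fraction(73 * 4**n + 704, 192)
print("closed forms agree for n = 3..14")
```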

As user125932 points out in this comment, the formula for $\mathscr D_n$ appears in Theorem 14 of "On the structure of compacted subword graphs of Thue-Morse words and their applications" by Jakub Radoszewski and Wojciech Rytter.

Theorem 14. The number of different factors of $p_n$ for $n\ge4$ equals $\frac{73}{192} |p_n|^2 + \frac83$.

Here "factors" means non-empty substrings, whereas the empty string is counted in $\mathscr D_n$. Note that $|p_n|=2^n$ and $\frac{704}{192}=\frac83+1$.

The setup can be generalized. Given a string $w$ of $0$s and $1$s, define the sequence ${}_wP$ that begins with ${}_wp_0=w$ and in which ${}_wp_{n+1}$ is ${}_wp_{n}$ followed by its bitwise complement.

  • The Thue-Morse sequence $p_0, p_1, p_2,\cdots$ is just sequence ${}_{0}P$.
  • For example, sequence ${}_{00}P$ is $00, 00\underline{11}, 00\,\underline{11}\,\underline{1100}, \cdots$.
  • For another example, sequence ${}_{01011}P$ is $01011, 01011\,\underline{10100}, 01011\,\underline{10100}\,\underline{1010001011}, \cdots$.

Let ${}_w\mathscr D_n$ be the number of distinct substrings in ${}_wp_n$. The question and the answers above give the formula for ${}_0\mathscr D_n=\mathscr D_n$. It looks like we also have the following formulas. It might be interesting to prove them and generalize them further.

$$\begin{align} {}_{00}\mathscr D_{n}&=\frac{73\cdot4^{n-2}+11}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{000}\mathscr D_{n}&=219\cdot4^{n-3}+1\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{001}\mathscr D_{n}&=219\cdot4^{n-3}+9\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010}\mathscr D_{n}&=219\cdot4^{n-3}-23\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{0001}\mathscr D_{n}&=\frac{73\cdot4^{n-1}+41}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0100}\mathscr D_{n}&=\frac{73\cdot4^{n-1}+41}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0101}\mathscr D_{n}&=\frac{73\cdot4^{n-1}-13}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01000}\mathscr D_{n}&=\frac{1825\cdot4^{n-3}+59}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01011}\mathscr D_{n}&=\frac{1825\cdot4^{n-3}+59}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010001}\mathscr D_{n}&=219\cdot4^{n-2}+35\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0000001}\mathscr D_{n}&=\frac{3577\cdot4^{n-3}+107}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01010101}\mathscr D_{n}&=\frac{73\cdot4^{n}-157}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{011001111}\mathscr D_{n}&=1971\cdot4^{n-3}+81\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0010011100}\mathscr D_{n}&=\frac{1825\cdot4^{n-2}+323}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01011010000}\mathscr D_{n}&=\frac{8833\cdot4^{n-3}+371}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{011111100000}\mathscr D_{n}&=219\cdot4^{n-1}+27\color{#d0d0d0}{,\ \text{for}\,\,n\ge2}\\ {}_{0101010101010}\mathscr D_{n}&=\frac{12337\cdot4^{n-3}-2389}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{01010101010111}\mathscr D_{n}&=\frac{3577\cdot4^{n-2}+401}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010101000101111}\mathscr D_{n}&=5475\cdot4^{n-3}+231\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0000010000001111}\mathscr D_{n}&=\frac{73\cdot4^{n+1}+791}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010110011101010001}\mathscr 
D_{n}&=1971\cdot4^{n-2}+381\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0101010101010101010}\mathscr D_{n}&=\frac{26353\cdot4^{n-3}-5317}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{0101010101010101111}\mathscr D_{n}&=\frac{26353\cdot4^{n-3}+731}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{001001001001001001001}\mathscr D_{n}&=10731\cdot4^{n-3}-351\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0001011000101100010110001011}\mathscr D_{n}&=\frac{3577\cdot4^{n-1}-1021}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge2}\\ {}_{0101010101010101010101010101010101010101010101010}\mathscr D_{n}&=\frac{175273\cdot4^{n-3}-37237}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{000000000000000000000000000000000000000000000000000000001}\mathscr D_{n}&=79059\cdot4^{n-3}+2169\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ \end{align}$$
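
The conjectured formulas above can be spot-checked by brute force for small $n$. The following is a minimal sketch; `generalized_prefix` and `distinct_substrings` are illustrative names, and the comparison at the end only prints what it finds for one of the formulas rather than asserting it.

```python
from fractions import Fraction

def generalized_prefix(w, n):
    """Return {}_w p_n: start from w and append the bitwise complement n times."""
    flip = str.maketrans("01", "10")
    s = w
    for _ in range(n):
        s += s.translate(flip)
    return s

def distinct_substrings(s):
    """Number of distinct contiguous substrings of s, counting the empty one."""
    seen = {""}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            seen.add(s[i:j])
    return len(seen)

# The seed "0" gives back the Thue-Morse prefixes p_n.
assert generalized_prefix("0", 3) == "01101001"

# Compare the brute-force count for the seed "00" with the conjectured formula.
for n in range(3, 7):
    counted = distinct_substrings(generalized_prefix("00", n))
    conjectured = Fraction(73 * 4**(n - 2) + 11, 3)
    print(n, counted, conjectured, counted == conjectured)
```

Longer seeds make the strings grow quickly, so the naive set-based count is only practical for the first few $n$.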
