
I hope this is not a silly question at this time of night, but I can't seem to wrap my mind around it.

UTF-8 is a variable-length encoding with a minimum of 8 bits per character. Characters with higher code points can take up to 32 bits.

So UTF-8 can encode Unicode characters using anywhere from 1 to 4 bytes.

Does this mean that, in a single UTF-8 encoded string, one character may be 1 byte and another character may be 3 bytes?

If so, how does a computer, when decoding from UTF-8, avoid treating those two separate characters as one 4-byte character?

X33

1 Answer


If the data is held in memory as UTF-8 then, yes, it is a variable-width encoding.

However, the encoding allows a parser to know whether the byte it is looking at is the start of a code point or a continuation byte.
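
For example (a minimal Python sketch; the string "aé€" is just an arbitrary mix of a 1-byte, a 2-byte and a 3-byte character), you can see those markers by printing each encoded byte in binary:

    # Encode a string that mixes 1-, 2- and 3-byte characters,
    # then show the bit pattern of every resulting byte.
    text = "aé€"   # 'a' = U+0061, 'é' = U+00E9, '€' = U+20AC
    for byte in text.encode("utf-8"):
        print(f"{byte:08b}")

The lead byte of each character starts with 0, 110 or 1110, while every continuation byte starts with 10, so a decoder always knows whether it is at the start of a character or in the middle of one.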

From the Wikipedia page for UTF-8:

Bytes  Bits  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
  1      7   U+000000          U+00007F         0xxxxxxx
  2     11   U+000080          U+0007FF         110xxxxx  10xxxxxx
  3     16   U+000800          U+00FFFF         1110xxxx  10xxxxxx  10xxxxxx
  4     21   U+010000          U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
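
As a rough sketch of how a decoder uses that table (this is illustrative Python, not a full decoder: it does no validation of overlong, truncated or otherwise malformed sequences), the lead byte alone tells you how many bytes belong to the current character:

    def sequence_length(lead: int) -> int:
        """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
        if lead < 0b10000000:   # 0xxxxxxx: 1-byte (ASCII) character
            return 1
        if lead < 0b11000000:   # 10xxxxxx: continuation byte, never a start
            raise ValueError("continuation byte, not the start of a character")
        if lead < 0b11100000:   # 110xxxxx: 2-byte sequence
            return 2
        if lead < 0b11110000:   # 1110xxxx: 3-byte sequence
            return 3
        if lead < 0b11111000:   # 11110xxx: 4-byte sequence
            return 4
        raise ValueError("invalid UTF-8 lead byte")

    def split_characters(data: bytes) -> list:
        """Split UTF-8 bytes into one bytes object per encoded character."""
        chars, i = [], 0
        while i < len(data):
            n = sequence_length(data[i])
            chars.append(data[i:i + n])
            i += n
        return chars

    print(split_characters("aé€".encode("utf-8")))
    # [b'a', b'\xc3\xa9', b'\xe2\x82\xac']

Because none of the lead-byte prefixes overlap with the 10xxxxxx continuation prefix, a 1-byte character followed by a 3-byte character can never be misread as a single 4-byte character, which is exactly the situation the question asks about.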
Phylogenesis