
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?

In what way are these helpful for programmers?

Hakan Fıstık
web dunia
  • very related: [UTF-8 vs Unicode](http://stackoverflow.com/questions/643694/utf-8-vs-unicode) – Tobias Kienzler Jul 24 '13 at 07:21
  • The best site to refer would be : http://msdn.microsoft.com/en-us/library/dd374081(VS.85).aspx – RamSri Sep 27 '10 at 21:37
  • http://www.tugay.biz/2016/07/what-is-ascii-and-unicode-and-character.html – Koray Tugay Jul 10 '16 at 18:08
  • [What is Unicode, UTF-8, UTF-16?](https://stackoverflow.com/q/2241348/995714), [What is the difference between UTF-8 and Unicode](https://stackoverflow.com/q/643694/995714) – phuclv Feb 24 '19 at 14:14

2 Answers


Going down your list:

  • "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
  • UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system", which in .NET Framework is obtained via Encoding.Default (on .NET Core and later, that property returns UTF-8 instead). It is often Windows-1252 but can be another code page.
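
To make the size differences in the list above concrete, here is a minimal Python sketch (Python is my choice for illustration, and the helper names `encoded_sizes` and `fits_in_ascii` are hypothetical, not from any library). It only uses the standard `str.encode` codecs and reports how many bytes one character occupies in each UTF encoding.

```python
def encoded_sizes(text):
    """Return {encoding name: byte count} for the given text.

    The explicit -le variants are used so that no byte order mark
    is added to the count.
    """
    encodings = ("utf-8", "utf-16-le", "utf-32-le", "utf-7")
    return {name: len(text.encode(name)) for name in encodings}


def fits_in_ascii(text):
    """True if every character is in the 7-bit ASCII range (U+0000..U+007F)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# An ASCII letter, an accented letter, the euro sign, and a non-BMP emoji.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch), "ascii:", fits_in_ascii(ch))

# A code point outside the BMP becomes a surrogate pair in UTF-16:
# two 16-bit code units, D83D followed by DE00.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```

The emoji takes 4 bytes in UTF-8, UTF-16 and UTF-32 alike, whereas the euro sign takes 3, 2 and 4 bytes respectively, which is the variable-width versus fixed-width trade-off described above.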

There's more on my Unicode page and tips for debugging Unicode problems.

The other big resource is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

Jon Skeet
  • I actually think of ANSI as [Code Page 437](http://en.wikipedia.org/wiki/Code_page_437Code), as that was what ANSI Art used. However, I don't think that is available in ASP.Net – Doug Moore Aug 06 '12 at 19:29
  • The term "ANSI" when applied to Microsoft's 8-bit code pages is a misnomer. They were based on drafts submitted for ANSI standardization, but ANSI itself never standardized them. Windows-1252 (the code page most commonly referred to as "ANSI") is similar to ISO 8859-1 (Latin-1), except that Windows-1252 has printable characters in the range 0x80..0x9F, where ISO 8859-1 has control characters in that range. Unicode also has control characters in that range. https://en.wikipedia.org/wiki/Windows_code_page – Keith Thompson Jun 15 '15 at 23:59
  • @JonSkeet, I have some web pages that send email messages. Currently they use UTF8. Should I be thinking about changing them back to UTF7? – jp2code Oct 01 '15 at 13:14
  • @jp2code: I wouldn't - but you need to distinguish between "content that is sent back via HTTP from the web server" and "content that is sent via email". It's not the web page content that sends the email - it's the app behind it, presumably. The web content would be best in UTF-8; the mail content *could* be in UTF-7, although I suspect that it's fine to keep that in UTF-8 these days. – Jon Skeet Oct 01 '15 at 13:39
  • As the question no longer mentions ASP.NET anywhere (after edits done quite some time ago), I refactored the answer to be similarly platform-agnostic. In particular, the comments above re: UTF-16 != Unicode no longer make a lot of sense. – tripleee Oct 30 '15 at 08:48
  • UTF-7 is mandated by e.g. IMAP as a protocol-level encoding for some things, but there is no reason to use it where you get to choose the encoding yourself. In email, more and more systems just use `charset="utf-8"` in the email body, possibly with `Content-Transfer-Encoding: quoted-printable` or even `base64` to ensure that the encoded email is 7-bit clean. In limited systems where you know everything is 8-bit clean, there is no need for that, of course. – tripleee Dec 11 '15 at 10:21
  • For UTF-16, IMHO, I would say "2 bytes per code unit" since a code point outside the BMP will be encoded in surrogate pairs as 2 code units (4 bytes). – Ludovic Kuty Dec 14 '15 at 14:04
  • Misses the differences between UTF-16LE (within .NET) and BE as well as the notion of the BOM. – Maarten Bodewes Apr 20 '16 at 15:47
  • The U in UTF stands for Unicode. UTF stands for Unicode Transformation Format, so all UTF is some type (encoding) of unicode. – Nick Sotiros Aug 25 '16 at 14:50
  • Is there any difference between an ASCII and a WP-1252 encoded file if only ASCII chars are present? Once extended chars are introduced into the file that can't be displayed in ASCII, is a BOM added to the file to clearly identify it as WP-1252, or is just the MSB of extended chars relied on for identification? – Andrew Jan 03 '18 at 21:37
  • @Andrew: No, there's no (general) encoding marker. Windows 1252 can't represent the Unicode BOM, and it wouldn't make sense as it's only a one-byte-per-char encoding anyway. – Jon Skeet Jan 03 '18 at 21:39
  • @JonSkeet : I think it is time to correct the comment that UTF-16 characters outside the BMP are "relatively rarely used" ... Thanks to the #Emojiplosion of recent years, we all need to get savvy how to deal with "multi-word" UTF-16! – MrWatson Sep 06 '19 at 09:15
  • @MrWatson: Yup, will do. – Jon Skeet Sep 06 '19 at 09:15
  • @JonSkeet - you get +100 from me for the enlightening comment about ANSI ... This term has been confounding me for YEARS and YEARS and multiple searches in the internet have not enlightened me - till now! – MrWatson Sep 06 '19 at 09:17
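
Several comments above raise byte order marks, UTF-16LE versus BE, and the 0x80..0x9F range that separates Windows-1252 from ISO 8859-1. The following is a rough Python sketch of those points, not a definitive treatment; the helper name `starts_with_utf16_bom` is hypothetical, while the codec names and `codecs.BOM_*` constants are from the Python standard library.

```python
import codecs


def starts_with_utf16_bom(data):
    """True if the bytes begin with a UTF-16 byte order mark (either order)."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


text = "hi"

# The generic "utf-16" codec prepends a BOM in the platform's byte order;
# the explicit -le / -be codecs write the code units only, with no BOM.
print(starts_with_utf16_bom(text.encode("utf-16")))     # True
print(text.encode("utf-16-le").hex())                   # 68006900
print(text.encode("utf-16-be").hex())                   # 00680069

# "utf-8-sig" writes the UTF-8 BOM (EF BB BF); plain "utf-8" does not.
print(text.encode("utf-8-sig").hex())                   # efbbbf6869

# Byte 0x80 is the euro sign in Windows-1252 but a control character
# (U+0080) in ISO 8859-1.
print(b"\x80".decode("cp1252") == "\u20ac")             # True
print(b"\x80".decode("latin-1") == "\u0080")            # True

# Text limited to ASCII produces identical bytes in ASCII and Windows-1252,
# and neither encoding adds any marker to the output.
print("plain".encode("ascii") == "plain".encode("cp1252"))  # True
```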

Some reading to get you started on character encodings: Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

By the way - ASP.NET has nothing to do with it. Encodings are universal.

Tomalak
  • Answered here 6 years after the article was written. I read it 8 years after the post was written. 14 years later and it's still a good read. That's more than half my life ago. Incredible. – Dave Knise Aug 01 '17 at 23:42
  • Another similar useful resource: https://www.youtube.com/watch?v=MijmeoH9LT4 – vulcan raven Dec 15 '20 at 08:14