
I know that strings are stored in Unicode format (UTF-16). I have also heard that strings are ALWAYS little-endian, even if the system is big-endian. My question is this:

Are strings represented in big-endian Unicode IF the system is also big-endian?

By the way, I'm asking because I want a small performance speedup when writing to a file that needs to be little-endian Unicode.

ABCD Man
  • Could you please clarify what you want to know? – tbhaxor Sep 06 '20 at 18:22
  • My question is: are strings stored internally as big-endian Unicode IF the SYSTEM is also big-endian? – ABCD Man Sep 06 '20 at 18:22
  • It should be the same. What makes you concerned about this? – Daniel A. White Sep 06 '20 at 18:25
  • This sounds like an [XY problem](http://xyproblem.info/): why are you interested in how .NET stores the string (in memory, I assume)? Unless you are planning to play with raw memory and unsafe code, this is not something you should worry about. Or is it just plain (good) curiosity? – Gian Paolo Sep 06 '20 at 18:27
  • @GianPaolo I've updated my question. – ABCD Man Sep 06 '20 at 18:29
  • I don't know what byte order .NET uses internally for storing strings. Anyway, in your case, I don't think you need to know it to speed up your code unless you measure and find a real and meaningful performance problem. You would eventually have to use unsafe code, and in any case I doubt you will be able to beat the performance of .NET's [UnicodeEncoding](https://docs.microsoft.com/en-US/dotnet/api/system.text.unicodeencoding) class (see the sketch after these comments). – Gian Paolo Sep 06 '20 at 18:52
  • Most text files applying the Unicode standard use the UTF-8 encoding. In UTF-8, endianness is not relevant. Endianness matters for UTF-16 and UTF-32. Which one are you using? – Codo Sep 06 '20 at 19:08
  • If you want to achieve a performance speedup by matching the encoding, you must have an in-depth understanding of .NET strings and you must have already put hours into optimizing everything else. If not, you gain much more by optimizing other things. – Codo Sep 06 '20 at 19:13
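
Picking up Gian Paolo's UnicodeEncoding suggestion: a minimal sketch of writing a file as little-endian UTF-16 regardless of the machine's byte order. The file name, class name, and sample text are placeholders.

```csharp
using System.IO;
using System.Text;

class WriteUtf16LeFile
{
    static void Main()
    {
        // new UnicodeEncoding(bigEndian: false, byteOrderMark: true)
        // is UTF-16LE with a BOM; Encoding.Unicode is equivalent.
        var utf16le = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);

        // StreamWriter performs the char -> byte conversion, so the
        // string's internal byte order never enters the picture.
        using (var writer = new StreamWriter("output.txt", append: false, encoding: utf16le))
        {
            writer.WriteLine("Hello, little-endian file");
        }
    }
}
```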

1 Answer


The CLI specification says:

III.1.1.3 Character data type

A CLI char type occupies 2 bytes in memory and represents a Unicode code unit using UTF-16 encoding.

There is no requirement that it be stored in a particular byte order, and there are good reasons to expect the byte order to match that of the other numeric types on the current architecture. I.e. on a big-endian machine, one would expect the char type to be stored as big-endian 16-bit values.
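
As a quick illustration of that expectation (a minimal sketch; the class name and sample values are mine), BitConverter reports the machine's byte order and exposes the raw bytes of a char and of a ushort holding the same numeric value, so the two layouts can be compared directly:

```csharp
using System;

class CharByteOrderExample
{
    static void Main()
    {
        Console.WriteLine($"IsLittleEndian: {BitConverter.IsLittleEndian}");

        // 'A' is U+0041; as a ushort that's 0x0041.
        byte[] charBytes = BitConverter.GetBytes('A');
        byte[] ushortBytes = BitConverter.GetBytes((ushort)0x0041);

        // On a little-endian machine both print "41-00";
        // the expectation is "00-41" on a big-endian one.
        Console.WriteLine(BitConverter.ToString(charBytes));
        Console.WriteLine(BitConverter.ToString(ushortBytes));
    }
}
```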

While it's not an authoritative source, I'll note that several people who answered or commented on How do I get a consistent byte representation of strings in C# without manually specifying an encoding? share this belief: that the endianness of the char type follows the platform architecture, i.e. that char is big-endian on big-endian systems.

It seems to me that if the endianness of your architecture is important, you would have access to a CLI implementation for a big-endian architecture and would be able to easily verify for yourself the byte order used for the char type. Have you made any effort to do such a verification?
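
One way to run that check, sketched under the assumption of a runtime where System.Runtime.InteropServices.MemoryMarshal is available (.NET Core / modern .NET; the class name and sample string are mine): reinterpret a string's chars as raw bytes and print them.

```csharp
using System;
using System.Runtime.InteropServices;

class VerifyStringBytes
{
    static void Main()
    {
        string s = "AB"; // U+0041, U+0042

        // View the string's chars as raw bytes, without copying.
        ReadOnlySpan<byte> raw = MemoryMarshal.AsBytes(s.AsSpan());

        // Little-endian prints 41-00-42-00; big-endian would be 00-41-00-42.
        Console.WriteLine(BitConverter.ToString(raw.ToArray()));
    }
}
```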

All that said, it is very likely that you do not need to know the byte ordering of the char type. .NET provides character encoders for a wide variety of encodings, including both UTF-16LE and UTF-16BE. When you are working with the char type itself, the byte ordering is irrelevant, and in situations where it does matter, you can force a specific ordering by using the appropriate Encoding type. If you believe you have a situation that is an exception to these general guidelines, it would be better to post a question describing exactly what that situation is and why you believe it's an exception.
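
For example (a minimal sketch; the class name and sample string are mine), the built-in Encoding.Unicode and Encoding.BigEndianUnicode properties fix the output byte order no matter what the runtime uses internally:

```csharp
using System;
using System.Text;

class EncodingByteOrderExample
{
    static void Main()
    {
        string s = "A"; // U+0041

        // Encoding.Unicode is UTF-16LE; Encoding.BigEndianUnicode is UTF-16BE.
        byte[] le = Encoding.Unicode.GetBytes(s);
        byte[] be = Encoding.BigEndianUnicode.GetBytes(s);

        Console.WriteLine(BitConverter.ToString(le)); // 41-00 on any platform
        Console.WriteLine(BitConverter.ToString(be)); // 00-41 on any platform
    }
}
```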

Peter Duniho