2265

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes to play here.

Also, why should encoding even be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

iliketocode
Agnel Kurian
  • 2
    Your confusion over the role of encoding makes me wonder if this is the right question. Why are you trying to convert a string to a byte array? What are you going to do with the byte array? – Greg D Jan 23 '09 at 13:56
  • I'm going to encrypt it. I can encrypt it without converting but I'd still like to know why encoding comes to play here. Just give me the bytes is what I say. – Agnel Kurian Jan 23 '09 at 13:57
  • 5
    If you're encrypting it, then you'll still have to know what the encoding is after you decrypt it so that you know how to reinterpret those bytes back into a string. – Greg D Jan 23 '09 at 14:00
  • 26
    Every string is stored as an array of bytes right? Why can't I simply have those bytes? – Agnel Kurian Jan 23 '09 at 14:05
  • 3
    I think Anthony is trying to address the fundamental disconnect in <300 chars. You're assuming some consistent internal representation of a string, when in fact that representation could be anything. To create, and eventually decode, the bytestream, you must choose an encoding to use. – Greg D Jan 23 '09 at 16:38
  • 2
    "A string is an array of chars, where a char is not a byte in the .Net world" Alright, but regardless of the encoding, each character maps to one or more bytes. Can I have those bytes please without having to specify an encoding? – Agnel Kurian Mar 04 '09 at 05:51
  • 142
    The encoding *is* what maps the characters to the bytes. For example, in ASCII, the letter 'A' maps to the number 65. In a different encoding, it might not be the same. The high-level approach to strings taken in the .NET framework makes this largely irrelevant, though (except in this case). – Lucas Jones Apr 13 '09 at 14:13
  • 2
    You can take the easy route and just use UTF-8 on both sides. – Lucas Jones Apr 13 '09 at 14:14
  • 6
    In case of .NET, the easy route is using UTF-16 on both sides, since that's what .NET uses internally. – Alexey Romanov Jul 22 '09 at 11:30
  • 21
    To play devil's advocate: If you wanted to get the bytes of an in-memory string (as .NET uses them) and manipulate them somehow (e.g. CRC32), and NEVER EVER wanted to decode it back into the original string...it isn't straightforward why you'd care about encodings or how you choose which one to use. – Greg Dec 01 '09 at 19:47
  • 84
    Surprised no-one has given this link yet: http://www.joelonsoftware.com/articles/Unicode.html – Bevan Jun 29 '10 at 02:57
  • 1
    @Bevan: dated January 23 2009, you come late to the party ;-) http://stackoverflow.com/questions/472906/net-string-to-byte-array-c/472986#472986 – Michael Buen Jul 09 '10 at 00:08
  • possible duplicate of [How do you convert a string to a byte array in .Net](http://stackoverflow.com/questions/241405/how-do-you-convert-a-string-to-a-byte-array-in-net) – adamjcooper Jul 06 '13 at 11:47
  • 8
    @AgnelKurian, A `char` is a `struct` that *just happens* to **currently** store values as a 16-bit number (UTF-16). What you're really asking (get the character bytes) isn't theoretically possible because it doesn't theoretically exist. A `char` or `string` has no Encoding by definition. What if the memory representation changed to UTF-32? Your "get the bytes, shove them back" would fail **due** to Encoding *because you avoided Encoding*. So "Why this dependency on encoding?!!!" **Depend on Encoding so your code is dependable.** – Travis Watson Aug 05 '13 at 22:04
  • 2
    Have a look at Jon Skeet's [answer](http://stackoverflow.com/questions/241405/how-do-you-convert-a-string-to-a-byte-array-in-net#241466) in a post with the [exact question](http://stackoverflow.com/questions/241405/how-do-you-convert-a-string-to-a-byte-array-in-net). It will explain why you depend on encoding. – Igal Tabachnik Jan 23 '09 at 14:15
  • 31
    A char is not a byte and a byte is not a char. A char is both a key into a font table and a lexical tradition. A string is a sequence of chars. (Words, paragraphs, sentences, and titles also have their own lexical traditions that justify their own type definitions -- but I digress). Like integers, floating point numbers, and everything else, chars are encoded into bytes. There was a time when the encoding was a simple one-to-one mapping: ASCII. However, to accommodate all of human symbology, the 256 permutations of a byte were insufficient and encodings were devised to selectively use more bytes. – George Aug 28 '14 at 15:43
  • @usr: you just invalidated almost all the answers with your edit, and also made it harder for people to find this question with their natural search query (but you probably did that intentionally). – user541686 Nov 03 '14 at 21:37
  • @Mehrdad the existing answers were already invalid (not what was asked). Yours is pretty much the only answer that actually answers just what was asked. (I recommend, though, that you edit your answer to include a few warnings that this approach is really almost never the best one.) – usr Nov 03 '14 at 21:50
  • 7
    Four years later, I stand by my original comment on this question. It's fundamentally flawed because the fact that we're talking about a string _implies interpretation_. The encoding of that string is an implicit part of the serialized contract, otherwise it's just a bunch of meaningless bits. If you want meaningless bits, why generate them from a string at all? Just write a bunch of 0's and be done with it. – Greg D Dec 12 '14 at 22:44
  • @Greg D, Let's say my client has some floating point numbers in some exotic format used to store astronomical distances. He uses just that one format. He wants me to take care of writing and reading those numbers. I am not interpreting them. My client interprets the numbers and all he needs to give me are the bytes I need to write. When reading, all he needs from me are the bytes I have written. Storing a format flag each time in addition to the bytes is a waste of space when he is using just one format for all numbers. – Agnel Kurian Dec 13 '14 at 03:36
  • 3
    @Agnel Kurian: If you're writing arbitrary binary data, write binary data. That has nothing to do with the original question (which is fundamentally about serializing a string). – Greg D Dec 15 '14 at 18:28
  • @GregD so you want to store the same encoding 1000 times for 1000 different strings? – Agnel Kurian Dec 17 '14 at 02:42
  • 6
    @AgnelKurian: Are you trolling me? That question doesn't make sense. I could infer that you meant something like, "...store information about the encoding that was used 1000 times for 1000 different string." Nobody ever said anything about doing that, though, and it was explicitly denied earlier when I stated "The encoding of that string is an _implicit_ part of the serialized contract..." so you couldn't have meant that. – Greg D Dec 17 '14 at 21:23
  • 1
    @AgnelKurian "He wants me to take care of writing and reading those numbers. I am not interpreting them." - If you weren't interpreting them, you'd have bytes and not "numbers". Then, your question disappears. If you have "numbers", that means you've already interpreted/decoded them and threw away the original byte data. And now you want to try and reconstruct the data (encode) which might not be even possible. What it the numbers were actually base-10 and by cramming them into base-2 floats, you've destroyed them forever? Don't want to encode? Don't decode then. Want bytes? Then use bytes. – Ark-kun Apr 20 '17 at 08:36
  • 1
    Are you assuming that `System.Text.Encoding.Unicode.GetBytes(); ` is doing some kind of expensive conversion that you want to avoid? If so, your assumption is wrong. – Kris Vandermotten Apr 28 '17 at 13:59
  • 3
    Your first comment (quote): _Every string is stored as an array of bytes right? Why can't I simply have those bytes?_ No, every string is (more or less) stored as an array of 16-bit ___code units___ which correspond to UTF-16. There will be surrogate pairs in there if your string contains Unicode characters outside plane 0. You can get that representation easily: `var array1 = yourString.ToCharArray();` If for some reason you want the code units as `UInt16` values, do `var array2 = Array.ConvertAll(array1, x => x);`. That is a `ushort[]` there. – Jeppe Stig Nielsen Jul 24 '17 at 09:36
  • Encoding is **necessary** because the size - in bytes - of the represented characters depends on it, and not only because sizeof(char) is different for e.g. ASCII (1 byte) and WideString (2 bytes), but because it can even _vary_ - in the case of UTF-8 a character is represented as _1 to 4 bytes_ – mg30rg Dec 05 '17 at 16:23
  • 3
    Not worrying about encoding is one thing. Not wanting to specify an encoding is another thing entirely. If what brings you frustration is what encoding you should use, just pick one and use it every time for conversions from string to byte array and from byte array to string. For instance, always use Unicode, or UTF-8. Your choice. After you have chosen an Encoding, you need not worry any more and your problem is solved. But if your frustration comes from the need to specify an encoding then you better get used to it, because whether you like it or not, an encoding is taking place. – Thanasis Ioannidis Jun 27 '18 at 11:16
  • 3
    You should always worry about what encoding your string is represented with in the byte array. The assumption that the string is represented in-memory with a byte array is arbitrary. It happens to be like that in the present implementation of .net. No one can guarantee you it won't change to a linked-list implementation in the future (or any other exotic data structure). Even if you use the same system and the same program to read back the encrypted data, there is always a chance a future patch of .net will break everything apart because you didn't explicitly specify what Encoding you work in – Thanasis Ioannidis Jun 27 '18 at 11:21
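
A minimal sketch of the point several of these comments make (the sample string and the encodings chosen are arbitrary): the encoding is what maps characters to bytes, so the same string yields different byte arrays under different encodings, and both sides must agree on the encoding to get the string back.

string s = "Aé"; // 'A' is U+0041, 'é' is U+00E9

byte[] utf8  = System.Text.Encoding.UTF8.GetBytes(s);    // 41 C3 A9                (3 bytes)
byte[] utf16 = System.Text.Encoding.Unicode.GetBytes(s); // 41 00 E9 00             (4 bytes, little-endian UTF-16)
byte[] utf32 = System.Text.Encoding.UTF32.GetBytes(s);   // 41 00 00 00 E9 00 00 00 (8 bytes)

// Round-tripping only works when the same encoding is used in both directions.
string back = System.Text.Encoding.UTF8.GetString(utf8); // "Aé"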

40 Answers

1902

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

// Copies the string's raw UTF-16 code units into a byte array (two bytes per char); no encoding conversion is performed
static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
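
A short usage sketch of that difference (the lone surrogate below is just an arbitrary example; Encoding.Unicode's default replacement fallback turns it into U+FFFD, so the encoding-based round trip changes the string, while the byte-copy pair above reproduces it exactly):

string original = "start \uD800 end"; // contains an unpaired high surrogate

string viaBlockCopy = GetString(GetBytes(original));
System.Console.WriteLine(original == viaBlockCopy); // True: reconstructed exactly

string viaEncoding = System.Text.Encoding.Unicode.GetString(
    System.Text.Encoding.Unicode.GetBytes(original));
System.Console.WriteLine(original == viaEncoding);  // False: the lone surrogate was replaced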

Robert Harvey
user541686
  • 5
    +1 Exactly my thoughts, I don't understand the insistence of some peeps here on encoding. Just need to do a memory dump / serialization (the default serialization library from Microsoft has flaws though). I wish I had known about this BlockCopy API before :-) – Michael Buen Apr 30 '12 at 09:11
  • 3
    @MichaelBuen: Yup. As long as your memory dumps/serializations do *not* try to interpret the data, it's all fine. The rule of thumb to remember is this: If your program (or a different program) needs to convert the output of `GetBytes` back to the same string, it may *only* use `GetString` to do this. As long as you don't violate that, you can ignore the concept of encodings entirely. – user541686 Apr 30 '12 at 09:20
  • @Mehrdad I agree with your logic, but I was surprised when I tested it that the encoding method is slightly faster. I guess I was expecting your method to be faster (there isn't much in it though) – Ian1971 May 11 '12 at 11:16
  • 1
    @Ian1971: Might it be because `ToCharArray()` allocates a new array, which gets subsequently discarded? – user541686 May 11 '12 at 13:29
  • 1
    @Ian1971 The encoding method has its pitfalls though: it can't preserve an exact image of the original string; in particular, high surrogate characters can't be preserved with the encoding method. Check this test: http://stackoverflow.com/a/10384024 – Michael Buen May 13 '12 at 11:06
  • 259
    What's ugly about this one is that `GetString` and `GetBytes` need to be executed on a system with the same endianness to work. So you can't use this to get bytes you want to turn into a string elsewhere. So I have a hard time coming up with situations where I'd want to use this. – CodesInChaos May 13 '12 at 11:14
  • 1
    @CodeInChaos just prefix a BOM before those bytes to indicate it came from .NET world(i.e. UTF-16) then http://mindprod.com/jgloss/utf.html – Michael Buen May 13 '12 at 11:25
  • 72
    @CodeInChaos: Like I said, the whole point of this is if you want to use it on the same kind of system, with the same set of functions. If not, then you shouldn't use it. – user541686 May 13 '12 at 18:00
  • 205
    -1 I guarantee that someone (who doesn't understand bytes vs characters) is going to want to convert their string into a byte array, they will google it and read this answer, and they will do the wrong thing, because in almost all cases, the encoding *IS* relevant. – artbristol Jun 15 '12 at 11:07
  • 415
    @artbristol: If they can't be bothered to read the answer (or the other answers...), then I'm sorry, then there's no better way for me to communicate with them. I generally opt for answering the OP rather than trying to guess what others might do with my answer -- the OP has the right to know, and just because someone might abuse a knife doesn't mean we need to hide all knives in the world for ourselves. Though if you disagree that's fine too. – user541686 Jun 15 '12 at 14:04
  • 12
    The question was asked 3 years ago, and is totally ambiguous. You have no evidence of how OP was going to use the bytes. Other people will have the *exact same question*, but will be planning to use the bytes in a situation where encoding matters, and your answer will be dead wrong in that case. – artbristol Jun 15 '12 at 14:25
  • 36
    Well, the way I think about it is: I'm not a judge. I don't ask for "evidence" from the OP to try to prove his case before I answer him (contrary to what others might try to do). He *clearly* said, "Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?", to which my answer is 100% accurate, more than the others on this page IMO. And IMO he's understood the caveats by now. Also, the fact that the answer was from 3 years ago is irrelevant. But again, if you'd rather ask for "evidence" first, then that's your style, and feel free to keep the downvote.. – user541686 Jun 15 '12 at 14:32
  • 194
    This answer is wrong on so many levels but foremost because of its declaration "you DON'T need to worry about encoding!". The 2 methods, GetBytes and GetString, are superfluous inasmuch as they are merely re-implementations of what Encoding.Unicode.GetBytes() and Encoding.Unicode.GetString() already do. The statement "As long as your program (or other programs) don't try to interpret the bytes" is also fundamentally flawed as implicitly they mean the bytes should be interpreted as Unicode. – David Jul 11 '12 at 12:36
  • 12
    @David: *"...as implicitly they mean the bytes should be interpreted"* I have no idea how you read the answer, but it "implicitly" means that they could be any encoding whatsoever. Also, if you think the methods are "merely reimplementations" of `Encoding.Unicode` just because they do the same thing, then it seems like you're not understanding the abstraction layers correctly. – user541686 Jul 11 '12 at 15:04
  • 6
    @Mehrdad _...it "implicitly" means that they could be any encoding whatsoever"_ I don't understand this statement, what exactly do you mean by this? As far as I can see, _your_ `GetBytes()` method will return a Unicode encoded byte array of a string and _your_ `GetString()` method will (if you pass a Unicode encoded byte array representation of a string) return a readable string and in any other encoding return garbage. Worse than that though `GetString()` will crash if you pass it a UTF-8 encoded byte array of a string that contains an odd number of characters. – David Jul 11 '12 at 15:42
  • 24
    @David: Yes, it crashes on UTF-8 data, because `GetBytes` never happens to return UTF-8 data. It seems like the abstraction layer you're expecting is different from the one that's actually there. If you're not sure how to use it correctly then don't; the answer probably wasn't intended for your use case. However, I 100% stand by my answer that it is correct *for the use it was intended*, which I tried to make perfectly clear. – user541686 Jul 11 '12 at 16:08
  • 13
    @Mehrdad: Then we've come full circle. `GetBytes` and `GetString` are re-implementations of `Encoding.Unicode.GetBytes()\GetString()`. You are reframing your argument to sidestep your initial assertion of _"any encoding whatsoever"_. I'm not disputing that the code you've provided the OP won't work (for Unicode at least); I just don't think it furthers his understanding of Encoding, which he IS using however you try and hide it. – David Jul 11 '12 at 16:33
  • 18
    @David: Sigh, yes, they *happen* to be reimplementations, **but that is irrelevant at this abstraction level**. If you're even *caring* about that fact then *you're using it wrong*. If you don't know what I mean then please don't use it, but it's 100% valid for the OP's use case/abstraction level. – user541686 Jul 11 '12 at 16:36
  • 13
    I just need bytes for my crypto to work, I think your answer still rocks! – k.c. Oct 29 '12 at 10:27
  • 38
    -1 for the answer. +1 for David's and artbristol's comments above. Of course there is an in-memory representation of strings in .NET. It happens to be little endian UTF-16. When you get the byte array, you are getting them in *that* encoding. If all you *ever* want to do is convert from the byte array back to a string, the answer will suffice. But the answer is limited and dangerous. For example, if the bytes are to be included in an HTTP request, you need to know the encoding for the overall request. If you are in the business of converting characters to bytes, you *must* understand encoding. – Concrete Gannet Feb 25 '13 at 01:05
  • 14
    -1 for the answer, +1 for David's, artbristol's and Concrete's comments... This answer does NOT mention in any way that it only works if you execute both methods on the same platform. Furthermore it adds no value. The answer's argument is to provide a simple answer to a simple question but the answer is way more complicated than simply using the `Encoding.Unicode`. You don't need to worry about encoding either if you simply use those methods, but they are safe no matter what platform you run them on. – chiccodoro Jul 17 '13 at 07:54
  • 22
    @ConcreteGannet: I'm glad we both agree that *"If all you ever want to do is convert from the byte array back to a string, the answer will suffice."* That was the entire point of my answer. – user541686 Aug 01 '13 at 08:15
  • 16
    @chiccodoro: Safety isn't the only concern here. On your (hypothetical?) system where UTF-16 isn't the internal representation, `Encoding.Unicode` would be slower, with no benefit for the use cases this was intended for (which the OP has understood). Furthermore **safety is only a problem if you don't know what you're doing**. You don't see C programmers avoiding pointers, despite how "dangerous" they are, do you? You also don't see construction workers avoiding electric saws and drills. Just because you think something is dangerous doesn't mean people don't have a right to know about it. – user541686 Aug 01 '13 at 08:26
  • 2
    @Mehrdad: After some questioning, the OP says they intend to encrypt the string. In all likelihood, the next step after converting to a byte array would be some form of output. Whether your answer is correct or not depends on what is reading those encrypted bytes. The OP did *not* say a .NET application will read the encrypted bytes. If anything else is to read it, the OP should ensure the encoding is as expected by the reader. If the string is large and contains only or mostly plain ASCII, UTF-8 would be more compact, quicker to encrypt and quicker to output. – Concrete Gannet Aug 04 '13 at 08:30
  • 13
    Asking for the bytes of a `string` in .NET is akin to asking for the bytes of an `object`. The purpose of the `string` and `char` types is that the implementation details are abstracted. By using this answer, you are haphazardly circumventing implementation details and will be left with a fragile solution similar to binary serialization. There is no reason to use this answer, since using encoding is more robust, more portable, more logical, and most importantly *easier*. Seriously, the Encoding answers are one-liners... why do something crazy like this?! – Travis Watson Aug 05 '13 at 21:36
  • 4
    @Travis: Except it's **not** the same thing as asking for the bytes of an `object`: .NET specifically prevents you from doing that, but doesn't prevent you from doing this. That by itself should be enough to tell you there's a difference. – user541686 Aug 05 '13 at 21:59
  • 8
    @Mehrdad, digressing, you need to realize that technically possible does not equal pragmatically relevant nor architecturally sound. Back on topic, whether or not you realize it, you're effectively performing `System.Text.Encoding.Unicode.GetBytes(str)` because that's what .NET is doing to represent the `string` in memory. People are saying you don't understand Encoding because *they know* you can't avoid it. The *only* thing you're doing is jumping through hoops **to hide it!** Do you honestly still think this is a good idea? – Travis Watson Aug 05 '13 at 22:40
  • 2
    @Mehrdad, on a second read I noticed you ignored the entirety of my comment. The one part you did respond to, you misinterpreted (akin != same). I'm really starting to question why you're vehemently promoting this obviously flawed answer. – Travis Watson Aug 05 '13 at 22:59
  • 10
    @Travis: I read your entire comment, but the entire basis of it was wrong (your claim that it's akin to reading the bytes of an `object`). There's **nothing** similar between the two. What I'm telling you is that this code is meant for **a different abstraction level than you think**. Saying *"it's just like `Encoding.Unicode.GetBytes`"* is wrong since it ***breaks* that abstraction barrier**. I don't know what else to tell you. My answer already served its purpose, which was to directly answer the OP's question. If you don't like my answer then downvote it; that's what it's for! – user541686 Aug 06 '13 at 00:29
  • 4
    @Travis: The last thing I will tell you (because I just noticed it right now) is to read [**this answer below**](http://stackoverflow.com/a/10384024/541686). I had already mentioned this before, but since that answer actually demonstrates it, I'll say it again: my answer saves & restores the string *perfectly*; encoding-based methods fail to work on `char` sequences that can't be represented correctly. – user541686 Aug 06 '13 at 00:41
  • 1
    Isn't it that, because .NET uses UTF-16 internally and 16 bit characters therefore, a string in this example is in fact encoded using UTF-16? If you use Encoding.Unicode.GetString(), which is UTF-16, on the byte-array created in this example, it produces the original string value. – Kai Hartmann Aug 09 '13 at 22:31
  • 6
    Yes, this answer works for niche use cases. But other answers work for all use cases. Why not use the superior (and just as easy to use... and less error prone) techniques that require you to type in an encoding? Giving this a big fat -1 because of that. – Thomas Eding Sep 27 '13 at 18:16
  • 7
    @Thomas: No, the other answers don't work for all use cases. Did you read Michael Buen's answer? His answer tells you why mine can handle cases that none of the other answers can. *None* of the answers here handles all cases, but mine handles the relevant cases to the OP. – user541686 Sep 27 '13 at 22:44
  • @Mehrdad: Fair enough. But I still don't like this solution. Not sure completely about his (I don't feel like learning about unpaired surrogates at the moment, but it seems like it is something along the lines of a trap representation). – Thomas Eding Sep 27 '13 at 22:56
  • 8
    @Thomas: I don't really care if you "like" the solution (heck, I don't particularly either), but you can't deny that it's the only correct answer given here for the OP's use case (conversion between strings and byte arrays). The other answers destroy some `char` sequences in the process, mine doesn't. Keep your downvote, but please think twice before hopping on the bandwagon and spreading misinformation. – user541686 Sep 27 '13 at 23:01
  • 4
    @Mehrdad: `How do I convert a string to a byte array in .NET (C#)?` is the OP's described use case. Literally any answer that returns a `byte[]` would be technically correct. But I'm done with this extended chat. – Thomas Eding Sep 27 '13 at 23:03
  • 10
    *"Worrying about encodings just makes your life more complicated for no real reason."* - Er, except that the answers that *do* worry about encoding are much simpler than this one. And of course, this answer still **does** rely on a particular encoding - `str.ToCharArray()` must rely on an encoding, even if that encoding is not explicitly mentioned in the code *(which can only be considered bad)*. I respect you a lot, Mehrdad, but this is a terrible answer. – BlueRaja - Danny Pflughoeft Oct 01 '13 at 19:16
  • 1
    @BlueRaja-DannyPflughoeft: Read my comments above. The abstraction layer we concern ourselves with here (i.e. the need for perfect 1:1 reconstruction on a given system) isn't the same as when you worry about encodings (i.e. interoperability with another system). They're two completely unrelated concerns and *the former has nothing to do with encoding* (and in fact ***cannot*** be done with any encoding scheme here). – user541686 Oct 01 '13 at 19:47
  • 3
    This does not keep the encoding intact. Too bad this is the accepted answer with the highest votes because I just wasted 2 hours trying to find out why my strings get garbled. Chased it down to a method that used this answer to convert string -> byte[]. – user1151923 Oct 09 '13 at 12:26
  • @user1151923: Can you show me an example of an actual string that gets garbled? I can't fix the answer if you don't tell me how to reproduce the problem... – user541686 Oct 09 '13 at 19:29
  • 1
    `var input = "тхис ис а тест"; var ms = new MemoryStream(GetBytes(input)); var sr = new StreamReader(ms); var output = sr.ReadToEnd();` output is B5AB – user1151923 Oct 10 '13 at 13:43
  • 1
    I would add that I don't think "For those goals" is a justification to an answer that [sometimes] messes up encoding. What people are gonna see when they open this question is the question (.NET String to byte Array C#) and a highly rated answer claiming you don't need to worry about encoding in bold text (which by the way is missing the "for those goals" part). There are answers below that are shorter or just as long and that keep the encoding intact regardless of where and how you use the code. – user1151923 Oct 10 '13 at 13:49
  • 15
    @user1151923: Dude, the problem is with **your** code, not my answer! You're using `GetBytes` to convert a string to bytes, but you're not using `GetString` to go the reverse direction! These are supposed to be used **in pairs**; you can't just do whatever you feel like and expect it to work. If you don't use encodings one way you **also** have to ignore them in the reverse direction, but you ignored the fact that `StreamReader` is encoding-based! Read my comment earlier: http://stackoverflow.com/questions/472906/net-string-to-byte-array-c-sharp/10380166?noredirect=1#comment13383434_10380166 – user541686 Oct 10 '13 at 18:11
  • 4
    @user1151923: And before you blame me for not warning you, realize that what has happened in your code is *exactly* equivalent to using `new StreamReader(stream).ReadToEnd()` to go in one direction, but using `Encoding.UTF8` to go in the other direction. It's wrong because the writer was careless, and it has nothing to do with the answer that might have told you to use `UTF8`. If the fact that `StreamReader` uses UTF-16 by default is confusing, don't blame it on my answer; it's not my fault it was designed that way. – user541686 Oct 10 '13 at 18:26
  • 10
    @Mehrdad Just because your answer is technically correct in this case doesn't make it a good answer for the reasons stated by many before me. It's like recommending the `goto` statement when better alternatives are available because "well it works in this case doesn't it?". This site is meant for answers that will function properly for as many usecases as possible within the scope of the question. You announcing "YOU DONT NEED ENCODING" in a big size at the top of your answer while leaving the main caveat as a little side note at the bottom could lead to problems. – Leon Lucardie Oct 23 '13 at 10:48
  • 18
    @LeonLucardie: The other alternatives aren't "better"; *in fact, they're **worse*** because they break on strings that can't be encoded correctly (such as those that contain unpaired surrogates). I've mentioned this a million times now, but apparently it's very convenient for people to ignore this fact... – user541686 Oct 23 '13 at 10:51
  • 7
    @Mehrdad Even in a perfect world where people would act professional, they won't take the time to do a little research. I'm *almost* positive that all pros and cons of this solution have been addressed in the comments here, as well as in the other answers. If there are still those who won't realize this fact and feel that continuing to argue (even 1.25 years later) over points that have already been addressed then it's not worth your time nor anyone else's to argue further. There are answers here that apply to both 'need-encoding' and 'don't-need-encoding' use cases; it's as simple as that. – Chris Cirefice Nov 04 '13 at 04:37
  • 2
    +1, but wouldn't `str.SelectMany(BitConverter.GetBytes).ToArray();` suffice? (yes, I suspect `BlockCopy` is faster.) – Jodrell Apr 07 '14 at 08:43
  • 3
    @Jodrell: You just answered yourself. And plus, it requires .NET 3.5 which should not be necessary. – user541686 Apr 07 '14 at 09:48
  • 4
    this is one of the worst pieces of code I've seen. And I saw people using DataTables in .NET 4! Neither the questioner nor the person who posted this answer seems to understand what encoding actually means. Of course you are using encoding with this answer...but you don't know which encoding! Even if you are converting stuff on the same machine, who tells you that the user won't change his encoding, rendering the bytes unreadable?! – Steffen Winkler May 30 '14 at 13:24
  • 5
    @SteffenWinkler: Yes, the answer does use *an* encoding but the point is that it doesn't care what. The reason is that it is guaranteed to be using the same encoding both ways. I'm not sure how you think that a user can change the encoding because this is the encoding that .NET uses to store strings. I don't believe a user could change it. If the runtime was changed then you'd be restarting the program so again both methods would be using the same encoding still. – Chris Jun 11 '14 at 09:15
  • This won't compile for me; the first line in the GetBytes() method fails with, "C:\Project\sscs\Handheld\Releases\6-4-0\HHS\PrintUtils.cs(752): sizeof can only be used in an unsafe context (consider using System.Runtime.InteropServices.Marshal.SizeOf)" – B. Clay Shannon Jun 11 '14 at 18:38
  • @B.ClayShannon: Are you on an old version of .NET? Just replace `sizeof(char)` with `2`. – user541686 Jun 11 '14 at 18:39
  • @Mehrdad: Yes, older than Rip Van Winkle. Just to use it, I have to use XP Mode and VS 2003; and yep, that allowed it to compile. – B. Clay Shannon Jun 11 '14 at 18:50
  • 23
    **-1** It's scary how this is **the accepted and highest voted answer**. Yes, it may be useful to get the bytes of a string in the way it's stored on the memory. Yes, it may not matter the fact that it fails if `GetString` and `GetBytes` are called on machines with different endianness. But saying "you DON'T need to worry about encoding!" is so terrifyingly **evil**, since you encourage people to ignore [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know](http://www.joelonsoftware.com/articles/Unicode.html). @artbristol is right: The encoding _IS_ relevant. – Şafak Gür Jul 23 '14 at 07:44
  • 6
    @ŞafakGür: Yes I do -- I *do* encourage people to ignore things that are *irrelevant* to the problem. What's *really* "evil" is teaching people to worry about the wrong thing. I believe the encoding is **irrelevant** to the question because the encoding is *on an **entirely different** abstraction level*. You obviously don't think so, so keep your downvote, and thanks for sharing your thoughts. – user541686 Jul 23 '14 at 07:50
  • 7
    Don't get me wrong, simplicity is good. But the OP asked a very general question. Is he going to convert to and from the string on the same architecture? Is he going to write the bytes to a file and expect it to be viewed using a specific text editor? He didn't state any of these. So anyone who come to this question may read "You certainly do NOT need to worry about encodings" and think that encoding is not relevant nor needed, _in any case_. So if you said "Use this if you'll decode the bytes on the same machine and you don't need a specific encoding", this could be a great answer. – Şafak Gür Jul 23 '14 at 08:10
  • 3
    @ŞafakGür: You have to understand, the approach you want me to put into my answer is *outright wrong* because it's not bijective between the strings and byte arrays -- e.g., it doesn't preserve unpaired surrogates. I've said this a million times now. If it at least *worked correctly*, I would have considered it. But it doesn't -- it breaks on any string that doesn't happen to fit a Unicode encoding. That's why I insist so much on avoiding encodings altogether: they're not only unnecessary, they're *blatantly wrong* and don't work on arbitrary strings. – user541686 Jul 23 '14 at 08:23
  • 3
    @artbristol: Well that's a new one. In all of your comments so far you never even once placed my understanding of the problem under question, and here you are now accusing me of being a brick wall who doesn't understand what String means in C#. For the record, I'm neither a brick wall nor a C programmer, which would have been fairly obvious if you looked at my badges in C vs. C# before pretending you knew me so well. – user541686 Jul 23 '14 at 09:54
  • this solution is not correct at all. The fatal error is that when bytes.Length is odd, the length of chars is not enough to copy into, which raises an ArgumentException saying "Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection". We'd better use @bmotmans answer. – tandztc Sep 02 '14 at 09:28
  • 3
    @tandztc: No, I think you're the one who's not using it correctly. How do you get an odd `bytes.Length` in the first place? If you followed the answer correctly (which implies you're using `GetBytes` to get the `bytes`) then that event is impossible. If you got the byte array some other way then you have to convert it back to a string the same way instead of using this answer. – user541686 Sep 03 '14 at 00:37
  • @Mehrdad: Oh, sorry. I recognised that these two methods must be used in pairs. I misunderstood the usage because I'm searching for a solution to just convert a byte array to a string, so I left a comment because the GetString method is not capable of handling all byte arrays. Sorry for bothering you again -:) – tandztc Sep 03 '14 at 04:37
  • 13
    -1 for the misleading declaration "you DON'T need to worry about encoding". This completely disregards the fact that algorithms mainly convert a string to a byte buffer because some stream operation expects it. And when this serialization occurs, the encoding does matter whether we serialize to a file or to a wire. Industry is throwing 1000s of workhours yearly because of encoding mismatch issues, the last thing we need is evangelizing "we do not need to worry about encoding..." – g.pickardou Oct 10 '14 at 14:11
  • @Mehrdad: Would there be any objection to starting with an even-length `Byte[]` and converting that to a string which can later be converted back to `Byte[]`? I would think byte conversion would allow various "linear" operations (e.g. concatenating two strings produced by the conversion would be equivalent to converting the concatenation of the two arrays), while most other encodings would not. The only disadvantage I see with "straight" conversion is that the lexiographic ordering of the `String` object would differ from that of the `byte[]` [fixing that would require byte-swapping pairs]. – supercat Nov 12 '14 at 22:26
  • @supercat: If you can guarantee it's even-length then no, but otherwise you'd lose the length information. – user541686 Nov 12 '14 at 22:48
  • 3
    @Mehrdad: Perhaps it would be good to make clear that your method is appropriate for the predictable serialization of `String` instances which might hold arbitrary binary data, as opposed to those which are known to hold valid UTF-16 strings. It's really too bad MS didn't include any other "immutable blob" types, since `String` gets used oftentimes when some other standard blob type would probably be more appropriate *if any existed*. – supercat Nov 12 '14 at 22:54
  • @Mehrdad: Also, do you know of any nice way to convert a byte array to a string, with the bytes paired *MSB-first*, and preferably interpreting an odd-length array as though it was zero-padded? Using `String.CompareOrdinal` on strings produced by converting the `KeyData` from a `SortKey` in such fashion will be faster than `SortKey.Compare`, but producing such strings is a little slow. – supercat Nov 13 '14 at 18:00
  • 1
    I think you make assumptions on how strings are stored in the CLR. How do you know it is actually represented by a contiguous sequence of bytes? It might be represented as a linked list, or something else. Don't make assumptions. It will bite you in the umptions. – Erik A. Brandstadmoen Nov 26 '14 at 11:40
  • 6
    @ErikA.Brandstadmoen: Two things: (1) If it was anything other than a contiguous sequence of bytes then you couldn't obtain a pointer to the data in constant time via `fixed (char* p = str) { ... }` (2) The reality is that this fact is actually 100% irrelevant because `ToCharArray` always returns a char array regardless of the underlying data format, which is all we need and care about. – user541686 Nov 26 '14 at 11:53
  • 1
    Of course, you are right, @Mehrdad. I read your answer too quickly. I thought you were charpointer-ing yourself into the string itself, which would of course just work if it is indeed represented by a contiguous byte array in memory. But, if you call `ToCharArray`, the implementation of string storage is irrelevant, of course (except for efficiency...). – Erik A. Brandstadmoen Nov 27 '14 at 21:26
  • 15
    What makes this answer so horrible is the presumption that the OP just wants to "get the bytes" for some ephemeral operation, and then follows up with comments harping on the fact that using an encoding will destroy the invalid string by removing the unpaired surrogates. This begs the question, **why is the data represented or stored as a string in the first place**? A string is designed to represent text, not some broken or illegal sequence of characters. (continued ...) – F.Buster Dec 17 '14 at 23:40
  • 15
    Of course this roundabout pair of methods is *technically* correct because it satisfies some imaginary specifications for the OP's overwhelmingly under-specified use-case, but there are certainly more correct solutions for what the OP is *actually* trying to accomplish. Since we may never know what that may be, this answer is not only incorrect, but actively harmful as both an answer to this question and also in general. – F.Buster Dec 17 '14 at 23:40
  • 6
    @F.Buster: *A string is designed to represent text, not some broken or illegal sequence of characters.*... you're jumping to conclusions. Just because the string might not be valid UTF-16 doesn't mean it's "broken" or "not text". It just means you can't assume the encoding is UTF-16, so the answer needs to be independent of whatever encoding the string may happen to be using. And it is. If you don't like the question then I'm sorry, but this **is** the correct answer to the question. – user541686 Dec 18 '14 at 00:25
  • 5
    @Mehrdad: _so the answer needs to be independent of whatever encoding the string may happen to be using_ <= This is conflating _representation_ with _abstraction_. A string, as just a string, is _already_ independent of whatever encoding the implementation is using under the covers. The very act of _any_ transcription of the string "Hello world" to some byte sequence is utilizing an encoding _by definition_. The only thing accomplished by plugging one's ears, shouting "LA LA LA!", and reinterpreting a block of memory as bytes is _hiding_ the encoding that happened to be used. – Greg D Dec 18 '14 at 01:47
  • I edited this answer by removing the repetitive defensiveness over its correctness. I also moved the *technical* explanations of why this is correct together at the start of the answer. I also changed the emphasis around a bit. I think this goes a long way to solving the flamewar over the answer. – Aardvark Jan 26 '15 at 17:06
  • 3
    Aardvark, your edit wasn't bad, but I didn't really see the point of it (and I noticed a little bit of a grammar/capitalization typo), so I rolled it back... I do think the original was good enough, and it's how I wanted to phrase things, and I'd rather it not be edited. I think the discussion did have a benefit and should stay, because (1) it helped the readers realize that this answer can be controversial in a shared codebase, and (2) it allowed me to emphasize why I think the answer is the correct approach. Anyway, the discussion has rather ended already, so don't worry about it. – user541686 Jan 26 '15 at 18:21
  • 1
    This is precisely what I was looking for. I needed something that could send and receive events for observer patterns for a tech demo that just uses a simple Console App, and the event messages were being sent and received as byte arrays, so i figured one of the better ways to show the functionality was just to make the message a regular old string. This wouldn't be too helpful for most stuff, but it was exactly what I needed! Thanks a ton :) – kayleeFrye_onDeck Mar 16 '15 at 01:22
  • 7
    BEST ANSWER for my specific issue, Thank you ! Used this to track conversion glitches between encodings for diagnostic purposes, on the same machine, the same application, without network connexions. Just because most of us are afraid one will use this to serialize datas and use them across platforms/databases **is NOT a valid reason** to set this answer on fire. Used this specifically to avoid disastrous encoding results. That's why I like SO so much : you can get here answers for very specific and unusual tasks. For beginners about safe string-bytes conversion, go re-read MSDN. – Karl Stephen Mar 23 '15 at 08:50
  • 2
    I use this solution for converting password strings to `byte[]` before salting and hashing them. In this use case, I absolutely do not care about encoding _at all_. I don't even need to convert the resulting hash back to a string - for password validation I just directly compare the resulting hash `byte[]`s. Very elegant and low-overhead solution for this particular use case. The flame war here is a fun read, though. – chris Apr 10 '15 at 17:41
  • I can see this code crashing in such a simple case: `sizeof(char) == 2` `byteArray.Length == 9` Then, `(byteArray.Length / sizeof(char)) == 4`, The call to `BlockCopy` throws an Exception because you are going out of bounds. I would rather use up a bit more space and go for the easy solution of using Base64 Encoding from the `System.Convert` class. – Josep Apr 15 '15 at 13:54
  • 1
    @Josep How on earth would the length of the `byte[]` ever be an odd number when the `GetString` method is used the way it is intended here? Also keep in mind that this is just example code. In my not so uncommon use case that I described in my comment before (password hashing), converting back from `byte[]` to `string` isn't even necessary. – chris Apr 15 '15 at 18:51
  • 2
    @Josep: I'm glad your code crashes, because it's trying to tell you that you're using it wrong. Instead of trying to get around it, realize that this answer was only meant to solve a particular problem, which is different from yours, and hence you shouldn't be using it. – user541686 Apr 15 '15 at 18:54
  • 1
    This answer is wrong for when strings are not stored as UTF-16 or any fixed length encoding. Which means that Encoding *does* matter, even if it doesn't show up in the code. Because for UTF-8, you will introduce empty 'bytes'. This also assumes that a string's storage and GetBytes will return the same encoding -- If not, then you aren't returning the "String's bytes". Fortunately the OP just wants bytes, which this answer provides. – Gerard ONeill Aug 18 '15 at 16:24
  • 3
    Just use Encoding.Unicode.GetBytes(). The function posted in this answer is 2x slower than Unicode.GetBytes(). Tested in Release & x64 environment. – Vincent Aug 27 '15 at 06:21
  • If you don't know why encoding is important, you'd better hope you never have to deal with IBM's EBCDIC, whose characters don't match up to standard ASCII. – Powerlord Sep 19 '15 at 21:49
  • About that endianness remark. The only platform that is running .NET and is not little-endian is the Xbox360 and the XNA track (which was the main method of getting your .NET software on the Xbox360) has been discontinued. There are some variants of mono that do run on big-endian platforms but this is the exception rather than the rule. – John Leidegren Oct 25 '15 at 10:05
  • 1
    @JohnLeidegren Not true! Microsoft is porting the .Net Framework to Linux, and Linux runs on some big-endian architectures. See [here](http://docs.asp.net/en/latest/getting-started/installing-on-linux.html) for an example. – camerondm9 Nov 04 '15 at 04:16
  • 1
    @camerondm9 I'm not disputing the fact that these platforms exist but you have to consider that the CoreCLR does not JIT anything but X64 assembler (which is little-endian). To my knowledge Microsoft is currently not in the process of adding support for any other architecture, certainly not IBM PowerPC, simply because there is no market for it. I'm not saying it can't happen; I'm saying it's not happening any time soon. Disregarding everything I've said so far, you still have to ask yourself whether it is likely that your code will be running on a big-endian architecture in the near future? – John Leidegren Nov 04 '15 at 08:20
  • @JohnLeidegren Microsoft has a JIT engine for ARM, and ARM is bi-endian (implementation defined). It may not be highly likely, but if your code may run on mobile devices (or it's a library), you never know... – camerondm9 Nov 05 '15 at 03:13
  • 1
    @NumLock: Self-documentation. `sizeof(char)` isn't 1, it's 2. This is C#, not C. – user541686 Dec 01 '15 at 20:23
  • 5
    The reason this answer is wrong is that it is IMPOSSIBLE to map a sequence of glyphs to a sequence of bytes without encoding. It is, however, also true that this example works without direct use of Encoding objects. That's because it is secretly asserting the canonical encoding scheme for Strings -- UTF-16, I believe -- is correct for all decoding implementations. This is true for .NET but it wouldn't be for other languages or runtimes. It's important that users KNOW what they're doing here is exporting an (already encoded) internal representation, and not really avoiding encoding. – Matthew Mark Miller Jan 26 '16 at 18:48
  • 5
    >>don't try to interpret the bytes somehow<< Just viewing the bytes is a form of interpretation – AMissico Feb 10 '16 at 21:50
  • This code will do what was intended, but beyond that the theoretical arguments are mostly garbage, albeit in such a way that it wont matter (be observed) in practice. Don't forget that language and compiler are also abstractions (which seldom make hard guarantees about physical memory). The statement that the char-array _is_ the internal representation is reaching, as is citing pointer-code as proof. A string can be _observed_ as a char-array, and manipulating char-pointers can be _observed_ as you say, but could trivially be implemented as syntactic sugar for another physical representation. – AnorZaken Apr 10 '16 at 03:27
  • 1
    @chris **"low overhead"** this is more code and slower than Encoding.Unicode.GetString/Bytes; **"... for password validation I just directly compare the resulting hash"** this will fail if you compile this code for PC and Xbox360 to use the same password validation, since the hash will be different for the same passwort – Firo May 09 '16 at 09:57
  • @BlueRaja-DannyPflughoeft ToCharArray() does not rely on encoding, it is in the .Net source just a copy of the internal representation of the internal bytes of the string, thus getting the char array using ToCharArray() has the same effect as fixing a pointer to the private member m_firstChar of string – yoel halb May 26 '16 at 16:36
  • 1
    @yoelhalb: You cannot convert a string to a byte array without relying on a particular encoding, literally by definition. In this case, you're using the encoding used by "the internal representation [..] of the string". – BlueRaja - Danny Pflughoeft May 26 '16 at 16:44
  • @yoelhalb: yes it does. Of course it does. Not only because of what Danny said, but also because the API doc specifically says: "Copies the characters in this instance to a Unicode character array." The internal representation happens to *be* Unicode (UTF-16), but that's an irrelevant implementation detail. – Sören Kuklau Jun 08 '16 at 17:01
  • 1
    @BlueRaja-DannyPflughoeft: It just hit me that you (and a lot of others here) are having a grammar issue. Notice I wrote "you don't need to worry about encoding" and yoel said it "does not rely on encoding". There was no article preceding "encoding"! yoel did **NOT** say it doesn't rely on "*an* encoding". We only said *you do not have to worry about encoding anything to extract the bytes*. You seem to think we're claiming the string somehow doesn't already possess an encoding, which is obviously nuts, and not what we're saying. We're just saying encoding (as a **verb**) shouldn't occur here. – user541686 Nov 08 '16 at 11:48
  • 3
    No, there is no confusion. Your answer correctly states that you don't need to worry about encoding if you don't plan to interpret the string, but there are exactly 0 cases where this could possibly be useful. Even your own suggestion _("reconstruct the string")_ relies on the internal encoding of the string not changing. Meanwhile beginners see this answer and falsely believe they don't have to worry about what encodings are. This answer is worse than wrong, because it's technically correct but extremely detrimental. – BlueRaja - Danny Pflughoeft Nov 08 '16 at 16:28
  • 2
    @BlueRaja-DannyPflughoeft: *"here are exactly 0 cases where this could possibly be useful."* I already explained that this works even when the string is not valid UTF-16, so it's useful to people in that case. If you don't personally find it useful you don't have to use it. – user541686 Nov 08 '16 at 19:02
  • @Mehrdad I've been waffling back and forth on the argument and I can see good points on both sides. But in the end, I wonder why would you ever need to convert a string with invalid characters, assuming that's the only reason you would use this solution. Shouldn't we desperately try to avoid converting bytes to a string that technically isn't valid? Shouldn't data like that be forced to remain as a byte array to avoid giving anyone the impression that there's valid character data there? – BlueMonkMN Jan 17 '17 at 20:27
  • @BlueMonkMN: I think your mistake is that this is ***not*** a method for converting bytes to strings and back to bytes. It is a method for converting strings to bytes and back to strings. *There is a very crucial difference here.* If you're asking why the user even *has* a string with invalid characters, or why `string` even allows that, then that's an entirely different question, and not something I can or will attempt to answer here. I'm just trying to provide an answer that doesn't depend on the string's encoding (if any). – user541686 Jan 17 '17 at 21:10
  • @Mehrdad That's my point: I don't know how you could end up with a string containing invalid characters without converting it from an array of bytes. Every other "proper" means of generating strings that I can think of would not allow this because it would go through an encoding or be generated by a process that cannot return an invalid character. So my expectation is that one can always assume that .NET strings contain only valid characters, unless they use code like that which you have provided. – BlueMonkMN Jan 17 '17 at 21:18
  • 3
    @BlueMonkMN: *"That's my point: I don't know how you could end up with a string containing invalid characters without converting it from an array of bytes."* ...well here's one: `"\uD800" + "\uDC00"` both of these strings are invalid but their concatenation is valid. Maybe you want to convert each one to bytes, transmit them, and convert them back and then concatenate. Maybe they were generated by similarly splitting up a valid string. There's a million ways you could end up with invalid strings... – user541686 Jan 17 '17 at 21:26
  • 2
    OP doesn't state _why_ he wants to "simply get the bytes", but my guess is that he assumes that `System.Text.Encoding.Unicode.GetBytes(); ` is doing some kind of expensive conversion that he wants to avoid. Unfortunately, what you propose here is _less_ efficient due to the double copy. Also, endianness is important. OP wants to encrypt the string. It's likely that he doesn't do that to keep the encrypted string in memory. It will be written to disk or transferred across the network. What if it is to be decrypted on a machine with different endianness, now or in the future? – Kris Vandermotten Apr 28 '17 at 13:57
  • 1
    @KrisVandermotten: *"What if it is to be decrypted on a machine with different endianness, now or in the future?"* ...... sigh. How much of this answer & follow-up discussion did you read before posting your comment?? *Literally the 2nd-most-upvoted comment* -- which is the **top comment before you expand the comments** -- said the *exact* thing about endianness as you just did, and *literally the 5th-most-upvoted comment* -- which is the *second* comment before expansion -- was my reply to it... and they were from 5 years ago!! – user541686 Apr 29 '17 at 08:48
  • 1
    @Mehrdad My second point is that your comment that "the whole point of this is if you want to use it on the same kind of system, with the same set of functions" is meaningless. Encryption is only useful if you do IO, write the encrypted stream to somewhere else, to be read back in a different place or time. You did not address that. More importantly it is trumped by my first point: why would anyone want to use your function? It is less efficient than the built-in one. – Kris Vandermotten Apr 29 '17 at 09:09
  • 1
    And finally, if the use case was different, and OP really wanted to get to the bytes, then going unsafe and casting the `char*` to a `byte*` would be the most direct answer, not copying the string twice. – Kris Vandermotten Apr 29 '17 at 09:11
  • 1
    @KrisVandermotten: (a) [**This *is* a thing**](https://en.wikipedia.org/wiki/Data_at_rest#Encryption), (b) Unsafe code requires extra runtime privileges you might not have, (c) If someone writes or uses unsafe code wrong it'll silently corrupt memory instead of crashing, (d) Nowhere did I claim this is the fastest answer, (e) Nowhere did the OP claim he *wants* the fastest answer either, (f) Someone *already* posted an answer using unsafe code so go upvote it instead of arguing with me, (g) I'm just answering the question; if you don't like the OP's use case, go argue with him. – user541686 Apr 29 '17 at 10:28
  • I've changed my mind about this. Something I didn't see on original viewing -- Characters are a fixed size in c# -- this is really just an array copy. Creating the array might require interpretation; loading the array back to a string might also. But the array itself is recreated without interpretation because the Chars are the same size, which allows recreation of the original char array. That's all this depends on. – Gerard ONeill Nov 01 '17 at 19:18
  • Are you aware of the fact that `length * sizeof(char)` won't give you the size of the text in bytes? There are encodings like UTF-8 where the size of a character can **vary**. In case of UTF-8 it can be anything from 1 byte to four. – mg30rg Dec 05 '17 at 16:19
  • 1
    @chris's comments of "I use this solution for converting password strings to byte[] before salting and hashing them. In this use case, I absolutely do not care about encoding at all" should finally convince you to delete this answer. It is obviously not clear enough to be useful if someone actually believes that enough to defend it. – John Rasch Jan 11 '18 at 16:36
  • @John Rasch I still don't see what's wrong with that. .NET strings always have the same, fixed-length encoding (i.e. UTF-16). So it's safe to assume that two .NET strings with identical char sequence are internally represented as identical byte sequence. – chris Jan 13 '18 at 17:34
  • I withdraw the "fixed-length" part of the comment above, that is admittedly incorrect. Still, I don't see why any two equal .NET strings should ever be represented as different byte sequences in memory. – chris Jan 13 '18 at 17:49
  • @chris: `string.Equals("\u0041\u030A", "\u00C5", StringComparison.InvariantCulture)` is one example, but it also has absolutely nothing to do with my answer, since you would have exactly the same problem if you specify an encoding. – user541686 Jan 28 '18 at 04:13
  • 2
    The fact that the original data is stored in a string already implies encoding. It's _not_ merely an array of bytes to be toyed with as you please. If it were so, why have you stored it in a string? That's...Just stupid. The assertion being made in here that people are incorrectly "interpreting" the bytes is flat-out incorrect, because the bytes have _already_ been interpreted by the fact the original data was stored in a .net string. The consumer of the resulting bytes is going to have to implicitly know what the encoding was to make any use of the original bytes whatsoever. – dodexahedron Jan 29 '18 at 09:49
  • 2
    This answer is so wrong, I'm shocked to see it's got this many upvotes. Yes, in theory it works. But that's exactly where the possible use cases for this code end. Anyone using this code in production should be fired on the spot. And the argument "It doesn't matter if the string contains invalid characters" is BS, because your strings will never contain invalid characters to begin with. – Tom Lint Feb 15 '18 at 09:04
  • 6
    This is an encoding. You've just invented your own encoding instead of using a standard one. – user253751 Jul 10 '18 at 22:25
  • 2
    But moving along, since you did mean "injection" then I think your answer could be improved by making clear (in the answer text and not just comments) that `GetString` does _not_ work for arbitrary byte arrays but only for those produced by `GetBytes`. I think this also implies either (a) removing the statement "you DON'T need to worry about encoding _if_ the bytes don't need to be interpreted" because the use of `GetString` implies interpretation (since only "meaningful" byte arrays work as input) or (b) removing `GetString` itself (which technically isn't necessary for OP). Thoughts? – Matt Thomas Jun 03 '19 at 20:03
  • @MattThomas: I can't remove your comments since I'm not a mod. Removing `GetString` would make `GetBytes` pointless since you need to be able to reconstruct the original string. I never suggested `GetBytes` works on "any string" either, that's something else that you read into my comments out of nowhere. If you simply try it it will clearly fail half the time. I was explaining how to get bytes from a string. Re: the note, I didn't add it in the first place; people argued & nitpicked & flamed me for >3 years until someone else added it. I'm sick of arguing over random nitpicking at this point. – user541686 Jun 03 '19 at 20:08
  • "I'm sick of arguing over random nitpicking at this point" ROFL! I'm sorry for laying on more of it, just trying to be clear and precise in short space. I'm not saying that you said it does work on arbitrary byte arrays, just that the answer could be improved in a small way so that people don't falsely assume it does when they cargo code. I'll make the edit and we'll see how the mobs respond :) – Matt Thomas Jun 03 '19 at 20:13
  • beware that `str.ToCharArray()` actually uses an hidden "encoding" to convert the 16-bit codeunits to arrays of Char; if it's not lossy, it will push at least two chars per codeunit in the string; if it was lossy, it will drop the highbyte of each code unit to return one Char. Then this depends on the bitsize of each "Char" (which is not necessarily an 8-bit "byte"). Look for the difference between "byte" and "char" data types, they're not equivalent in C# (same thing in C and C++). To allow lossless conversions with strings without exceptions, a "char" needs to be an unsigned 16-bit exactly. – verdy_p Sep 07 '19 at 16:26
  • For being lossless, a 16-bit "char" will allow arbitrary codeunits in any order (including unpaired surrogates). So an array of "char" it is also not warrantied to be valid UTF-16. Strings are just convenient compacted and unmutable arrays of 16-bit units, frequently (not always) more efficient in storage and processing speed compared to classic mutable arrays of numbers. – verdy_p Sep 07 '19 at 16:31
  • Because the .NET runtime might change its encoding in some future version. Then your code will break if the runtimes of two systems need to communicate are different or if the runtime is upgraded after storing the bytes obtained in this manner. I'm not sure the internal representation of a string is even guaranteed or documented. An explicit encoding will be far less likely to suffer from these problems. – jpmc26 Sep 11 '19 at 03:26
  • 1
    You just assume the input is standard English string. How can you make sure `str.Length * sizeof(char)` is the size of the string? I mean the string doesn't even have to be English characters. For example, in UTF-8 "你好", which returns length as 2, but each of them is 3 bytes instead of 1 byte in UTF-8! – joe Nov 27 '19 at 11:21
  • @joe: No, he doesn't assume English or Basic Multilingual Plane. `str.Length * sizeof(char)` is somewhat wrong, it would be slightly better to fetch `Length` from the `GetCharArray()` result. That `chararray.Length * sizeof(char)` really is exactly the size of the `char` array (content, not counting .NET vtable, synchronization, and garbage collector bookkeeping overhead), and **a .NET string is a `char` array**, so it also is the "size" of the string (content). – Ben Voigt Dec 07 '20 at 22:57
1126

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small sample why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

ASCII simply isn't equipped to deal with special characters.

Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).
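For instance, a small sketch in the same style as the sample above (not part of the original sample), showing that the UTF-16 bytes come out two per char, low byte first:

string pi = "\u03a0"; // Greek capital letter Pi
byte[] utf16 = System.Text.Encoding.Unicode.GetBytes (pi);

Console.WriteLine (utf16.Length); //Will print 2 (0xA0, 0x03 -- little-endian UTF-16)
Console.WriteLine (System.Text.Encoding.Unicode.GetString (utf16) == pi); //Will print True -- the round trip is lossless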

See Character Encoding in the .NET Framework (MSDN) for more information.

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
bmotmans
  • 15,290
  • 5
  • 18
  • 14
  • 15
    But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory? – Agnel Kurian Jan 23 '09 at 13:48
  • 62
    .NET strings are always encoded as Unicode. So use System.Text.Encoding.Unicode.GetBytes(); to get the set of bytes that .NET would be using to represent the characters. However why would you want that? I recommend UTF-8 especially when most characters are in the western latin set. – AnthonyWJones Jan 23 '09 at 14:33
  • 1
    There's also System.Text.Encoding.Default – Joel Coehoorn Jan 23 '09 at 15:39
  • 8
    Also: the exact bytes used internally in the string _don't matter_ if the system that retrieves them doesn't handle that encoding or handles it as the wrong encoding. If it's all within .Net, why convert to an array of bytes at all. Otherwise, it's better to be explicit with your encoding – Joel Coehoorn Jan 23 '09 at 15:42
  • 11
    @Joel, Be careful with System.Text.Encoding.Default as it could be different on each machine it is run. That's why it's recommended to always specify an encoding, such as UTF-8. – Ash Jan 28 '10 at 09:01
  • 26
    You don't need the encodings unless you (or someone else) actually intend(s) to *interpret* the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See [my answer](http://stackoverflow.com/a/10380166/541686) for a way to do this without worrying about the encoding. (I might have given a -1 for saying you need to worry about encodings when you don't, but I'm not feeling particularly mean today. :P) – user541686 Apr 30 '12 at 07:55
  • 2
    Good discussion, sometimes I need one of the above alternatives. But also looks like: "One fool can ask more than seven wise men can answer" :-) – Roland Mar 26 '13 at 14:49
  • 7
    +1; @Mehrdad: The `GetString` method *is* an interpretation of the output of the `GetBytes` method. This is why you *have* to use the same encoding in both methods. – chiccodoro Jul 17 '13 at 07:57
  • 4
    I think it's important to note that it *doesn't "[depend] on the encoding of your string"*. .NET hides this from you. From what I can tell, a String is represented by a sequence of System.Chars, which are represented as UTF-16. What matters is that you must store the bytes in *some encoding* and know to retrieve them with the *same encoding*. To not do that is the same as password-protecting your files and trying to use a different password to unprotect them. – Millie Smith Feb 05 '16 at 23:39
  • 1
    I don't think the encoding in .Net is actually UTF-16; that implies control bits. It simply saves all text as raw 16-bit code words, without the expandability. From what I've seen, this also means it doesn't support unicode above code word 0xFFFF – Nyerguds Dec 17 '19 at 13:28
296

The accepted answer is very, very complicated. Use the included .NET classes for this:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Don't reinvent the wheel if you don't have to...

Vlad
  • 17,187
  • 4
  • 39
  • 68
Erik A. Brandstadmoen
  • 9,864
  • 2
  • 34
  • 52
  • 15
    In case the accepted answer gets changed, for record purposes, it is Mehrdad's answer at this current time and date. Hopefully the OP will revisit this and accept a better solution. – Thomas Eding Sep 27 '13 at 18:20
  • 9
    good in principle but, the encoding should be `System.Text.Encoding.Unicode` to be equivalent to Mehrdad's answer. – Jodrell Nov 25 '14 at 09:08
  • 5
    The question has been edited an umptillion times since the original answer, so, maybe my answer is a bit outdated. I never intended to give an exact equivalent to Mehrdad's answer, but give a sensible way of doing it. But, you might be right. However, the phrase "get what bytes the string has been stored in" in the original question is very imprecise. Stored, where? In memory? On disk? If in memory, `System.Text.Encoding.Unicode.GetBytes` would probably be more precise. – Erik A. Brandstadmoen Nov 26 '14 at 11:36
  • After reviewing all the answers, the many comments, and my inspection of memory (don't forget, Visual Studio allows for memory inspection) that the correct answer is `Encoding.Default.GetBytes`. – AMissico Feb 10 '16 at 21:37
  • 8
    @AMissico, your suggestion is buggy, unless you are sure your string is compatible with your system default encoding (string containing only ASCII chars in your system default legacy charset). But nowhere the OP states that. – Frédéric Apr 06 '16 at 20:53
  • @Frédéric; I am just stating my opinion after reviewing all the information and running test scenarios with Unicode characters. I have also used TextPad, HexEdit, WinHex, and Visual Studio to view those bytes. The `Encoding.Default.GetBytes` results are the same as those applications. I am not providing an answer to the OP question. – AMissico Apr 07 '16 at 18:24
  • 6
    @AMissico It can cause the program to give different results _on different systems_ though. That's _never_ a good thing. Even if it's for making a hash or something (I assume that's what OP means with 'encrypt'), the same string should still always give the same hash. – Nyerguds Apr 22 '16 at 10:33
  • 1
    +1 for UTF-8. That is what is being assumed by those that say encoding does not matter. UTF-8 is a strict value for value encoding of an unsigned char (BYTE). Everything else is...not. – jinzai Jun 22 '16 at 15:11
  • 1
    @jinzai, but what about UTF-16, which .NET uses internally? – NH. Nov 08 '17 at 17:02
  • UTF-16 is part of the "everything else" I mentioned. The original question -- was referring to 'byte representations'. With respect to UTF-16 -- the values map the same for ASCII, but -- they are words, not bytes. I am fairly certain that everyone knows that .NET uses UTF-16 internally, however -- I always use UTF-8 for things like XML. .NET now respects that, at least. – jinzai Nov 10 '17 at 15:13
120
// Requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
// (MessageBox below comes from System.Windows.Forms)
BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());
Michael Buen
  • 36,153
  • 6
  • 84
  • 113
  • 2
    You could use the same BinaryFormatter instance for all of those operations – Joel Coehoorn Jan 23 '09 at 17:25
  • 3
    Very Interesting. Apparently it will drop any high surrogate Unicode character. See the documentation on [[BinaryFormatter](http://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter%28v=VS.100%29.aspx)] –  Nov 18 '10 at 18:51
97

You need to take the encoding into account, because one character can be represented by 1 or more bytes (up to 4), and different encodings will treat those bytes differently.

Joel has a posting on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Zhaph - Ben Duguid
  • 25,726
  • 4
  • 75
  • 111
  • 7
    "1 character could be represented by 1 or more bytes" I agree. I just want those bytes regardless of what encoding the string is in. The only way a string can be stored in memory is in bytes. Even characters are stored as 1 or more bytes. I merely want to get my hands on them bytes. – Agnel Kurian Jan 23 '09 at 14:07
  • 17
    You don't need the encodings unless you (or someone else) actually intend(s) to *interpret* the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See [my answer](http://stackoverflow.com/a/10380166/541686) for a way to do this without worrying about the encoding. – user541686 Apr 30 '12 at 07:54
  • 9
    @Mehrdad - Totally, but the original question, as stated when I initially answered, didn't caveat what OP was going to happen with those bytes after they'd converted them, and for future searchers the information around that is pertinent - this is covered by [Joel's answer](http://stackoverflow.com/a/473419/33051) quite nicely - and as you state within your answer: provided you stick within the .NET world, and use your methods to convert to/from, you're happy. As soon as you step outside of that, encoding will matter. – Zhaph - Ben Duguid Apr 30 '12 at 10:48
  • One *code point* can be represented by up to *4* bytes. (One UTF-32 code unit, a UTF-16 surrogate pair, or 4 bytes of UTF-8.) The values that UTF-8 would need more than 4 bytes for are outside the 0x0..0x10FFFF range of Unicode. ;-) – DevSolar Oct 08 '18 at 15:05
96

This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the latter first.

Common Need

Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution or skipping. The default policy is to substitute a '?'.

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!
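If substitution is not acceptable, the fallback behaviour can be selected explicitly. A sketch using the standard fallback classes (the "us-ascii" name and sample text are just illustrative):

// using System.Text;
var strict = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),  // throw instead of substituting when encoding
    new DecoderExceptionFallback()); // throw instead of substituting when decoding
try
{
    strict.GetBytes("You win €100");
}
catch (EncoderFallbackException)
{
    // '€' has no ASCII representation, so the conversion is rejected rather than silently lossy
}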

Note: For System.String the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# spec:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:

".NET String to byte array".ToCharArray()

That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.

The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler writes code to skip over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
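For completeness, the reverse direction, as suggested in the comments below (a sketch; the method name is mine, and it assumes the byte array holds an even number of bytes laid out as the machine's UTF-16 code units):

unsafe string GetStringFromRawBytes(byte[] bytes)
{
    if (bytes == null) return null;
    fixed (byte* pRaw = bytes)
    {
        // Reinterpret the raw bytes as UTF-16 code units and rebuild the string
        return new string((char*)pRaw, 0, bytes.Length / 2);
    }
}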

Community
  • 1
  • 1
Tom Blodget
  • 18,829
  • 2
  • 35
  • 64
  • In general, is not correct to set `byteCount` to twice the string length. For Unicode code points outside the Basic Multilingual Plane, there will be two 16-bit code units for each character. – Jan Hettich Feb 04 '14 at 02:33
  • 4
    @Jan That's correct but string length already gives the number of code-units (not codepoints). – Tom Blodget Feb 04 '14 at 02:35
  • 1
    Thanks for pointing that out! From MSDN: "The `Length` property [of `String`] returns the number of `Char` objects in this instance, not the number of Unicode characters." Your example code is therefore correct as written. – Jan Hettich Feb 04 '14 at 05:42
  • I don't think `Char` is really an "encoding-specific" type; from what I can tell, there is a specified 1:1 relationship between `Char` values and `UInt16` values, any `Char[]` can be converted to a string of the same length, and any such string may be converted to a `Char[]` equal to the original, *whether or not the sequence of `Char` values ever formed a valid UTF-16 string*. – supercat Nov 12 '14 at 22:29
  • 1
    @supercat "The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units."—_C# 5 Specification._ Although, yes, there is nothing that prevents a invalid Unicode string: `new String(new []{'\uD800', '\u0030'})` – Tom Blodget Nov 13 '14 at 00:15
  • @TomBlodget: I can't find anything which indicates that all values 0x0000-0xFFFF may be regarded as "code units", but the term "sequence of code units" would imply that the type could accommodate sequences of code *units* which do not represent sequences of code *points*. I really don't know any type other than `String` that better encapsulates the concept of "immutable sequence of 16-bit values"; because `System.String` has special Runtime support which is not available for any other type, it can offer better performance for many operations than would be possible with any other type. – supercat Nov 13 '14 at 17:50
  • 1
    @TomBlodget: Interestingly, if one takes instances of `Globalization.SortKey`, extracts the `KeyData`, and packs the resulting bytes from each into a `String` [two bytes per character, *MSB first*], calling `String.CompareOrdinal` upon the resulting strings will be substantially faster than calling `SortKey.Compare` on the instances of `SortKey`, or even calling `memcmp` on those instances. Given that, I wonder why `KeyData` returns a `Byte[]` rather than a `String`? – supercat Nov 13 '14 at 17:56
  • @TomBlodget +1 great answer! For the sake of completeness, it'd be nice to add how to back in reverse. This worked for me: `unsafe string GetString(byte[] bytes) { fixed (byte* bptr = bytes) { char* cptr = (char*)(bptr); var result = new string(cptr, 0, bytes.Length / 2); return result; } }` – vexe Mar 14 '15 at 13:55
  • 1
    Alas, the right answer, but years too late, will never have as many votes as the accepted one. Due to TL;DR, people will think the accepted answer rocks, copy'n'paste it, and up-vote it. – Martin Capodici Jun 30 '15 at 02:38
  • Love this answer because of the approach, but it is wrong -- A surrogate pair would be a single code unit, but would be 4 bytes. So codeunitcount * 2 is not correct. – Gerard ONeill Aug 18 '15 at 15:59
  • 1
    @GerardONeill Thanks for the feedback. According the the C# spec, a .NET string is counted sequence of UTF-16 code units. A codepoint is encoded in one or more code units. In the case of UTF-16, that's one or two. When two, they are the "high" surrogate followed by the "low" surrogate. So, `codeunitcount * 2` is the correct number of bytes for a _code unit._ The code does not count _codepoints_ at all. – Tom Blodget Aug 18 '15 at 19:19
  • Sorry, I didn't know the semantics of 'Code Unit'. Did not realize the horror of String.Length with surrogates; it seemed obvious that length would count full blown chars (codepoints). So yes, what you have here will work. This also explains why and how unmatched surrogates are allowed in strings. – Gerard ONeill Aug 18 '15 at 19:47
  • @GerardONeill Yes, horror. I had been assuming that strings had to valid Unicode (including matching surrogates) but, alas, nothing says that it has to be true. – Tom Blodget Aug 19 '15 at 23:59
  • 1
    @TomBlodget: You don't need `fixed` or `unsafe` code, you can also do `var gch = GCHandle.Alloc("foo", GCHandleType.Pinned); var arr = new byte[sizeof(char) * ((string)gch.Target).Length]; Marshal.Copy(gch.AddrOfPinnedObject(), arr, 0, arr.Length); gch.Free();` – user541686 Jan 28 '18 at 04:27
  • @Mehrdad Yes, that is also a good answer that meets the rather limiting non-functional constraints of the question asker. I think Pinned and fixed amount to the same thing but it does eliminate the need for unsafe. – Tom Blodget Jan 28 '18 at 19:05
48

The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is in two parts.

First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.

If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end, even if it's the same encoding used internally by .Net.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you it's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding, and get that performance savings.

Which brings me to the second part... picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But, for the moment (and foreseeable future), just choosing the Unicode encoding gives you what you want.

It's also important to understand that your string has to be re-written before it goes on the wire, and that involves at least some translation of the bit-pattern even when you use a matching encoding. The computer needs to account for things like Big vs Little Endian, network byte order, packetization, session information, etc.
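To make the endianness point concrete, here is a small sketch (not part of the original answer) showing that the "same" UTF-16 encoding comes in two byte orders:

// using System.Text;
byte[] little = Encoding.Unicode.GetBytes("A");        // { 0x41, 0x00 } -- UTF-16LE
byte[] big = Encoding.BigEndianUnicode.GetBytes("A");  // { 0x00, 0x41 } -- UTF-16BE
// The receiving end has to know which order was used, which is exactly why
// being explicit about the encoding matters.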

Joel Coehoorn
  • 362,140
  • 107
  • 528
  • 764
  • 10
    There are areas in .NET where you do have to get byte arrays for strings. Many of the .NET Cryptography classes contain methods such as ComputeHash() that accept a byte array or stream. You have no alternative but to convert a string to a byte array first (choosing an Encoding) and then optionally wrap it in a stream. However as long as you choose an encoding (i.e. UTF-8) and stick with it there are no problems with this. – Ash Jan 28 '10 at 09:33
44

Just to demonstrate that Mehrdad's sound answer works, his approach can even persist unpaired surrogate characters (a charge many had leveled against my answer, but of which everyone is equally guilty, e.g. System.Text.Encoding.UTF8.GetBytes, System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist a high surrogate character such as d800, for example, and merely replace it with the value fffd):

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes, they will merely replace high surrogate characters with value fffd

Every time there's a movement in this question, I'm still thinking of a serializer (be it from Microsoft or from a 3rd party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then there's somebody commenting on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter

谢谢!

Community
  • 1
  • 1
Michael Buen
  • 36,153
  • 6
  • 84
  • 113
  • 3
    Don't surrogates have to appear in pairs to form valid code points? If that's the case, I can understand why the data would be mangled. – dtanders Jun 14 '12 at 14:27
  • 1
    @dtanders Yes,that's my thoughts too, they have to appear in pairs, unpaired surrogate characters just happen if you deliberately put them on string and make them unpaired. What I don't know is why other devs keep on harping that we should use encoding-aware approach instead, as they deemed the serialization approach([my answer](http://stackoverflow.com/a/473574/11432),which was an accepted answer for more than 3 years) doesn't keep the unpaired surrogate character intact. But they forgot to check that their encoding-aware solutions doesn't keep the unpaired surrogate character too,the irony ツ – Michael Buen Jun 14 '12 at 23:23
  • If there's a serialization library that uses `System.Buffer.BlockCopy` internally, all encoding-advocacy folks' arguments will be moot – Michael Buen Jun 14 '12 at 23:23
  • The problem with your test is that you have made an invalid string. ["In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point."](http://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates). If you follow /uD800 with /uDC00 then it works fine in all the unicode formats. It is important to note that this is a string, not a char array, so certain restrictions make sense. Also, it works fine even without /uDC00 in UTF7. – Trisped Nov 11 '14 at 19:58
  • 2
    @MichaelBuen It seem to me that the main issue is that you are in big bold letters saying something doesn't matter, rather than saying that it does not matter in their case. As a result, you are encouraging people who look at your answer to make basic programming mistakes which will cause others frustration in the future. Unpaired surrogates are invalid in a string. It is not a char array, so it makes sense that converting a string to another format would result in an error `FFFD` on that character. If you want to do manual string manipulation, use a char[] as recommended. – Trisped Nov 11 '14 at 20:06
  • @Trisped: If one wishes to convert byte arrays to a form which will permit rapid lexicographic comparison (with the ranking of the comparison being that of the first mismatched byte), would anything faster than `String.CompareOrdinal` be usable without "unsafe" code? Converting `Char[]` arrays with unmatched surrogates to `String` for purposes of using `String.CompareOrdinal` on them is nasty, but what approach would be better? – supercat Nov 12 '14 at 21:37
  • 3
    @dtanders: A `System.String` is an immutable sequence of `Char`; .NET has always allowed a `String` object to be constructed from any `Char[]` and export its content to a `Char[]` containing the same values, even if the original `Char[]` contains unpaired surrogates. – supercat Nov 12 '14 at 21:57
  • @supercat Well the docs say a Char is supposed to be UTF-16 so unmatched surrogates are illegal in Chars, too. Reading all this again two years later, I'm thinking something should probably throw an error rather than mangle an illegal byte sequence into a character, but whatever. – dtanders Nov 13 '14 at 21:11
  • @dtanders: No, it's perfectly legal to have unmatched surrogates, but you're concluding otherwise because the [**Unicode terminology**](https://www.unicode.org/glossary/) is confusing you. There is no such thing as a(n) "(in)valid UTF-16 `char`". If you read the C# language specification, it says *"the `char` type represents a UTF-16 code **unit**, and the string type represents a **sequence** of UTF-16 code **units**"*. Note that it does *not* say `string` has to be a well-formed "Unicode string" (and note that even "Unicode String" is *explicitly* permitted to be ill-formed in the glossary). – user541686 Jan 28 '18 at 04:36
  • @MichaelBuen: It might not be a bad idea to edit the note I mentioned above^ into your answer, so that people realize ill-formed Unicode strings are in fact [perfectly legal "Unicode Strings"](https://www.unicode.org/glossary/#unicode_string) (and perfectly legal `string`s). – user541686 Jan 28 '18 at 04:43
41

Try this, a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");
Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Nathan
  • 571
  • 1
  • 5
  • 8
  • Then try this `System.Text.Encoding.UTF8.GetBytes("Árvíztűrő tükörfúrógép");`, and cry! It will work, but `System.Text.Encoding.UTF8.GetBytes("Árvíztűrő tükörfúrógép").Length != System.Text.Encoding.UTF8.GetBytes("Arvizturo tukorfurogep").Length` while `"Árvíztűrő tükörfúrógép".Length == "Arvizturo tukorfurogep".Length` – mg30rg Dec 05 '17 at 16:30
  • 9
    @mg30rg: Why do you think your example is strange? Surely in a variable-width encoding not all characters have the same byte lengthes. What's wrong with it? – Vlad Feb 25 '18 at 01:18
  • @Vlad A more valid comment here, though, is that as encoded unicode symbols (so, as bytes), characters which _include_ their own diacritics will give a different result than diacritics split off into modifier symbols _added to_ the character. But iirc there are methods in .net to specifically split those off, to allow getting a consistent byte representation. – Nyerguds Mar 31 '20 at 12:43
25

Well, I've read all the answers, and they are either about using an encoding or about serialization, which drops unpaired surrogates.

It's bad when the string, for example, comes from SQL Server where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds).

So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it has a bug in it and only goes one way, so I've fixed the bug and written the reverse procedure as well. Here you are, future googlers:

public static byte[] StringToBytes(string str)
{
    byte[] data = new byte[str.Length * 2];
    for (int i = 0; i < str.Length; ++i)
    {
        char ch = str[i];
        data[i * 2] = (byte)(ch & 0xFF);
        data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);
    }

    return data;
}

public static string StringFromBytes(byte[] arr)
{
    char[] ch = new char[arr.Length / 2];
    for (int i = 0; i < ch.Length; ++i)
    {
        ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
    }
    return new String(ch);
}
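
A small usage sketch (the GetHashStringFromSqlServer name is hypothetical) combining these helpers with the built-in Base64 converter mentioned in the comments, for storing such a string safely in XML:

string fromDatabase = GetHashStringFromSqlServer(); // hypothetical source of the binary-in-a-string data
byte[] raw = StringToBytes(fromDatabase);           // recover the original bytes, unpaired surrogates and all
string base64 = Convert.ToBase64String(raw);        // safe to embed in XML

// ... later, when reading it back ...
byte[] restored = Convert.FromBase64String(base64);
string original = StringFromBytes(restored);        // identical to fromDatabase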
Tshilidzi Mudau
  • 5,472
  • 6
  • 32
  • 44
Gman
  • 1,522
  • 1
  • 20
  • 35
  • Instead of using your custom method to convert a byte array to base64, all you had to do was use the built-in converter: Convert.ToBase64String(arr); – Makotosan Feb 10 '12 at 15:53
  • @Makotosan thank you, but I did use `Convert.ToBase64String(arr); ` for the base64 conversions `byte[] (data) <-> string (serialized data to store in XML file)`. But to get the initial `byte[] (data)` I needed to do something with a `String` that contained _binary_ data (it's the way MSSQL returned it to me). So the functions above are for `String (binary data) -> byte[] (easily accessible binary data)`. – Gman Mar 06 '12 at 19:15
24

Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.

An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, with any of the Unicode flavours being the safest choice nowadays.
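
As a small illustration of that convention (not part of the original answer), the same character maps to different bytes under different encodings:

// 'A' is code point U+0041; the encoding decides which bytes represent it
byte[] ascii = System.Text.Encoding.ASCII.GetBytes("A");   // { 0x41 }
byte[] utf16 = System.Text.Encoding.Unicode.GetBytes("A"); // { 0x41, 0x00 } (little-endian UTF-16)
byte[] utf32 = System.Text.Encoding.UTF32.GetBytes("A");   // { 0x41, 0x00, 0x00, 0x00 }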

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Konamiman
  • 47,560
  • 16
  • 107
  • 133
  • 2
    Allow me to clarify: An encoding has been used to translate "hello world" to physical bytes. Since the string is stored on my computer, I am sure that it must be stored in bytes. I merely want to access those bytes to save them on disk or for any other reason. I do not want to interpret these bytes. Since I do not want to interpret these bytes, the need for an encoding at this point is as misplaced as requiring a phone line to call printf. – Agnel Kurian Jul 16 '09 at 15:30
  • 3
    But again, there is no concept of text-to-physical-bytes-translation unless you use an encoding. Sure, the compiler stores the strings somehow in memory - but it is just using an internal encoding, which you (or anyone except the compiler developer) do not know. So, whatever you do, you need an encoding to get physical bytes from a string. – Konamiman Jul 22 '09 at 08:35
  • @Agnel Kurian: It is of course true, that a string has a bunch of bytes somewhere that store its content (UTF-16 afair). But there is a good reason to prevent you from accessing it: strings are immutable and if you could obtain the internal byte[] array, you could modify it, too. This breaks immutability, which is vital because multiple strings may share the same data. Using an UTF-16 encoding to get the string will probably just copy the data out. – ollb May 14 '11 at 00:06
  • 2
    @Gnafoo, A copy of the bytes will do. – Agnel Kurian May 14 '11 at 05:06
22

C# to convert a string to a byte array:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}
iliketocode
  • 6,652
  • 4
  • 41
  • 57
Shyam sundar shah
  • 2,284
  • 1
  • 21
  • 35
17

You can use the following code for conversion between string and byte array.

string s = "Hello World";

// String to Byte[]

byte[] byte1 = System.Text.Encoding.UTF8.GetBytes(s);

// OR (note: ASCIIEncoding.Default is NOT ASCII; it resolves to the inherited
// static Encoding.Default, which is the machine-dependent system default)

byte[] byte2 = System.Text.Encoding.Default.GetBytes(s);

// Byte[] to string -- use the same encoding that produced the bytes,
// otherwise the round trip breaks for non-ASCII characters

string str = System.Text.Encoding.UTF8.GetString(byte1);
Jarvis Stark
  • 613
  • 5
  • 11
  • VUPthis one solved my problem ( byte[] ff = ASCIIEncoding.ASCII.GetBytes(barcodetxt.Text);) – r.hamd Sep 09 '15 at 13:19
17
byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}
gkrogers
  • 7,629
  • 2
  • 27
  • 34
  • But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory? – Agnel Kurian Jan 23 '09 at 13:46
  • 5
    This doesn't always work. Some special characters can get lost in using such a method I've found the hard way. – JB King Jan 23 '09 at 17:14
17

With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:

byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();

Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:

string s;
unsafe
{
    fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
    {
        s = new string(f);
    }
}

The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.

Note that working with Span<T> requires installing the System.Memory NuGet package.
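
With the versions of System.Memory that actually shipped, the same reinterpretation is expressed through MemoryMarshal rather than the extension methods above (a sketch; the behaviour is still a raw copy of the UTF-16 code units):

// using System;
// using System.Runtime.InteropServices;
string s = "rubbish_\u9999_string";
byte[] bytes = MemoryMarshal.AsBytes(s.AsSpan()).ToArray(); // raw UTF-16 bytes of the string

// And back, still without naming an Encoding (same caveats as above):
string roundTripped = new string(MemoryMarshal.Cast<byte, char>(bytes.AsSpan()).ToArray());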

Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.

John Rasch
  • 57,880
  • 19
  • 101
  • 136
  • `new string(f)` is wrong, you at least need to use the constructor overload that accepts an explicit length if you want any hope of round-tripping all strings. – Ben Voigt Dec 07 '20 at 23:09
13

I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient in terms of bytes. Specifically, the definition of a Char is "Represents a Unicode character".

Take this sample:

String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info =  Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    System.Console.WriteLine(enc.Name + " - " 
      + enc.GetEncoding().GetByteCount(str) + " - "
      + enc.GetEncoding().GetByteCount(str2));
}

Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.

So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.

iliketocode
  • 6,652
  • 4
  • 41
  • 57
Ed Marty
  • 39,011
  • 19
  • 96
  • 153
10

The key issue is that a char in a string takes 16 bits (and a full Unicode code point can take up to 32 bits), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy to use by your client when she needs to map the byte[] back to a string.

UTF-8 is a popular encoding; it is compact and not lossy.
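
For instance (a sketch, not from the original answer), UTF-8 round-trips non-ASCII text where ASCII cannot:

string s = "Ĉu vi?";                                      // contains a non-ASCII character
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(s);
string back = System.Text.Encoding.UTF8.GetString(utf8);  // back == s, nothing lost
byte[] ascii = System.Text.Encoding.ASCII.GetBytes(s);    // 'Ĉ' becomes '?', information lost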

Hans Passant
  • 873,011
  • 131
  • 1,552
  • 2,371
  • 3
    UTF-8 is compact only if the majority of your characters are in the English (ASCII) character set. If you had a long string of Chinese characters, UTF-16 would be a more compact encoding than UTF-8 for that string. This is because UTF-8 uses one byte to encode ASCII, and 3 (or maybe 4) otherwise. – Joel Mueller Jan 23 '09 at 20:40
  • 7
    True. But, how can you not know about encoding if you're familiar with handling Chinese text? – Hans Passant Jan 24 '09 at 03:40
9

Use:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103
Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
mashet
  • 667
  • 1
  • 9
  • 14
  • OP specifically asks to NOT specify an encoding... "without manually specifying a specific encoding" – Ferdz Aug 30 '18 at 13:40
8

Fastest way

public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}

EDIT: as Makotosan commented, this is now the best way:

Encoding.UTF8.GetBytes(text)
Alessandro Annini
  • 1,461
  • 1
  • 15
  • 30
8

The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on the implementation of the String object.

"Can't I simply get what bytes the string has been stored in?"

Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.

What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip, then you can choose any encoding, including the 'default', and convert it back later (assuming the same parameters, such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.).

But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are gonna get?

"Why is there a dependency on character encodings?"

Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in -- this is where the question was asked naively. You wanted the bytes of the string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.

Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.

I like Tom Blodget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.

Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved -- this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.


I've changed my mind about this (Mehrdad's solution) -- this isn't getting the bytes of the string; rather it is getting the bytes of the character array that was created from the string. Regardless of encoding, the char datatype in c# is a fixed size. This allows a consistent length byte array to be produced, and it allows the character array to be reproduced based on the size of the byte array. So if the encoding were UTF-8, but each char was 6 bytes to accommodate the largest utf8 value, it would still work. So indeed -- encoding of the character does not matter.

But a conversion was used -- each character was placed into a fixed size box (c#'s character type). However what that representation is does not matter, which is technically the answer to the OP. So -- if you are going to convert anyway... Why not 'encode'?

Gerard ONeill
  • 3,193
  • 31
  • 22
  • These characters are ***not supported*** by UTF-8 or UTF-16 or even UTF-32, for example: `` & `(Char) 55906` & `(Char) 55655`. So you may be wrong and Mehrdad's answer is a safe conversion without considering what type of encodings are used. – Mojtaba Rezaeian Feb 11 '16 at 19:48
  • Raymon, the characters are already represented by some unicode value -- and all unicode values can be represented by all the utf's. Is there a longer explanation of what you are talking about? What character encoding do those two values (or 3..) exist in? – Gerard ONeill Feb 11 '16 at 20:47
  • They are invalid characters which not supported by any encoding ranges. This not means they are 100% useless. A code which converts any type of string to its byte array equivalent regardless of the encodings is not a wrong solution at all and have its own usages on desired occasions. – Mojtaba Rezaeian Feb 11 '16 at 21:02
  • 1
    Ok, then I think you are not understanding the problem. We know it is a unicode compliant array -- in fact, because it is .net, we know it is UTF-16. So those characters will not exist there. You also didn't fully read my comment about internal representations changing. A String is an object, not an encoded byte array. So I'm going to disagree with your last statement. You want code to convert all unicode strings to any UTF encoding. This does what you want, correctly. – Gerard ONeill Feb 11 '16 at 22:17
  • Objects are sequence of data originally sequence of bits which describe an object in its current state. So every data in programming languages are convertible to array of bytes(each byte defines 8 bits) as you may need to keep some state of any object in memory. You can save and hold a sequence of bytes in file or memory and cast it as integer, bigint, image, Ascii string, UTF-8 string, encrypted string, or your own defined datatype after reading it from disk. So you can not say objects are something different than bytes sequence. – Mojtaba Rezaeian Feb 11 '16 at 23:00
  • Mojtaba -- I updated my answer with a wiser mind at the keyboard. However what you said isn't right for objects who have other object dependencies. But Mehrdad's solution, by converting it to an array of char, eliminates this, making what you said possible. Still trying to decide whether or not to replace my entire response.. But perhaps my learning process will have some value. – Gerard ONeill Nov 01 '17 at 19:49
8

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

A string in .NET represents text as a sequence of UTF-16 code units, so the bytes are encoded in memory in UTF-16 already.

Mehrdad's Answer

You can use Mehrdad's answer, but it does actually use an encoding because chars are UTF-16. It calls ToCharArray which, looking at the source, creates a char[] and copies the memory to it directly. Then it copies the data to a byte array that is also allocated. So under the hood it is copying the underlying bytes twice and allocating a char array that is not used after the call.

Tom Blodget's Answer

Tom Blodget's answer is 20-30% faster than Mehrdad's since it skips the intermediate step of allocating a char array and copying the bytes to it, but it requires you to compile with the /unsafe option. If you absolutely do not want to use encoding, I think this is the way to go. If you put your encryption logic inside the fixed block, you don't even need to allocate a separate byte array and copy the bytes to it.

Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

Because that is the proper way to do it. string is an abstraction.

Using an encoding could give you trouble if you have 'strings' with invalid characters, but that shouldn't happen. If you are getting data into your string with invalid characters you are doing it wrong. You should probably be using a byte array or a Base64 encoding to start with.

If you use System.Text.Encoding.Unicode, your code will be more resilient. You don't have to worry about the endianness of the system your code will be running on. You don't need to worry if the next version of the CLR will use a different internal character encoding.

I think the question isn't why you want to worry about the encoding, but why you want to ignore it and use something else. Encoding is meant to represent the abstraction of a string in a sequence of bytes. System.Text.Encoding.Unicode will give you a little endian byte order encoding and will perform the same on every system, now and in the future.

Jason Goemaat
  • 27,053
  • 14
  • 78
  • 109
  • Actually a string in C# is NOT restricted to just UTF-16. What is true is that it contains a vector of 16-bit code units, but these 16-bit code units are not restricted to valid UTF-16. But as they are 16-bit, you need an encoding (byte order) to convert them to 8bit. A string can then store non-Unicode data, including binary code (e.g. a bitmap image). It becomes interpreted as UTF-16 only in I/O and text formatters that make such interpretation. – verdy_p Sep 07 '19 at 15:42
  • So in a C# string, you can safely store a code unit like 0xFFFF or 0xFFFE, even if they are non-characters in UTF-16, and you can store an isolated 0xD800 not followed by a code unit in 0xDC00..0xDFFF (i.e. unpaired surrogates which are invalid in UTF-16). The same remark applies to strings in Javascript/ECMAscript and Java. – verdy_p Sep 07 '19 at 15:47
  • When you use "GetBytes", of course you don't specify an encoding, but you assume a byte order to get the two bytes in a specic for each code unit stored locally in the string. When you build a new string from bytes, you also need a converter, not necessarily UTF-8 to UTF-16, you could insert the extra 0 in the high byte, or pack two bytes (in MSB first or LSB first order) in the same 16-bit code unit. Strings are then compact form for arrays of 16-bit integers. The relation with "characters" is another problem, in C# they're not actual types as they are still represented as strings – verdy_p Sep 07 '19 at 15:55
6

You can use the following code to convert a string to a byte array in .NET:

string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
İlker Elçora
  • 570
  • 4
  • 13
Shyam sundar shah
  • 2,284
  • 1
  • 21
  • 35
4

If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.

[DllImport(
        "msvcrt.dll",
        EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl,
        SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
    void* destination,
    void* source,
    uint count);

public static byte[] GetUnderlyingBytes(string source)
{
    var length = source.Length * sizeof(char);
    var result = new byte[length];
    unsafe
    {
        fixed (char* firstSourceChar = source)
        fixed (byte* firstDestination = result)
        {
            var firstSource = (byte*)firstSourceChar;
            UnsafeMemoryCopy(
                firstDestination,
                firstSource,
                (uint)length);
        }
    }

    return result;
}

This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.
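
For example, a quick sketch using the function above (the output shown assumes the usual UTF-16LE layout on current runtimes):

var raw = GetUnderlyingBytes("A");
Console.WriteLine(BitConverter.ToString(raw)); // "41-00": 'A' is U+0041 stored low byte first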

It would be safer, simpler and more reliable to just call,

System.Text.Encoding.Unicode.GetBytes()

In all likelihood this will give the same result, it is easier to type, and the bytes will round-trip (as well as a byte representation in Unicode can) with a call to

System.Text.Encoding.Unicode.GetString()
Jodrell
  • 31,518
  • 3
  • 75
  • 114
  • As mentioned in many other comments, `Unicode.GetBytes()` / `Unicode.GetString()` does NOT round-trip for all .NET `string` instances. – Ben Voigt Dec 07 '20 at 23:13
  • @BenVoigt, I tweaked the answer. I'd do something less Windows specific these days. – Jodrell Dec 09 '20 at 12:12
  • You might consider avoiding p/invoke for that, `Marshal.Copy` will work fine for copying from a pointer to a byte array. https://stackoverflow.com/a/54453180/103167 – Ben Voigt Dec 09 '20 at 16:45
  • @BenVoigt or even https://stackoverflow.com/a/48195448/659190 – Jodrell Dec 10 '20 at 08:39
4

Upon being asked what you intend to do with the bytes, you responded:

I'm going to encrypt it. I can encrypt it without converting but I'd still like to know why encoding comes to play here. Just give me the bytes is what I say.

Regardless of whether you intend to send this encrypted data over the network, load it back into memory later, or stream it to another process, you are clearly intending to decrypt it at some point. In that case, the answer is that you're defining a communication protocol. A communication protocol should not be defined in terms of implementation details of your programming language and its associated runtime. There are several reasons for this:

  • You may need to communicate with a process implemented in a different language or runtime. (This might include a server running on another machine or sending the string to a JavaScript browser client, for example.)
  • The program may be re-implemented in a different language or runtime in the future.
  • The .NET implementation might change the internal representation of strings. You may think this sounds farfetched, but this actually happened in Java 9 to reduce memory usage. There's no reason .NET couldn't follow suit. Skeet suggests that UTF-16 probably isn't optimal today, given the rise of emoji and other Unicode blocks that need more than 2 bytes to represent, increasing the likelihood that the internal representation could change in the future.

For communicating (either with a completely disparate process or with the same program in the future), you need to define your protocol strictly to minimize the difficulty of working with it or accidentally creating bugs. Depending on .NET's internal representation is not a strict or clear definition, and it isn't even guaranteed to be consistent. A standard encoding is a strict definition that will not fail you in the future.

In other words, you can't satisfy your requirement for consistency without specifying an encoding.

You may certainly choose to use UTF-16 directly if you find that your process performs significantly better since .NET uses it internally or for any other reason, but you need to choose that encoding explicitly and perform those conversions explicitly in your code rather than depending on .NET's internal implementation.

So choose an encoding and use it:

using System.Text;

// ...

Encoding.Unicode.GetBytes("abc"); // UTF-16 little endian
Encoding.UTF8.GetBytes("abc");    // UTF-8

As you can see, it's also actually less code to just use the built-in encoding objects than to implement your own reader/writer methods.

jpmc26
  • 23,237
  • 9
  • 76
  • 129
3

Here is my unsafe implementation of String to Byte[] conversion:

public static unsafe Byte[] GetBytes(String s)
{
    Int32 length = s.Length * sizeof(Char);
    Byte[] bytes = new Byte[length];

    fixed (Char* pInput = s)
    fixed (Byte* pBytes = bytes)
    {
        Byte* source = (Byte*)pInput;
        Byte* destination = pBytes;

        if (length >= 16)
        {
            do
            {
                *((Int64*)destination) = *((Int64*)source);
                *((Int64*)(destination + 8)) = *((Int64*)(source + 8));

                source += 16;
                destination += 16;
            }
            while ((length -= 16) >= 16);
        }

        if (length > 0)
        {
            if ((length & 8) != 0)
            {
                *((Int64*)destination) = *((Int64*)source);

                source += 8;
                destination += 8;
            }

            if ((length & 4) != 0)
            {
                *((Int32*)destination) = *((Int32*)source);

                source += 4;
                destination += 4;
            }

            if ((length & 2) != 0)
            {
                *((Int16*)destination) = *((Int16*)source);

                source += 2;
                destination += 2;
            }

            if ((length & 1) != 0)
            {
                // A string's byte length is always even (2 bytes per char),
                // so this branch is defensive only.
                destination[0] = source[0];
            }
        }
    }

    return bytes;
}

It's way faster than the accepted answer's, even if not as elegant. Here are my Stopwatch benchmarks over 10,000,000 iterations:

[First String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms

[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms

[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms

In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As of .NET Framework 3.5, this method can also be used as a String extension:

public static unsafe class StringExtensions
{
    public static Byte[] ToByteArray(this String s)
    {
        // Method Code
    }
}
iliketocode
  • 6,652
  • 4
  • 41
  • 57
Tommaso Belluzzo
  • 21,428
  • 7
  • 63
  • 89
  • Is the value of `RuntimeHelpers.OffsetToStringData` a multiple of 8 on the Itanium versions of .NET? Because otherwise this will fail due to the unaligned reads. – Jon Hanna Jan 06 '14 at 14:09
  • wouldn't it be simpler to invoke `memcpy`? http://stackoverflow.com/a/27124232/659190 – Jodrell Nov 25 '14 at 10:33
2

A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different byte-length representations but are equivalent in the sense that when a string is encoded, it can be decoded back to the same string; but if the string is encoded with one UTF and decoded under the assumption of a different UTF, it can be garbled.

Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they will work only if a limited subset of Unicode code points is used in the actual string, such as ASCII). Internally, .NET uses UTF-16, but for stream representation, UTF-8 is usually used. It is also the de facto standard for the Internet.

Not surprisingly, serialization of a string into an array of bytes and deserialization are supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16).

Ref this link.

For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation, use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.

Example:

string myString = //... some string

System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer a UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);

// next lines are written in response to follow-up questions:

myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);

//how many times shall I repeat it to show there is a round-trip? :-)
Bharat Mane
  • 266
  • 2
  • 10
  • 20
Vijay Singh Rana
  • 930
  • 11
  • 32
2

It depends on what you want the bytes FOR

This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.

Assuming that you have binary data (rather than text) stored in a string

This is based on OP's comment on his own question, and it is the correct question if I understand OP's hints at the use-case.

Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
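
A minimal sketch of the Base64 approach for carrying binary data through a string-only channel:

byte[] payload = { 0x00, 0xFF, 0x10, 0x80 };             // arbitrary binary data
string transportable = Convert.ToBase64String(payload);  // "AP8QgA=="
byte[] restored = Convert.FromBase64String(transportable);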

If you have a text string with an unknown encoding

Everybody else answered this incorrect question incorrectly.

If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.
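
For example, a small sketch that records the choice alongside the bytes (WebName gives the encoding's IANA name):

using System.Text;

Encoding chosen = Encoding.UTF8;
byte[] data = chosen.GetBytes("some text");
string encodingLabel = chosen.WebName; // "utf-8"; store or send this with the bytes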

NH.
  • 1,858
  • 2
  • 22
  • 35
2

If you are using .NET Core or System.Memory for .NET Framework, there is a very efficient marshaling mechanism available via Span<T> and Memory<T> that can effectively reinterpret string memory as a span of bytes. Once you have a span of bytes you are free to marshal back to another type, or copy the span to an array for serialization.

To summarize what others have said:

  • Storing a representation of this kind of serialization is sensitive to system endianness, compiler optimizations, and changes to the internal representation of strings in the executing .NET Runtime.
    • Avoid long-term storage
    • Avoid deserializing or interpreting the string in other environments
      • This includes other machines, processor architectures, .NET runtimes, containers, etc.
      • This includes comparisons, formatting, encryption, string manipulation, localization, character transforms, etc.
    • Avoid making assumptions about the character encoding
      • The default encoding tends to be UTF-16LE in practice, but the compiler / runtime can choose any internal representation

Implementation

public static class MarshalExtensions
{
   public static ReadOnlySpan<byte> AsBytes(this string value) => MemoryMarshal.AsBytes(value.AsSpan());
   public static string AsString(this ReadOnlySpan<byte> value) => new string(MemoryMarshal.Cast<byte, char>(value));
}

Example

static void Main(string[] args)
{
    string str1 = "你好,世界";
    ReadOnlySpan<byte> span = str1.AsBytes();
    string str2 = span.AsString();

    byte[] bytes = span.ToArray();

    Debug.Assert(bytes.Length > 0);
    Debug.Assert(str1 == str2);
}

Further Insight

In C++ this is roughly equivalent to reinterpret_cast, and in C it is roughly equivalent to a cast to the system's word type (char).

In recent versions of the .NET Core Runtime (CoreCLR), operations on spans effectively invoke compiler intrinsics and various optimizations that can sometimes eliminate bounds checking, leading to exceptional performance while preserving memory safety, assuming that your memory was allocated by the CLR and the spans are not derived from pointers from an unmanaged memory allocator.

Caveats

This uses a mechanism supported by the CLR that returns a ReadOnlySpan<char> from a string; additionally, this span does not necessarily encompass the complete internal string layout. ReadOnlySpan<T> implies that you must create a copy if you need to perform mutation, as strings are immutable.

Chris Hutchinson
  • 8,422
  • 3
  • 22
  • 33
  • Some commentary: despite what appears to be the popular opinion, an entirely valid use-case for this mechanism is runtime encryption: extract byte representation, encrypt bytes, and keep the encrypted payload in memory. This minimizes encoding overhead, and as long as it's not serialized and transferred to another environment, will not suffer from any encoding-specific issues due to interpretation semantics or internal representation. There's an argument for using **SecureString** for this purpose, and concerns about garbage collection, but otherwise the premise appears sound. – Chris Hutchinson Aug 03 '20 at 21:23
  • There is at least one proposal for CoreCLR to introduce a more compact internal representation: https://github.com/dotnet/runtime/issues/6612 – Chris Hutchinson Aug 03 '20 at 21:26
1

Simply use this:

byte[] myByte= System.Text.ASCIIEncoding.Default.GetBytes(myString);
jonsca
  • 9,342
  • 26
  • 53
  • 60
alireza amini
  • 1,616
  • 1
  • 16
  • 33
  • 2
    ...and lose all characters with a code point higher than 127. In my native language it is perfectly valid to write "Árvíztűrő tükörfúrógép.". `System.Text.ASCIIEncoding.Default.GetBytes("Árvíztűrő tükörfúrógép.").ToString();` will return `"Árvizturo tukörfurogép."` losing information which can not be retrieved. (And I didn't yet mention Asian languages where you would lose all characters.) – mg30rg Jan 11 '18 at 15:09
1
byte[] buffer = UnicodeEncoding.UTF8.GetBytes(something); // convert to UTF-8, then get its bytes

byte[] buffer = ASCIIEncoding.ASCII.GetBytes(something); // convert to ASCII, then get its bytes
user1120193
  • 242
  • 3
  • 11
0

simple code with LINQ

string s = "abc"
byte[] b = s.Select(e => (byte)e).ToArray();

EDIT: as commented below, it is not a good way.

But you can still use it to understand LINQ with a more appropriate coding:

string s = "abc";
byte[] b = s.Cast<byte>().ToArray();
Avlin
  • 490
  • 4
  • 18
  • 2
    It's hardly _more faster_, let alone _most fastest_. It's certainly an interesting alternative, but it's essentially the same as `Encoding.Default.GetBytes(s)` which, by the way, is _way faster_. Quick testing suggests that `Encoding.Default.GetBytes(s)` performs at least 79% faster. YMMV. – WynandB Oct 25 '13 at 04:36
  • 6
    Try it with a `€`. This code will not crash, but will return a **wrong result** (which is even worse). Try casting to a `short` instead of `byte` to see the difference. – Hans Kesting Dec 18 '13 at 08:57
0

A character is both a lookup key into a font table and a lexical tradition such as ordering, upper and lower case versions, etc.

Consequently, a character is not a byte (8 bits) and a byte is not a character. In particular, the 256 permutations of a byte cannot accommodate the thousands of symbols within some written languages, much less all languages. Hence, various methods for encoding characters have been devised. Some encode a particular class of languages (ASCII); some cover multiple languages using code pages (Extended ASCII); and some, ambitiously, cover all languages by selectively including additional bytes as needed (Unicode).

Within a system, such as the .NET framework, a String implies a particular character encoding. In .NET this encoding is Unicode. Since the framework reads and writes Unicode by default, dealing with character encoding is typically not necessary in .NET.

However, in general, to load a character string into the system from a byte stream you need to know the source encoding to therefore interpret and subsequently translate it correctly (otherwise the codes will be taken as already being in the system's default encoding and thus render gibberish). Similarly, when a string is written to an external source, it will be written in a particular encoding.
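
A small sketch of reading text from an external byte source with an explicit encoding (the file name is just an example):

using System.IO;
using System.Text;

// Interpret the incoming bytes as UTF-8; use whatever encoding the source actually wrote.
string text;
using (var reader = new StreamReader("input.txt", Encoding.UTF8))
{
    text = reader.ReadToEnd();
}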

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
George
  • 1,268
  • 18
  • 28
  • 2
    Unicode is not an encoding. Unicode is an abstract mapping of characters to codepoints. There are multiple ways of encoding Unicode; in particular, UTF-8 and UTF-16 are most common. .NET uses UTF-16, though I'm unsure if it's UTF-16 LE or UTF-16 BE. – Kevin Aug 26 '17 at 03:22
  • UTF-16 LE or UTF-16 BE is not relevant: strings are using unbreakable 16-bit code units without any interpretation. UTF-16BE or UTF-16 LE may become relevant only when you convert strings to byte arrays or the reverse because, at that time, you'll specify an encoding (and in that case the string must first be valid UTF-16, but strings don't have to be valid UTF-16). GetBytes() is not necessarily returning valid UTF-16 BE/LE, it uses a simple arithmetic; the returned array is also not valid UTF-8 but arbitrary bytes. The byte order in result is system-specific if no encoding is specified. – verdy_p Sep 07 '19 at 16:05
  • This also means that string.UTF8.getBytes() may throw encoding exceptions from arbitrary strings whose content is not valid UTF-16. In C# you have the choice of encoders/decoders (codec) to use. You may want to use your own codec which will pack/unpack bytes differently, or may silently drop unpaired surrogates (if the codec attempts to interpret the string as UTF-16), or may drop the high bytes, or replace/interpret the codeunits invalid in UTF-16 by U+FFFD. The codec may also use data compression, or hexadecimal/base64 or escaping...Codecs are not restricted to just the UTF8 encoding. – verdy_p Sep 07 '19 at 16:15
  • note: I use here the term "codec" voluntarily instead of "encoding" which is more specific and used only for text. strings in C#, C, C++, Java, Javascript/ECMAscript/ActiveScript are NOT restricted to just valid text: they are just a generic storage structure, convenient for text and treated as text by libraries (but not all). As such the UTF forms are not enforced at all except inside specific APIs using them (including UTF* encoding objects). Yes you can store a binary program or PNG image in a compact immutable string instead of mutable array, but you can I/O all strings to text channels – verdy_p Sep 07 '19 at 18:50
0

I have written a Visual Basic extension similar to the accepted answer, but directly using .NET memory and Marshalling for conversion, and it supports character ranges unsupported in other methods, like UnicodeEncoding.UTF8.GetString or UnicodeEncoding.UTF32.GetString or even MemoryStream and BinaryFormatter (invalid characters like: & ChrW(55906) & ChrW(55655)):

<Extension> _
Public Function ToBytesMarshal(ByRef str As String) As Byte()
    Dim gch As GCHandle = GCHandle.Alloc(str, GCHandleType.Pinned)
    Dim handle As IntPtr = gch.AddrOfPinnedObject
    ToBytesMarshal = New Byte(str.Length * 2 - 1) {}
    Try
        For i As Integer = 0 To ToBytesMarshal.Length - 1
            ToBytesMarshal.SetValue(Marshal.ReadByte(IntPtr.Add(handle, i)), i)
        Next
    Finally
        gch.Free()
    End Try
End Function

<Extension> _
Public Function ToStringMarshal(ByRef arr As Byte()) As String
    Dim gch As GCHandle = GCHandle.Alloc(arr, GCHandleType.Pinned)
    Try
        ToStringMarshal = Marshal.PtrToStringAuto(gch.AddrOfPinnedObject)
    Finally
        gch.Free()
    End Try
End Function
Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Mojtaba Rezaeian
  • 6,543
  • 5
  • 26
  • 48
0

Two ways:

public static byte[] StrToByteArray(this string s)
{
    // Narrows each UTF-16 code unit to a single byte;
    // only safe for characters in the range U+0000 to U+00FF.
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add(Convert.ToByte(c));
    return value.ToArray();
}

And,

public static byte[] StrToByteArray(this string s)
{
    // Interprets the string as hexadecimal pairs (e.g. "1A2B"),
    // rather than encoding the text itself.
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}

I tend to use the bottom one more often than the top, haven't benchmarked them for speed.

  • 4
    What about multibyte characters? – Agnel Kurian Feb 23 '09 at 09:57
  • @AgnelKurian [Msdn says](https://msdn.microsoft.com/en-us/library/ee6e613x(v=vs.110).aspx) _"This method returns an unsigned byte value that represents the numeric code of the Char object passed to it. In the .NET Framework, a Char object is a 16-bit value. This means that the method is suitable for returning the numeric codes of characters in the ASCII character range or in the Unicode C0 Controls and Basic Latin, and C1 Controls and Latin-1 Supplement ranges, from U+0000 to U+00FF."_ – mg30rg Jan 11 '18 at 11:30
-1

To convert a string to a byte[] use the following solution:

string s = "abcdefghijklmnopqrstuvwxyz";
byte[] b = System.Text.Encoding.UTF32.GetBytes(s);

I hope it helps.

WonderWorker
  • 7,388
  • 3
  • 52
  • 71
  • 2
    that's not a solution! – Sebastian Apr 12 '14 at 17:12
  • 1
    Before your edit it was: `s.Select(e => (byte)e)` this only works for ASCII characters. But the `char` type is for storing UTF16 Units. Now after your editing, the code is at least correct, but it varies from environment to environment, hence rendering it virtually useless. IMHO Encoding.Default should only be used for interacting with legacy Windows "Ansi codepage" code. – Sebastian Apr 13 '14 at 08:04
  • Good point. How do you feel about byte[] b = new System.Text.UTF32Encoding().GetBytes(s); ? – WonderWorker Apr 14 '14 at 08:30
  • use `byte[] b = System.Text.UTF32Encoding.GetBytes(s);`, UTF8 is equally fine. – Sebastian Apr 14 '14 at 09:12
-1

From byte[] to string:

        return BitConverter.ToString(bytes);
Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Piero Alberto
  • 3,403
  • 5
  • 46
  • 91
-2
// C# to convert a string to a byte array.
public static byte[] StrToByteArray(string str)
{
    System.Text.ASCIIEncoding  encoding=new System.Text.ASCIIEncoding();
    return encoding.GetBytes(str);
}


// C# to convert a byte array to a string.
byte [] dBytes = ...
string str;
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
str = enc.GetString(dBytes);
cyberbobcat
  • 1,159
  • 1
  • 18
  • 34
  • 7
    1) That will lose data due to using ASCII as the encoding. 2) There's no point in creating a new ASCIIEncoding - just use the Encoding.ASCII property. – Jon Skeet Jan 27 '09 at 06:35
-4

Here is the code:

// Input string.
const string input = "Dot Net Perls";

// Invoke GetBytes method.
// ... You can store this array as a field!
byte[] array = Encoding.ASCII.GetBytes(input);

// Loop through contents of the array.
foreach (byte element in array)
{
    Console.WriteLine("{0} = {1}", element, (char)element);
}
shytikov
  • 8,154
  • 7
  • 50
  • 90
-5

I had to convert a string to a byte array for a serial communication project - I had to handle 8-bit characters, and I was unable to find a method using the framework converters to do so that didn't either add two-byte entries or mis-translate the bytes with the eighth bit set. So I did the following, which works:

string message = "This is a message.";
byte[] bytes = new byte[message.Length];
for (int i = 0; i < message.Length; i++)
    bytes[i] = (byte)message[i];
IgnusFast
  • 59
  • 7
  • 3
    It's not safe this way and you will lose original data if the input string contains Unicode-range characters. – Mojtaba Rezaeian Feb 11 '16 at 19:43
  • This was for a serial communication project, which couldn't handle unicode anyway. Granted that it was an extremely narrow case. – IgnusFast Feb 06 '17 at 20:55
-12

OP's question: "How do I convert a string to a byte array in .NET (C#)?" [sic]

You can use the following code:

static byte[] ConvertString (string s) {
    return new byte[0];
}

As a benefit, encoding does not matter! Oh wait, this is an encoding... it's just trivial and highly lossy.

Thomas Eding
  • 31,027
  • 10
  • 64
  • 101