Convert UTF-8 string to sbyte array and back?

Question

How can I convert the UTF-8 characters in a string into an array of sbytes and back? I can't seem to find a fitting method in Encoding.UTF8. Thanks

Edit: To clarify. I don't want an array of bytes. I want an array of UTF-8 characters.

Edit: I just realized I can iterate over the string and convert all chars into ints to get their int32 representation. Is it possible to use UTF-8 instead?

How are your "UTF-8 characters" represented? Since C# chars/strings are not UTF-8, I assume you either have a byte array (but then you wouldn't ask such a question) OR you have the characters in a file or stream... If your characters are in a stream/file, then just directly calling `Stream.Read` or `File.ReadAllBytes` would be the solution. Please clarify. — Alexei Levenkov, Jan 11 '15 at 05:25
It's a WPF-Textbox string with UTF-8 characters in it. VS represents them correctly in debug, so I assumed it would use UTF-8. What does it use? UTF-32? — pixartist, Jan 11 '15 at 12:30
See http://stackoverflow.com/questions/472906/converting-a-string-to-byte-array-without-using-an-encoding-byte-by-byte — Mihai8, Jan 11 '15 at 12:50
I'm not convinced you know what UTF-8 is. Look it up to make sure. It's not identical to Unicode. — usr, Jan 11 '15 at 13:34
I know what UTF-8 is. It encodes characters with a variable byte length to stay compatible with ASCII. — pixartist, Jan 11 '15 at 14:45
Hm, so you are not interested in getting UTF-8 bytes. What do you need UTF-8 for, then? There is no such thing as a "UTF-8 character". — usr, Jan 11 '15 at 14:48
I'm at a complete loss as to what you are looking for. Would you mind showing the expected result for the C# string "HellФ"? — Alexei Levenkov, Jan 11 '15 at 21:00
@pixartist, for example, if the input is "中文" (two Chinese characters), which one of the following are you expecting? 1) "E4B8ADE69687". 2) "5Lit5paH". The two characters are split into 6 bytes in UTF-8. The first representation is the UTF-8 bytes written in hex. The second is called `Base64`. — jiulongw, Jan 12 '15 at 05:12

Richard Schneider · Answer 1 · 2015-01-11T05:54:06.893

2

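A string in C# is a sequence of UTF-16 code units (16 bits each). Characters outside the Basic Multilingual Plane take two code units (a surrogate pair), which is where it differs from UCS-2.

To convert a C# string to UTF-8 bytes and back, do the following: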

using System.Text;

var s = "plain text";
byte[] encoded = Encoding.UTF8.GetBytes(s);         // string -> UTF-8 bytes
string decoded = Encoding.UTF8.GetString(encoded);  // UTF-8 bytes -> string
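
Since the question asks for sbyte rather than byte, the extra step might be sketched as follows. This assumes the goal is simply to reinterpret each UTF-8 byte as a signed value; `SByteUtf8`, `ToSBytes`, and `FromSBytes` are hypothetical names chosen for illustration, not part of the .NET API.

```csharp
using System;
using System.Text;

static class SByteUtf8
{
    // Hypothetical helper: UTF-8 encode the string, then copy the raw
    // bytes into an sbyte[] so that values above 127 become negative.
    public static sbyte[] ToSBytes(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        var result = new sbyte[bytes.Length];
        Buffer.BlockCopy(bytes, 0, result, 0, bytes.Length);
        return result;
    }

    // Hypothetical helper: copy the signed bytes back into a byte[]
    // and decode them as UTF-8.
    public static string FromSBytes(sbyte[] data)
    {
        var bytes = new byte[data.Length];
        Buffer.BlockCopy(data, 0, bytes, 0, data.Length);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For "HellФ" this yields six values, because Ф (U+0424) encodes to the two bytes 0xD0 0xA4, which show up as -48 and -92 once reinterpreted as sbyte.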

edited Jan 11 '15 at 05:54

answered Jan 11 '15 at 03:55

As part of the question, to convert it back, use `var back = Encoding.UTF8.GetString(encoded)` – jiulongw Jan 11 '15 at 05:50
I don't understand. GetBytes gives me a byte array, but since utf-8 characters can be much longer than a byte, they would stretch over several bytes. What I need is an array of numerical representations of the byte values of each character of a utf-8 string, not of every byte. – pixartist Jan 11 '15 at 12:32

score 0 · Answer 2 · edited May 23 '17 at 12:27

It seems you want not characters but code points. In that case, look at this SO answer.

This code:

static IEnumerable<int> AsCodePoints(this string s)
{
    for(int i = 0; i < s.Length; ++i)
    {
        yield return char.ConvertToUtf32(s, i);
        if(char.IsHighSurrogate(s, i))
            i++;
    }
}

This allows you to iterate over every code point of your string. If you want, you can then encode each code point into a UTF-8 byte array.
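
One possible way to do that last step is sketched below: it walks the string by code point, as the loop above does, and UTF-8 encodes each code point separately, so you get one byte[] per character. The names `CodePointUtf8` and `Utf8PerCodePoint` are made up for this example, and the input is assumed to contain no unpaired surrogates (`char.ConvertToUtf32` throws on those).

```csharp
using System.Collections.Generic;
using System.Text;

static class CodePointUtf8
{
    // Hypothetical helper: returns the UTF-8 encoding of each code point
    // in s as its own byte array (1 to 4 bytes per code point).
    public static List<byte[]> Utf8PerCodePoint(string s)
    {
        var result = new List<byte[]>();
        for (int i = 0; i < s.Length; ++i)
        {
            int codePoint = char.ConvertToUtf32(s, i);
            // Turn the code point back into a one-character string
            // (one or two chars) and encode just that piece.
            result.Add(Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint)));
            // A surrogate pair occupies two chars, so skip the low half.
            if (char.IsHighSurrogate(s, i))
                i++;
        }
        return result;
    }
}
```

For "中文" this returns two arrays of three bytes each, E4 B8 AD and E6 96 87, which matches the hex representation mentioned in the comments on the question.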

Btw.: You cannot have "an array of UTF-8 characters" because there is no data type for a UTF-8 character. The best you can get is a char (a UTF-16 code unit) or a byte[] that holds the UTF-8 encoding of a code point. As UTF-8 is a convention for how to translate text into a byte[], the notion of a "UTF-8 character" seems contradictory.
