
How can I convert the UTF-8 characters in a string into an array of sbytes and back? I can't seem to find a fitting method in Encoding.UTF8. Thanks

Edit: To clarify. I don't want an array of bytes. I want an array of UTF-8 characters.

Edit: I just realized I can iterate over the string and convert each char to an int to get its Int32 representation. Is it possible to use UTF-8 instead?

pixartist
  • How are your "UTF-8 characters" represented? Since C# chars/strings are not UTF-8, I assume you have either a byte array (but then you would not ask such a question) or the characters in a file or stream. If your characters are in a stream/file, then directly calling `Stream.Read` or `File.ReadAllBytes` would be the solution. Please clarify. – Alexei Levenkov Jan 11 '15 at 05:25
  • It's a WPF TextBox string with UTF-8 characters in it. VS displays them correctly in the debugger, so I assumed it would use UTF-8. What does it use? UTF-32? – pixartist Jan 11 '15 at 12:30
  • See http://stackoverflow.com/questions/472906/converting-a-string-to-byte-array-without-using-an-encoding-byte-by-byte – Mihai8 Jan 11 '15 at 12:50
  • I'm not convinced you know what UTF-8 is. Look it up to make sure. It's not identical to Unicode. – usr Jan 11 '15 at 13:34
  • I know what UTF-8 is. It encodes characters with a variable byte length to stay compatible with ASCII. – pixartist Jan 11 '15 at 14:45
  • Hm, so you are not interested in getting UTF-8 bytes. What do you need UTF-8 for, then? There is no such thing as a "UTF-8 character". – usr Jan 11 '15 at 14:48
  • I'm at a complete loss as to what you are looking for. Would you mind showing the expected result for the C# string "HellФ"? – Alexei Levenkov Jan 11 '15 at 21:00
  • @pixartist, for example, if the input is "中文" (two Chinese characters), which one of the following are you expecting? 1) "E4B8ADE69687". 2) "5Lit5paH". The two characters are split into 6 bytes in UTF-8. The first representation is the UTF-8 bytes in hex. The second is the same bytes encoded as `Base64`. – jiulongw Jan 12 '15 at 05:12

2 Answers

2

A string in C# is UTF-16: a sequence of 16-bit code units. Characters outside the Basic Multilingual Plane take two code units (a surrogate pair).

To convert a C# string to UTF-8 bytes and back, do the following:

using System.Text;

var s = "plain text";
byte[] encoded = Encoding.UTF8.GetBytes(s);        // string -> UTF-8 bytes
string decoded = Encoding.UTF8.GetString(encoded); // UTF-8 bytes -> string
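
Since the question asks for sbytes specifically, here is a minimal sketch of the same round trip with signed bytes. The class and method names (`Utf8RoundTrip`, `ToSBytes`, `FromSBytes`) are made up for illustration; only `Encoding.UTF8` and `Array.ConvertAll` come from the framework.

```csharp
using System;
using System.Text;

public static class Utf8RoundTrip
{
    // Hypothetical helper: encode the string as UTF-8, then reinterpret
    // each byte (0..255) as a signed byte (-128..127).
    public static sbyte[] ToSBytes(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Array.ConvertAll(bytes, b => unchecked((sbyte)b));
    }

    // Hypothetical helper: reverse the reinterpretation and decode as UTF-8.
    public static string FromSBytes(sbyte[] data)
    {
        byte[] bytes = Array.ConvertAll(data, b => unchecked((byte)b));
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For "HellФ" this yields six sbytes, not five: the four ASCII letters take one byte each, while "Ф" (U+0424) is encoded as the two bytes 0xD0 0xA4, which show up as -48 and -92 when viewed as signed values.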
Richard Schneider
  • As part of the question, to convert it back, use `var back = Encoding.UTF8.GetString(encoded)` – jiulongw Jan 11 '15 at 05:50
  • I don't understand. GetBytes gives me a byte array, but since UTF-8 characters can be longer than one byte, they stretch over several bytes. What I need is an array with one numerical value per character of a UTF-8 string, not one per byte. – pixartist Jan 11 '15 at 12:32
0

It seems you want code points rather than characters. In that case, look at this SO answer.

This code:

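// Extension method: must be declared inside a static class.
static IEnumerable<int> AsCodePoints(this string s)
{
    for(int i = 0; i < s.Length; ++i)
    {
        // Combines a surrogate pair at i into one code point; throws on an unpaired surrogate.
        yield return char.ConvertToUtf32(s, i);
        // A surrogate pair occupies two chars, so skip the low surrogate.
        if(char.IsHighSurrogate(s, i))
            i++;
    }
}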

Allows you to iterate over every code point of your string. If you want, you can then encode each code point into a UTF-8 byte array.
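
That last step could look roughly like the sketch below. `CodePointUtf8` and `Utf8PerCodePoint` are names invented here, not framework APIs; the sketch assumes the input contains no unpaired surrogates, because `char.ConvertToUtf32` throws an `ArgumentException` on those.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CodePointUtf8
{
    // Hypothetical helper: returns one byte[] per code point, each holding
    // that code point's UTF-8 encoding (1 to 4 bytes).
    public static List<byte[]> Utf8PerCodePoint(string s)
    {
        var result = new List<byte[]>();
        for (int i = 0; i < s.Length; ++i)
        {
            int codePoint = char.ConvertToUtf32(s, i);
            // ConvertFromUtf32 turns the code point back into a 1- or 2-char string.
            result.Add(Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint)));
            // Skip the low surrogate of a pair that was just consumed.
            if (char.IsHighSurrogate(s, i))
                i++;
        }
        return result;
    }
}
```

For "中文" this gives two arrays of three bytes each (E4 B8 AD and E6 96 87), so the per-character grouping is preserved instead of being flattened into one six-byte array.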

Btw.: You cannot have "an array of UTF-8 characters" because there is no data type for a UTF-8 character. The best you can get is a char (a UTF-16 code unit) or a byte[] holding the UTF-8 encoding of a code point. Since UTF-8 is a convention for translating text into byte[], the notion of a "UTF-8 character" is contradictory.

DasKrümelmonster