
I have C++ code which I want to rewrite in C#. This part:

  case ID_TYPE_UNICODE_STRING:
      // GetUString() returns a std::wstring
      if (items[i].GetUString().length() > 0xFFFF)
          throw dppError("error");
      DataSize = (WORD)(sizeof(WCHAR) * items[i].GetUString().length());
      blob.AppendData((const BYTE *)&DataSize, sizeof(WORD)); // blob is a byte array
      blob.AppendData((const BYTE *)items[i].GetUString().c_str(), DataSize);
      break;

Basically, this serializes the length in bytes of a Unicode string, followed by the string itself, into a byte array.
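
For reference, here is how I would sketch the same framing in C# (my own sketch, assuming for now that the wchar_t data is UTF-16 little-endian, as it is on Windows; whether that assumption holds is exactly my question):

using System;
using System.IO;
using System.Text;

static byte[] SerializeUnicodeString(string s)
{
    // Encoding.Unicode is UTF-16 little-endian, matching 2-byte WCHARs on Windows
    byte[] payload = Encoding.Unicode.GetBytes(s);
    if (payload.Length > 0xFFFF)
        throw new ArgumentException("string too long for a WORD length prefix");

    using (var ms = new MemoryStream())
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write((ushort)payload.Length); // 2-byte little-endian length, like the WORD
        writer.Write(payload);                // the raw string bytes
        return ms.ToArray();
    }
}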

Here comes my problem (this code then sends the data to a server): I don't know which encoding is used in the lines above (UTF-16, UTF-8, etc.), so I don't know the best way to reimplement it in C#. How can I figure out which encoding the C++ project uses?

And if I can't determine the encoding used in the C++ project, then, given that the endianness is the same as stated in the accepted answer of this question, do you think the two methods (GetBytes and GetString) from that accepted answer will work for me (for serializing the Unicode string as in the C++ project and retrieving it back)? I.e. these two:

static byte[] GetBytes(string str)
{
    // Copies the string's raw UTF-16 code units into a byte array (native byte order)
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

static string GetString(byte[] bytes)
{
    // Reinterprets the bytes as UTF-16 code units and rebuilds the string
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

Or am I better off learning what encoding is used in the C++ project?

I will then need to reconstruct the string from the byte array in the same way. And if I am better off learning which encoding was used in the C++ code, how do I get the length of the string in bytes in C#? Using System.Text.Encoding.<WhateverEncodingWasUsed>.GetByteCount(string)?

PS. Do you think the C++ code works in an encoding-agnostic way? If yes, how can I replicate that in C#?

UPDATE: I am guessing the encoding used is UTF-16, because I saw it mentioned in several variable names, so I will assume UTF-16 and, if something doesn't work out during testing, look for alternative solutions. In that case, what is the best way to get the number of bytes of a UTF-16 string? Is the following method OK: System.Text.Encoding.Unicode.GetByteCount(string)?
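
To illustrate what I mean (my own snippet):

string s = "some text";
int byteCount = System.Text.Encoding.Unicode.GetByteCount(s);
// For UTF-16 this is always s.Length * 2: every .NET char is one 2-byte
// UTF-16 code unit (a surrogate pair is two chars, hence 4 bytes)
Console.WriteLine(byteCount); // 18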

Feedback and comments are welcome. Am I wrong somewhere in my reasoning? Thanks.

  • There's a lot of stuff here which comes from a library, like `GetUString`, `Blob`, `ID_TYPE_UNICODE_STRING`. What library are you using? You will likely find the answers to those questions in the documentation of that library. – roeland Oct 13 '15 at 01:53
  • @roeland: the library is part of the C++ project itself, written by the same developers who wrote the project. – Oct 13 '15 at 08:29
    Since they are using `sizeof(WCHAR)` and the length of that *UString* when copying, probably they are writing UTF-16 with native endianness (little-endian on Intel) into the byte array. – roeland Oct 13 '15 at 20:57
  • @roeland: I have the same feeling. – Oct 13 '15 at 21:15

1 Answer


Change the method as follows to get the byte[] equivalent of the input string:

static byte[] GetBytes(string str)
{
    // UnicodeEncoding with no arguments is UTF-16 little-endian
    UnicodeEncoding uEncoding = new UnicodeEncoding();
    byte[] stringContentBytes = uEncoding.GetBytes(str);
    return stringContentBytes;
}

And for the reverse:

static string GetString(byte[] bytes)
{
    // Must use the same encoding that produced the bytes
    UnicodeEncoding uEncoding = new UnicodeEncoding();
    string stringContent = uEncoding.GetString(bytes);
    return stringContent;
}
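
Note that new UnicodeEncoding() defaults to UTF-16 little-endian, so the static Encoding.Unicode instance is an equivalent shortcut:

byte[] data = System.Text.Encoding.Unicode.GetBytes("example");
string back = System.Text.Encoding.Unicode.GetString(data); // "example"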
– sujith karivelil
  • Yeah, but that assumes UTF-16 encoding, right? I said I don't know which encoding is used by the C++ project. – Oct 12 '15 at 12:40