
The documentation on RtlUnicodeStringToAnsiString is rather vague about its possible failures - by vague I mean it doesn't say anything about them.

I'm not exactly sure how/if it deals with different encodings, or if my understanding is so flawed that it doesn't even come into the equation, but let's assume input is UTF-16 for argument's sake.

If all the characters are within the ASCII range then there is no problem: each code unit can simply be truncated, dropping the high-order byte. The first 128 Unicode code points are the ASCII characters, and UTF-16 encodes U+0000 to U+D7FF as single code units numerically equal to the code points.[1][2]

Note: UNICODE_STRING has a WCHAR* Buffer, and ANSI_STRING a CHAR* Buffer, as may be expected.

[Skipping over 129-255 and locales/codepages]

What happens with characters above 255? There is a separate RtlUnicodeToUTF8N function, so it seems safe to assume RtlUnicodeStringToAnsiString doesn't convert to UTF-8.

How about code points outside the BMP (surrogate pairs and whatnot)?
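To make the surrogate-pair case concrete, here is a minimal sketch of how one code point becomes UTF-16 code units. The name encode_utf16 is hypothetical and is not a Windows API; the sketch assumes the input is a valid scalar value.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not in D800..DFFF).
std::vector<char16_t> encode_utf16(std::uint32_t cp)
{
    // BMP code points are stored as a single code unit of the same value.
    if (cp < 0x10000)
        return { static_cast<char16_t>(cp) };

    // Supplementary code points are split into a high and a low surrogate,
    // each carrying 10 bits of (cp - 0x10000).
    cp -= 0x10000;
    return {
        static_cast<char16_t>(0xD800 + (cp >> 10)),
        static_cast<char16_t>(0xDC00 + (cp & 0x3FF))
    };
}
```

For example, U+1F600 comes out as the two code units 0xD83D 0xDE00, so a conversion that works one WCHAR at a time sees two values, neither of which is a character on its own.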

I saw a function that does something like the code below:

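// Narrow a NUL-terminated wide string by keeping only the low byte of each WCHAR.
char *pTarget = reinterpret_cast<char*>(char_str);
const WCHAR *pSource = reinterpret_cast<const WCHAR*>(wchar_str);

for ( long i = 0; i < targetMaxSizeInBytes; i++ )
{
    // static_cast discards the high-order byte of the UTF-16 code unit.
    *pTarget = static_cast<char>(*pSource);

    // Stop once the terminator has been copied.
    if (L'\0' == *pSource)
        break;

    pTarget++;
    pSource++;
}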

This would cause problems with any non-ASCII characters, correct?
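To show what that kind of narrowing does to non-ASCII input, here is a portable sketch of the same idea. It uses char16_t and std::string instead of WCHAR and a raw buffer, and narrow_by_truncation is a hypothetical name, so treat it as an illustration of the loop above rather than the original function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the loop above: keep only the low byte of
// each UTF-16 code unit. Stops at the terminator or after max_bytes bytes.
std::string narrow_by_truncation(const char16_t* src, std::size_t max_bytes)
{
    std::string out;
    for (std::size_t i = 0; i < max_bytes && src[i] != u'\0'; ++i)
        out.push_back(static_cast<char>(src[i]));  // high byte is discarded
    return out;
}
```

With this, u"abc" survives unchanged, but U+0141 (Ł) silently becomes 'A' because only the low byte 0x41 is kept, and U+03A3 (Σ) becomes the single byte 0xA3, whose meaning depends on whatever code page the reader assumes.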

Update:

From RbMm's answer:

RtlUnicodeStringToAnsiString is a shell over the RtlUnicodeToMultiByteN routine

I get a little more information:

Like RtlUnicodeToMultiByteSize, RtlUnicodeToMultiByteN supports only precomposed Unicode characters that are mapped to the current system ANSI code page installed at system boot.

WideCharToMultiByte has an option to be notified if a default character is used in the conversion for a character that cannot be represented in the specified code page:

lpUsedDefaultChar [out, optional]

Pointer to a flag that indicates if the function has used a default character in the conversion. The flag is set to TRUE if one or more characters in the source string cannot be represented in the specified code page. Otherwise, the flag is set to FALSE. This parameter can be set to NULL.

However, it seems RtlUnicodeToMultiByteN, and therefore RtlUnicodeStringToAnsiString, simply don't support characters outside the current code page?

I tried a few characters and got seemingly random conversions (see below) - more importantly, I got STATUS_SUCCESS returned.

U+03A3 Σ -> 0n83 'S'
U+03A4 Τ -> 0n63 '?'
U+03A5 Υ -> 0n63 '?'
U+03A6 Φ -> 0n70 'F'
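For comparison, the "used default character" idea from the lpUsedDefaultChar description quoted above can be sketched portably. This is only an illustration: to_latin1 is a hypothetical function that targets ISO-8859-1, it is not the table-driven conversion Windows performs, and it does no best-fit mapping.

```cpp
#include <string>

// Hypothetical stand-in for a code-page conversion. It targets ISO-8859-1
// only, where U+0000..U+00FF map to the byte of the same value. Any other
// code unit is replaced with default_char and reported through used_default.
// Note: each half of a surrogate pair is replaced separately in this sketch.
std::string to_latin1(const std::u16string& src, char default_char, bool& used_default)
{
    std::string out;
    used_default = false;
    for (char16_t unit : src) {
        if (unit <= 0xFF) {
            out.push_back(static_cast<char>(unit));
        } else {
            out.push_back(default_char);
            used_default = true;  // at least one character was not representable
        }
    }
    return out;
}
```

A caller can then treat used_default == true as a lossy conversion even though a string was produced, which is the kind of signal the STATUS_SUCCESS result above does not give.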
Ramon
  • ANSI is not ASCII. Have a read of https://stackoverflow.com/questions/701882/what-is-ansi-format and note that _"The translation is done with respect to the current system locale information."_ – Richard Critten Jun 08 '17 at 17:19
  • @RichardCritten I know, but ANSI is always the same as ASCII for the first 128 characters, so I "[skipped] over 129-255 and locales/codepages". – Ramon Jun 08 '17 at 17:27

2 Answers

RtlUnicodeStringToAnsiString is a shell over the RtlUnicodeToMultiByteN routine

The RtlUnicodeToMultiByteN routine translates the specified Unicode string into a new character string, using the current system ANSI code page (ACP). The translated string is not necessarily from a multibyte character set.

so both of these routines perform the same conversion as WideCharToMultiByte with CP_ACP

The following routines also exist:

RtlUnicodeStringToOemString - a shell over the RtlUnicodeToOemN routine

The RtlUnicodeToOemN routine translates a given Unicode string to an OEM string, using the current system OEM code page.

so these routines perform the same conversion as WideCharToMultiByte with CP_OEMCP

For UTF-8 conversions there are RtlUnicodeToUTF8N (converts a Unicode string to a UTF-8 string) and RtlUTF8ToUnicodeN (converts a UTF-8 string to a Unicode string).
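As a rough picture of what a UTF-16 to UTF-8 conversion involves, here is a portable sketch. The name utf16_to_utf8 is hypothetical and this is not the implementation of RtlUnicodeToUTF8N; in particular, replacing unpaired surrogates with U+FFFD is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: convert UTF-16 to UTF-8.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& src)
{
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        std::uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src.size()
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }

        if (cp < 0x80) {                 // 1 byte
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {         // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {       // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                         // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Unlike a conversion to a single-byte code page, nothing is lost for well-formed input: U+03A3 (Σ) becomes the bytes CE A3 and U+1F600 becomes F0 9F 98 80.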

For a custom code page you can use an undocumented API:

NTSYSAPI
NTSTATUS
NTAPI
RtlCustomCPToUnicodeN(
    _In_ PCPTABLEINFO CustomCP,
    _Out_writes_bytes_to_(MaxBytesInUnicodeString, *BytesInUnicodeString) PWCH UnicodeString,
    _In_ ULONG MaxBytesInUnicodeString,
    _Out_opt_ PULONG BytesInUnicodeString,
    _In_reads_bytes_(BytesInCustomCPString) PCH CustomCPString,
    _In_ ULONG BytesInCustomCPString
    );

The key point is initializing CPTABLEINFO, so you can use any USHORT CodePage here.

RbMm

Not sure if this helps, but I have used WideCharToMultiByte before to convert from UTF-16 (wchar_t*) to UTF-8 (char*), passing CP_UTF8 as the code page.

Edit: I just noted the kernel tag. The function I quoted is in user mode (kernel32.dll), so probably not useful for kernel mode code. :(

Cerius
  • Yes this is running in kernel-mode drivers. I'm also concerned with understanding existing code. Thanks for the tip though :) – Ramon Jun 08 '17 at 19:01
  • No problem! I agree that the documentation of error conditions for that function is really vague. It would make sense that any UTF-16 chars above 0x007f would map to equivalents in the loaded ANSI codepage, when possible. Not sure if the UTF-16 chars (2 or 4-byte chars) that can't be mapped would be written as '?' like some Win32 functions do. – Cerius Jun 08 '17 at 19:18
  • By written as '?' you mean actually putting U+003F in place of whatever 2 or 4-byte character was there? About the only thing the documentation says is that if the return code isn't STATUS_SUCCESS then "no storage was allocated and no conversion was done", so I would hope it doesn't change characters and still return STATUS_SUCCESS. – Ramon Jun 08 '17 at 20:18
  • You're probably right that it doesn't do the conversion if it doesn't know what to do. Who knows... ;-) – Cerius Jun 09 '17 at 12:29