
I have found on SO that Java strings are represented as UTF-16 internally. Out of curiosity, I developed and ran the following snippet (Java 7):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class StringExperiment {
    public static void main(String... args) throws UnsupportedEncodingException {
        System.out.println(Arrays.toString("ABC".getBytes()));
    }
}

which resulted in:

[65, 66, 67]

being printed to the console output.

How does this match up with UTF-16?

Update: is there a way to write a program that prints the internal bytes of a string as-is?

Denis Kulagin
  • http://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16 – Dia Mar 02 '15 at 11:04
  • @Dia How does it go along with *"Java stores strings internally as UTF-16 and uses 2 bytes for each character."*? – Denis Kulagin Mar 02 '15 at 11:05
  • See [Java Language Specification par. 3.1](http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1) – Jesper Mar 02 '15 at 11:07

3 Answers


You seem to be misunderstanding something.

For all the system cares, and, *most of the time*, all the developer cares, chars could just as well be carrier pigeons, and Strings sequences of said carrier pigeons. Although yes, internally, strings are sequences of chars (which are, more precisely, UTF-16 code units), that is not the problem at hand here.

You don't write chars to files, nor do you read chars from files. You write, and read, bytes.

And in order to read a sequence of bytes as a sequence of chars/carrier pigeons, you need a decoder; similarly (and this is what you do here), in order to turn chars/carrier pigeons into bytes, you need an encoder. In Java, both of these are available from a Charset.

String.getBytes() just happens to use an encoder for the platform's default charset (obtained using Charset.defaultCharset()), and it happens that for your input string "ABC" and your JRE implementation, the sequence of bytes generated is 65, 66, 67. Hence the result.

Now try String.getBytes(Charset.forName("UTF-32LE")) and you'll get a different result.
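
For illustration, here is a minimal sketch (mine, not part of the original answer) showing how the same string comes out of three different encoders. Note that, unlike UTF-8 and UTF-16, UTF-32LE is not a charset every Java platform is required to support, although common JREs ship it:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String... args) {
        String s = "ABC";
        // One string, three encoders, three different byte sequences:
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // [65, 66, 67]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));
        // [0, 65, 0, 66, 0, 67]
        System.out.println(Arrays.toString(s.getBytes(Charset.forName("UTF-32LE"))));
        // [65, 0, 0, 0, 66, 0, 0, 0, 67, 0, 0, 0]
    }
}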

fge

Java's internal string representation is based on its char type and is thus UTF-16.
Unless it isn't: a modern VM (since the Java 6 Update 21 Performance Release) may try to save space by using plain ASCII (a single-byte encoding) where that suffices.

And serialization, as well as the Java Native Interface, uses modified UTF-8, essentially CESU-8 (a surrogate-agnostic variant of UTF-8) with NUL represented as two bytes to avoid embedded zeroes.
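
As a quick illustration (my sketch, not part of the original answer): DataOutputStream.writeUTF uses that modified encoding, so the two-byte NUL is directly observable:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    public static void main(String... args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeUTF("A\u0000B");
        // The first two bytes are a length prefix; NUL becomes the
        // two-byte sequence 0xC0 0x80 (-64, -128), never a raw zero byte:
        System.out.println(Arrays.toString(buffer.toByteArray()));
        // [0, 4, 65, -64, -128, 66]
    }
}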

All of that is irrelevant for your "test", though: you are asking Java to encode the string in the platform's default charset, and that is not the internal representation:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
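
Since the javadoc points at CharsetEncoder for finer control, here is a minimal sketch (mine, under the assumption that you want to see the replacement behavior) of encoding with an explicit encoder instead of getBytes():

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncoderDemo {
    public static void main(String... args) throws CharacterCodingException {
        // Unlike getBytes(), a CharsetEncoder lets you choose what happens
        // to characters the target charset cannot represent:
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '?' });
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("ABC\u00e9"));
        while (bytes.hasRemaining()) {
            System.out.print(bytes.get() + " "); // 65 66 67 63
        }
        System.out.println();
    }
}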

Deduplicator
  • Hmm, I wasn't aware that such an option existed; it may have quite an adverse effect on performance though... – fge Mar 02 '15 at 11:24
  • @fge: Yes, as much as using UTF-16 instead of UTF-8 for strings... though they presumably tested and found it a worthwhile optimization for at least *some* workloads at Oracle. – Deduplicator Mar 02 '15 at 11:26
  • Well, the thing is that when Java appeared, Unicode only had the BMP, so they elected to use an unsigned 16-bit char as a universal way to store characters :) But since then, of course, things changed... Honestly, internally storing as UTF-8 would be worse performance-wise. – fge Mar 02 '15 at 12:05
  • @fge: Sure, but only because they are doomed to have UTF-16 codepoint random-access, due to that history. – Deduplicator Mar 02 '15 at 12:14
  • See also https://stackoverflow.com/questions/8833385/support-for-compressed-strings-being-dropped-in-hotspot-jvm re: the coming and going of the -XX:+UseCompressedStrings option – DNA Jul 27 '17 at 14:24

Java Strings are indeed represented as UTF-16 internally, but you are calling the getBytes method, which does the following (my emphasis):

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's *default charset*, storing the result into a new byte array.

And your platform's default encoding is probably not UTF-16.

If you use the variant that lets you specify an encoding, you can see how the string would look in other encodings:

public byte[] getBytes(Charset charset)

If you look at the source code for java.lang.String, you can see that the String is stored internally as an array of (16-bit) chars.
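
To address the question's update, here is a minimal sketch (mine; InternalBytes is a hypothetical name) that reconstructs those internal bytes by splitting each char of the backing array into its two bytes:

import java.util.Arrays;

public class InternalBytes {
    public static void main(String... args) {
        String s = "ABC";
        char[] chars = s.toCharArray(); // a copy of the internal char array
        byte[] bytes = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            bytes[2 * i] = (byte) (chars[i] >>> 8); // high byte
            bytes[2 * i + 1] = (byte) chars[i];     // low byte
        }
        System.out.println(Arrays.toString(bytes)); // [0, 65, 0, 66, 0, 67]
    }
}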

DNA