Bytearray from string

Question

I converted a string containing the non-ASCII character 晝 to a byte array. A Java byte can hold values from -128 to 127, and the character was stored as 3 bytes: -26, -103, -99.

Here's the conversion code:

String str = "晝"; 
byte[] b = str.getBytes(); 

for(byte bt : b) 
    System.out.println(bt); 

String str1 = new String(b);
System.out.println(str1);

Can you please clarify how these 3 bytes were calculated for this character?

How did you convert your String? [This](https://docs.oracle.com/javase/tutorial/i18n/text/string.html) would help — SMA, Dec 27 '15 at 07:28
{String str = "晝"; byte[] b = str.getBytes(); for(byte bt : b) System.out.println(bt); String str1 = new String(b); System.out.println(str1);} — Senthil, Dec 27 '15 at 09:44
Possible duplicate of [How does UTF-8 "variable-width encoding" work?](http://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work) — Joe, Dec 30 '15 at 14:50

score 3 · Answer 1 · answered Dec 27 '15 at 07:32

晝 is U+665D. It looks like when you converted it, you converted it to UTF-8. UTF-8 is a variable-length encoding of Unicode characters. Characters in the range [U+0800, U+FFFF] are each encoded as 3 bytes.
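To make the 3-byte rule concrete, here is a minimal sketch of how a code point in that range is split across the UTF-8 bit pattern 1110xxxx 10xxxxxx 10xxxxxx. The class and method names (`Utf8ThreeByteSketch`, `encodeThreeByte`) are made up for illustration, and this only covers the 3-byte case; in real code you would let `String.getBytes` do the work.

```java
public class Utf8ThreeByteSketch {

    // Encodes one code point from U+0800..U+FFFF with the 3-byte UTF-8 layout:
    // 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits split as 4 + 6 + 6).
    // Hypothetical helper for illustration, not part of the JDK.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("outside the 3-byte range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogates are not encodable on their own");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // lead byte: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // continuation: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // continuation: low 6 bits
        };
    }

    public static void main(String[] args) {
        // U+665D = 0110 0110 0101 1101 in binary
        for (byte b : encodeThreeByte(0x665D)) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints: E6 99 9D
    }
}
```

The lead byte's `1110` prefix tells a decoder that two continuation bytes follow, and each continuation byte starts with `10`.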

According to this converter, U+665D is E6 99 9D in UTF-8 in hex, or 230 153 157 in decimal. A Java byte is signed and ranges from -128 to 127, so values larger than 127 are displayed as the value minus 256. As bytes, 230 153 157 therefore become 230-256, 153-256, 157-256, that is -26 -103 -99, which is what you're seeing.
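The subtraction can be checked in code. The following is a small sketch, with a hypothetical class name `SignedByteSketch`, that masks each signed byte with `0xFF` to recover the unsigned value. `Byte.toUnsignedInt` (Java 8 and later) does the same thing.

```java
public class SignedByteSketch {

    // Reinterprets a signed Java byte (-128..127) as an unsigned value (0..255).
    // Equivalent to Byte.toUnsignedInt(b).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bytes = { -26, -103, -99 }; // the values printed in the question
        for (byte b : bytes) {
            int u = toUnsigned(b);
            System.out.println(b + " -> " + u + " -> 0x" + Integer.toHexString(u).toUpperCase());
        }
        // prints:
        // -26 -> 230 -> 0xE6
        // -103 -> 153 -> 0x99
        // -99 -> 157 -> 0x9D
    }
}
```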

score 1 · Answer 2 · answered Dec 27 '15 at 07:35

All conversions from characters to bytes use some character set to do the encoding.

You don't say, but I assume that you did the conversion using String.getBytes(). This is simply a shortcut for String.getBytes(Charset.defaultCharset()), and the default Charset depends on your particular Java environment. The three values you report are (in hex) 0xE6 0x99 0x9D, which is the UTF-8 encoding of U+665D (Unicode Han Character 'daytime, daylight'). Since that's the character that you report having started with, presumably the default character set for your environment is UTF-8 (which is not a surprise, but not something you can count on everywhere).
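Because the default can differ between environments, one way to avoid depending on it is to pass the charset explicitly on both the encoding and decoding side. This is a sketch under that assumption; the class name `ExplicitCharsetSketch` and its two helpers are illustrative only, and the string is written as a `\u` escape so the source file's own encoding does not affect the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetSketch {

    // Thin illustrative wrappers around the JDK calls that take an explicit charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String str = "\u665D"; // the character 晝

        // What getBytes() with no argument would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Same string, different charsets, different bytes.
        System.out.println(Arrays.toString(encode(str, StandardCharsets.UTF_8)));      // [-26, -103, -99]
        System.out.println(Arrays.toString(encode(str, StandardCharsets.UTF_16BE)));   // [102, 93]
        // ISO-8859-1 cannot represent this character, so getBytes substitutes '?' (63).
        System.out.println(Arrays.toString(encode(str, StandardCharsets.ISO_8859_1))); // [63]

        // Decoding with the same charset that was used for encoding restores the string.
        byte[] utf8 = encode(str, StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(str)); // true
    }
}
```

The ISO-8859-1 line shows why the charset matters: a charset that cannot represent the character silently loses it, and decoding the bytes later cannot bring it back.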

Ted Hopp


Thanks for clarifying – Senthil Dec 27 '15 at 07:48
Here is the code I used: {String str = "晝"; byte[] b = str.getBytes(); for(byte bt : b) System.out.println(bt); String str1 = new String(b); System.out.println(str1);} I haven't specified an encoding in the code. Eclipse shows cp1252 as the default encoding. I am executing the code on a Windows server. Where is UTF-8 being picked up from? Can you please help me understand? – Senthil Dec 27 '15 at 08:59
@Senthil - That's pretty much what I thought you were doing. It's using the default character set for your environment, which is most likely UTF-8 (or at least agrees with UTF-8 for that particular character). You can report your default charset using `System.out.println(Charset.defaultCharset().toString());` – Ted Hopp Dec 27 '15 at 09:04
Yes, thanks Ted Hopp. It seems UTF-8 is the default character set picked up from the Windows 2008 server. If I wish to change it to a different encoding, where should I do that? – Senthil Dec 27 '15 at 09:38
