I converted a string containing a non-ASCII character, 晝, to a byte array. A Java byte holds values from -128 to 127, and the character was stored as 3 bytes: -26, -103, -99.

Here's the conversion code:

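    String str = "晝";
    byte[] b = str.getBytes();      // no charset given: uses the platform default

    // Print each encoded byte as a signed value
    for (byte bt : b)
        System.out.println(bt);

    // Decode with the same platform default charset
    String str1 = new String(b);
    System.out.println(str1);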

Can you please clarify how these 3 bytes were calculated for this character?

hyde
Senthil
  • How did you convert your String? [This](https://docs.oracle.com/javase/tutorial/i18n/text/string.html) would help – SMA Dec 27 '15 at 07:28
  • {String str = "晝"; byte[] b = str.getBytes(); for(byte bt : b) System.out.println(bt); String str1 = new String(b); System.out.println(str1);} – Senthil Dec 27 '15 at 09:44
  • Possible duplicate of [How does UTF-8 "variable-width encoding" work?](http://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work) – Joe Dec 30 '15 at 14:50

2 Answers

晝 is U+665D. It looks like your conversion encoded it as UTF-8. UTF-8 is a variable-length encoding of Unicode characters, and code points in the range U+0800 to U+FFFF are encoded as 3 bytes.
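As a rough illustration of the 3-byte case, the sketch below packs the 16 bits of a code point into the UTF-8 pattern `1110xxxx 10xxxxxx 10xxxxxx`. The class and method names (`Utf8ThreeByteSketch`, `encodeThreeBytes`) are made up for this example, and it does not reject the surrogate range U+D800 to U+DFFF, which a real encoder would.

```java
class Utf8ThreeByteSketch {
    // Hypothetical helper: encodes a code point in U+0800..U+FFFF using the
    // 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
    // Values are returned as ints (0..255) so they print unsigned.
    static int[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the 3-byte range");
        }
        int b1 = 0xE0 | (codePoint >> 12);          // lead byte: top 4 bits
        int b2 = 0x80 | ((codePoint >> 6) & 0x3F);  // continuation: middle 6 bits
        int b3 = 0x80 | (codePoint & 0x3F);         // continuation: low 6 bits
        return new int[] { b1, b2, b3 };
    }

    public static void main(String[] args) {
        // U+665D is the character from the question
        for (int b : encodeThreeBytes(0x665D)) {
            System.out.printf("0x%02X (%d)%n", b, b);
        }
        // prints:
        // 0xE6 (230)
        // 0x99 (153)
        // 0x9D (157)
    }
}
```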

According to this converter, U+665D is E6 99 9D in UTF-8 (in hex; that is 230 153 157 in decimal, which will be needed in a bit). Because a Java byte is signed and ranges from -128 to 127, values larger than 127 are displayed as the value minus 256. So as bytes, 230 153 157 becomes 230-256, 153-256, 157-256, or -26 -103 -99, which is what you're seeing.
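The signed/unsigned arithmetic can be checked with a few lines of Java. This is only a sketch: the class and method names (`SignedByteSketch`, `toSigned`, `toUnsigned`) are invented for the example.

```java
class SignedByteSketch {
    // Hypothetical helper: narrows a value in 0..255 to a signed Java byte.
    // Values above 127 wrap around to value - 256.
    static byte toSigned(int unsigned) {
        return (byte) unsigned;
    }

    // Hypothetical helper: recovers the 0..255 value from a signed byte
    // by masking off the sign extension.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        int[] utf8 = { 0xE6, 0x99, 0x9D };  // 230, 153, 157
        for (int value : utf8) {
            byte signed = toSigned(value);
            System.out.println(value + " -> " + signed + " -> " + toUnsigned(signed));
        }
        // prints:
        // 230 -> -26 -> 230
        // 153 -> -103 -> 153
        // 157 -> -99 -> 157
    }
}
```

Masking with `& 0xFF` is a common way to print the bytes in the 0 to 255 form that matches hex tables.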

blm

All conversions from characters to bytes use some character set to do the encoding.

You don't say, but I assume that you did the conversion using `String.getBytes()`. This is simply a shortcut for `String.getBytes(Charset.defaultCharset())`, and the default Charset depends on your particular Java environment. The three values you report are (in hex) 0xE6 0x99 0x9D, which is the UTF-8 encoding of U+665D (Unicode Han Character 'daytime, daylight'). Since that's the character that you report having started with, presumably the default character set for your environment is UTF-8 (which is not a surprise, but not something you can count on everywhere).
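To avoid depending on the environment, the charset can be passed explicitly. The following is a minimal sketch, assuming you want UTF-8 for both encoding and decoding; the class name `ExplicitCharsetSketch` and the `encode` helper are invented for the example, and the character is written as a `\u` escape so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetSketch {
    // Hypothetical helper: encodes a string with an explicitly chosen charset
    // instead of the platform default.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String str = "\u665D";  // 晝

        // Environment-dependent, so no expected output is shown for this line
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] utf8 = encode(str, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8));     // [-26, -103, -99]

        // The same character in a different encoding gives different bytes
        byte[] utf16 = encode(str, StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf16));    // [102, 93]

        // Decode with the same charset that was used to encode
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(str));     // true
    }
}
```

Naming the charset on both `getBytes` and `new String` keeps the result the same regardless of what `Charset.defaultCharset()` reports on a given machine.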

Ted Hopp
  • Thanks for clarifying – Senthil Dec 27 '15 at 07:48
  • Here is the code I used {String str = "晝"; byte[] b = str.getBytes(); for(byte bt : b) System.out.println(bt); String str1 = new String(b); System.out.println(str1);} I haven't specified an encoding in the code. Eclipse shows the default encoding as cp1252. I am executing the code on a Windows server. Where is UTF-8 picked up from? Can you please help me understand? – Senthil Dec 27 '15 at 08:59
  • @Senthil - That's pretty much what I thought you were doing. It's using the default character set for your environment, which is most likely UTF-8 (or at least agrees with UTF-8 for that particular character). You can report your default charset using `System.out.println(Charset.defaultCharset().toString());` – Ted Hopp Dec 27 '15 at 09:04
  • Yes, thanks Ted Hopp. It seems UTF-8 is the default character set picked up from the Windows 2008 server. If I wish to change it to a different encoding, where should I do it? – Senthil Dec 27 '15 at 09:38