8

I am receiving a String via an object from an Axis web service. Because I'm not getting the string I expected, I checked it by converting the string into bytes, and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hex, when I'm expecting E4BDA0 E5A5BD E59097, which is 你好吗 in UTF-8.

Any ideas what might be causing 你好吗 to become C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297? I did a Google search, but all I found was a Chinese website describing a problem that happens in Python. Any insights would be great, thanks!

Maurice

2 Answers

17

You have what is known as a double encoding.

You have the three character sequence "你好吗" which you correctly point out is encoded in UTF-8 as E4BDA0 E5A5BD E59097.

But now, treat each byte of THAT encoding as a code point and encode it in UTF-8 again. Start with E4. What is the code point U+00E4 in UTF-8? Try it! It's C3 A4!

You get the idea.... :-)
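To make the arithmetic concrete, here is a minimal sketch of how a code point between U+0080 and U+07FF becomes two UTF-8 bytes. The class name `TwoByteUtf8` and the helper `encodeTwoByte` are made up for illustration; they are not part of any library.

```java
public class TwoByteUtf8 {
    // Hypothetical helper: encodes a code point in U+0080..U+07FF as two UTF-8 bytes.
    static int[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte code point");
        }
        int lead = 0xC0 | (codePoint >> 6);     // 110xxxxx carries the top 5 bits
        int trail = 0x80 | (codePoint & 0x3F);  // 10xxxxxx carries the low 6 bits
        return new int[] { lead, trail };
    }

    public static void main(String[] args) {
        // The three bytes of the first character, each treated as a code point.
        for (int b : new int[] { 0xE4, 0xBD, 0xA0 }) {
            int[] enc = encodeTwoByte(b);
            System.out.printf("%02X -> %02X %02X%n", b, enc[0], enc[1]);
        }
    }
}
```

Run on E4, BD and A0, this gives C3 A4, C2 BD and C2 A0, which is the start of the hex dump in the question.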

Here is a Java app which illustrates this:

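public class DoubleEncoding {
    public static void main(String[] args) throws Exception {
        // First encoding: the correct UTF-8 bytes, e4 bd a0 e5 a5 bd e5 90 97.
        // The source file must be compiled as UTF-8 for this literal to survive.
        byte[] encoding1 = "你好吗".getBytes("UTF-8");
        for (byte b : encoding1) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
        // Misread those bytes as ISO-8859-1, so each byte becomes one character.
        String string1 = new String(encoding1, "ISO8859-1");
        // Second encoding: each of those characters now takes two UTF-8 bytes.
        byte[] encoding2 = string1.getBytes("UTF-8");
        for (byte b : encoding2) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}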
samabcde
Ray Toal
  • Hi Ray, do you have sample coding to produce the problem? I tried String chinese = new String("你好吗".getBytes("UTF-8")); String chineseAgain = new String(chinese.getBytes("UTF-8")); System.out.println(byteArrayToHexString(chineseAgain.getBytes("UTF-8"))); But I can't replicate the problem. – Maurice Jul 27 '11 at 01:31
  • I can write you one real quick. But FYI I think I wasn't clear enough. The E4 which is the first byte in the first UTF-8 encoding is interpreted as a *codepoint*, not a hex string, in the second encoding. Does that help? – Ray Toal Jul 27 '11 at 01:41
  • @Maurice I added the concrete example to my answer for you. Hope it helps! – Ray Toal Jul 27 '11 at 01:50
  • Thanks Ray! You are a lifesaver, I wanna vote for this but I can't register for some reason. Just a last question. If I receive the C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 as a String, what's the proper way to format this back to UTF-8 so I can save 你好吗 into my database as UTF-8? – Maurice Jul 27 '11 at 01:54
  • Just reverse the steps above: First _parse_ `C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297` to the byte array `[\xc3, \xa4, \xc2, ...]` then do `new String(byteArray, "UTF-8")` then encode using ISO8859-1 then `new String` that with UTF-8. – Ray Toal Jul 27 '11 at 02:24
  • Thank you very much! I used this and I got the string back. String test = new String(encoding2,"UTF-8"); byte[]test2 = test.getBytes("ISO8859-1"); System.out.println("Result: " + new String(test2,"UTF-8")); – Maurice Jul 27 '11 at 02:29
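Putting the comments above together, the repair could be sketched as follows. The class name `MojibakeRepair` and the helpers `parseHex` and `repair` are hypothetical, and the sketch assumes the intermediate misreading really was ISO-8859-1; if the sender used a different single-byte charset (windows-1252, say), some bytes would not round-trip.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Hypothetical helper: turns "C3A4C2 BDC2A0 ..." into raw bytes, ignoring whitespace.
    static byte[] parseHex(String hex) {
        String compact = hex.replaceAll("\\s+", "");
        if (compact.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[compact.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(compact.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Undoes one round of double encoding, assuming ISO-8859-1 was the wrong charset.
    static String repair(byte[] doubleEncoded) {
        // Decode the outer UTF-8 layer: one character per original byte.
        String mojibake = new String(doubleEncoded, StandardCharsets.UTF_8);
        // Map each character back to the byte it came from.
        byte[] original = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        // Decode the original UTF-8 bytes properly.
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fixed = repair(parseHex("C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297"));
        // Compare against 你好吗 written as escapes, so source-file encoding does not matter.
        System.out.println(fixed.equals("\u4f60\u597d\u5417"));
    }
}
```

If the double-encoded text is already held in a Java String rather than a hex dump, only the last two steps of `repair` are needed, which is what the final comment above does.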
0
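public class Encoder {
    public static void main(String[] args) throws Exception {
        String requestString = "你好";
        // Carry the GB2312 bytes inside a Latin-1 string, one character per byte.
        String iso = new String(requestString.getBytes("gb2312"), "ISO8859-1");
        // Recover the original text by undoing both steps in reverse order.
        String plaintxt = new String(iso.getBytes("ISO8859-1"), "gb2312");
        // Encode the recovered text as UTF-8 and print the bytes in hex.
        for (byte b : plaintxt.getBytes("UTF-8")) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}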