-1

I call a library method that returns an object of type Serializable. In most cases the value is a simple String, so I typecast the returned value to String. I do the following to retrieve the String:

    String val = (String)data.get("MyString");

There is a problem, though, when the retrieved String contains non-ASCII characters. For example, for 'Køllert' the value that is returned is displayed as 'KxF8llert'. The 'ø' is replaced with xF8, which is the corresponding Unicode hex value.

When I print out the value as bytes, the character prints as -8.

    byte[] defaultBytes = val.getBytes();
    for(int ii=0; ii<defaultBytes.length; ii++) print((int)defaultBytes[ii]);

Is there a way to 'clean' the returned string so that it prints as standard Unicode and the character is displayed correctly?

Edit

When I input the actual string as follows, the string can be correctly printed and when the bytes are examined, the character takes up two bytes with the integer values -61 and -72. Maybe it is returning UTF-8 instead of Unicode?

    String val1 = "Køllert";
    byte[] defaultBytes1 = val1.getBytes();
    for(int ii=0; ii<defaultBytes1.length; ii++) print((int)defaultBytes1[ii]);
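The two byte patterns described above (-8 versus -61, -72) match what the same character produces under two different encodings. A minimal sketch, assuming the single-byte case corresponds to ISO-8859-1 (the question does not say which charset is actually involved); the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the library in the question.
class EncodingBytesDemo {

    /** Encodes text as ISO-8859-1, where 'ø' (U+00F8) is the single byte 0xF8. */
    static byte[] latin1Bytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Encodes text as UTF-8, where 'ø' becomes the two bytes 0xC3 0xB8. */
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Køllert", written with an escape so the source file encoding does not matter.
        String text = "K\u00F8llert";

        // prints [75, -8, 108, 108, 101, 114, 116]
        System.out.println(Arrays.toString(latin1Bytes(text)));

        // prints [75, -61, -72, 108, 108, 101, 114, 116]
        System.out.println(Arrays.toString(utf8Bytes(text)));
    }
}
```

Java bytes are signed, so 0xF8 prints as -8, and 0xC3 0xB8 print as -61 and -72.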

Solution

Sorry that the question may have been vague. The following seems to work for me. It's not so complicated, but it had me spinning.

    String val = new String(data.get("MyString").getBytes("UTF-8"));
George Hernando
  • See [UTF-8 byte to String](https://stackoverflow.com/questions/8512121/utf-8-byte-to-string). – rossum Jun 16 '20 at 22:19
  • No. That doesn't do anything. The string is unchanged. I edited the original text of the question with more information. – George Hernando Jun 16 '20 at 22:38
  • To help us help you, could you provide an equivalent `byte[]` so we can understand your problem and test our ideas? I.e., add to your question something like `byte[] bytes = {'K', -8, 'l', 'l', 'e', 'r', 't'};` or whatever your case is. Be sure it's correct though! – Bohemian Jun 16 '20 at 23:18
  • I have only read the first part of the question (before the edit), and it's quite vague. What prints -8? What do you mean by `value that is returned..`? Please be more specific. – Giorgi Tsiklauri Jun 16 '20 at 23:33

2 Answers

1

> Maybe it is returning UTF-8 instead of Unicode?

Serialization produces a byte stream. The obvious, economical, and non-lossy way to turn a Java string, which is a sequence of Unicode characters stored as UTF-16, into a byte stream is to convert it to a sequence of Unicode characters stored as UTF-8.

(UTF-16 and UTF-8 are equally valid representations of Unicode)

Given that there is a conversion of a String into the serialized form, you can't skip the reverse conversion of the serialized form into a String.

Why isn't there a reverse conversion in whatever you used to do the serialization?

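If we're right in guessing that the serialized form is UTF-8, then to convert it to a String you decode the raw bytes with an explicit charset, e.g. `new String(bytes, StandardCharsets.UTF_8)`, where `bytes` is whatever byte array the serialized data gives you. If it's not UTF-8, then it's the internal business of the serializing code, and presumably it offers a complementary deserializer.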

Regardless, you can't do data conversion by just claiming that what you have is already a String (which is what a cast is).
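To illustrate the reverse conversion, here is a minimal sketch that assumes the serialized form really is a UTF-8 byte array. The byte array and the class and method names are hypothetical; nothing here comes from the library in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for decoding raw bytes into a String.
class DecodeDemo {

    /** Decodes the bytes as UTF-8, the assumed encoding of the serialized form. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, to show what a wrong guess does. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Køllert" encoded as UTF-8: 'ø' is the two bytes 0xC3 0xB8.
        byte[] raw = {'K', (byte) 0xC3, (byte) 0xB8, 'l', 'l', 'e', 'r', 't'};

        String right = decodeUtf8(raw);
        System.out.println(right.equals("K\u00F8llert")); // prints true
        System.out.println(right.length());               // prints 7

        // With the wrong charset, each of the two bytes becomes its own character.
        String wrong = decodeLatin1(raw);
        System.out.println(wrong.length());                // prints 8
    }
}
```

The charset passed to the constructor has to match the one used when the bytes were produced; a cast cannot do that work because it never looks at the bytes.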

1

I'll post my comment as an answer, as it seems to have been helpful.

As I said in the comment above, you need to know beforehand which encoding the bytes in your byte array use.

Hence, instead of `stringObject.getBytes()`, which encodes your string into a sequence of bytes using the platform's default charset and stores the result in a new byte array,

you might want to use

`stringObject.getBytes("character-encoding")`, which encodes your string into a sequence of bytes using the given character encoding and stores the result in a new byte array.

It seems that you should have used the second version, as it encodes your string with the encoding you specify rather than the platform default.
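The two ways of naming the encoding can be sketched as follows. This is only an illustration with made-up class and method names; UTF-8 is picked as an example, not because the question confirms it.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the getBytes overloads.
class GetBytesOverloads {

    /** Uses getBytes(String), which throws a checked exception for unknown names. */
    static byte[] encodeByName(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    /** Uses getBytes(Charset), which needs no exception handling. */
    static byte[] encodeByCharset(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "K\u00F8llert"; // "Køllert"

        // The no-argument overload depends on this value, so its result can vary by machine.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(text.getBytes().length);

        // Both explicit forms yield 8 bytes for UTF-8, regardless of the platform.
        System.out.println(encodeByName(text, "UTF-8").length);
        System.out.println(encodeByCharset(text, StandardCharsets.UTF_8).length);
    }
}
```

The `Charset` overload is usually more convenient because the constants in `StandardCharsets` cannot be misspelled at runtime.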

Giorgi Tsiklauri