Let's say I have a byte array and I try to encode it to UTF_8 using the following

String tekst = new String(result2, StandardCharsets.UTF_8);
System.out.println(tekst);
//where result2 is the byte array

Then, I get the bytes using getBytes() with values from 0 to 128

byte[] orig = tekst.getBytes();

And then I wish to do a frequency count of my byte[] orig using the following:

int[] frequencies = new int[256];

for (byte b: orig){
    frequencies[b]++;
}

Everything goes well till I encounter an error which states

java.lang.ArrayIndexOutOfBoundsException: -61

Does that mean that my byte still contains negative values despite converting it to UTF-8? Is there something wrong with what I'm doing? Can someone please give me some clarity on this, because I'm still a beginner on the subject. Thank you.

markdid
  • This is essentially a duplicate of http://stackoverflow.com/questions/3621067/why-is-the-range-of-bytes-128-to-127-in-java – Oleg Estekhin May 10 '17 at 08:40
  • "Encode it to UTF_8" implies the data is text. "Have a byte array" implies it is encoded with some character encoding. If so, which? Or, as Jon Skeet points out, if it is not yet text, you should first convert it to text using Base64 or similar. – Tom Blodget May 10 '17 at 22:25

1 Answer

Answering the specific question

Does that mean that my byte still contains negative values despite converting it to UTF-8?

Yes, absolutely. That's because byte is signed in Java. A byte value of -61 would be 195 as an unsigned value. You should expect to get bytes which aren't in the range 0-127 when you encode any non-ASCII text with UTF-8.

The fix is easy: use a bit mask so the byte is treated as an unsigned value in the range 0-255 when it is used as an index:

frequencies[b & 0xff]++;
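
As a minimal sketch of the whole loop with the mask applied (the class name ByteFrequency, the method name countFrequencies and the sample string are made up for illustration, not taken from the question):

```java
import java.nio.charset.StandardCharsets;

public class ByteFrequency {
    // Counts how often each unsigned byte value (0-255) occurs in data.
    static int[] countFrequencies(byte[] data) {
        int[] frequencies = new int[256];
        for (byte b : data) {
            // b is signed (-128..127); the mask yields an int in 0..255.
            frequencies[b & 0xff]++;
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded in UTF-8 as 0xC3 0xA9.
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(data[0]);        // -61, the index from the exception
        System.out.println(data[0] & 0xff); // 195, the same byte read as unsigned
        int[] frequencies = countFrequencies(data);
        System.out.println(frequencies[195]); // 1
    }
}
```

On Java 8 or later, Byte.toUnsignedInt(b) does the same conversion as b & 0xff, if you find that more readable.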

Addressing what you're attempting to do

This line:

String tekst = new String(result2, StandardCharsets.UTF_8);

... is only appropriate if result2 is genuinely UTF-8-encoded text. It's not appropriate if result2 is some arbitrary binary data such as an image, compressed data, or even text encoded in some other encoding.

If you want to preserve arbitrary binary data as a string, you should use something like Base64 or hex. Basically, you need to determine whether your data is inherently textual (in which case, you should use strings for as much of the time as possible, and use an appropriate Charset to convert to binary where necessary) or inherently binary (in which case you should use bytes for as much of the time as possible, and use base64 or hex to convert to text where necessary).
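
To illustrate the difference, here is a rough sketch that round-trips some arbitrary bytes through Base64 and, for contrast, through a UTF-8 string. The class name BinaryAsText, the helper names and the sample bytes are invented for this example; the sample is deliberately not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryAsText {
    // Represents arbitrary bytes as text without losing information.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Decodes the bytes as if they were UTF-8 text, then encodes them again.
    static byte[] throughUtf8String(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Arbitrary binary data that is not well-formed UTF-8.
        byte[] binary = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00};

        byte[] viaBase64 = fromText(toText(binary));
        System.out.println(Arrays.equals(binary, viaBase64)); // true

        // Malformed sequences are replaced while decoding, so data is lost.
        byte[] viaString = throughUtf8String(binary);
        System.out.println(Arrays.equals(binary, viaString)); // false
    }
}
```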

This line:

byte[] orig = tekst.getBytes();

... is almost always a bad idea. It uses the platform-default encoding to convert a string to bytes. If you really, really want to use the platform-default encoding, I would make that explicit:

byte[] orig = tekst.getBytes(Charset.defaultCharset());

... but this is an extremely unusual requirement these days. It's almost always better to stick to UTF-8 everywhere.
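
For text that really is UTF-8, a small sketch of naming the charset in both directions might look like this (the class name ExplicitCharset, the helper names and the sample string are only illustrative):

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // String -> bytes, always UTF-8 regardless of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String, assuming the bytes really are UTF-8-encoded text.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String tekst = "caf\u00e9"; // four characters, one of them non-ASCII
        byte[] orig = encode(tekst);
        System.out.println(orig.length);                // 5: the accented letter takes two bytes
        System.out.println(decode(orig).equals(tekst)); // true
    }
}
```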

Jon Skeet