
I ran into an issue that I can't figure out. I have some data in a Blob column in a Db2/Linux environment. The Blob was written into Db2 after the byte[] was compressed using JDK compression (the code that does this runs in the Linux environment). I am trying to write a simple program that reads some of this data, decompresses it (using the JDK), and creates a String from the decompressed byte array in a Windows environment (my development environment).

The issue is that after I decompress the Blob (byte[]), the decompressed byte array is usually 1-3 bytes longer than expected. By "expected" I mean that offset and length fields are also stored in the database, and the decompressed byte array is usually a few bytes longer than the stored length.

So if I create a String from the decompressed byte array and then create a second String with substring(offset, length), using the offset and length fields from the database, the second String is shorter than the first.

For example, one database record contains a blob with offset 0 and length 260,409. After decompressing the blob:

 compressedByte[].length  - 71,212
 decompressedByte[].length   - 260,412
 new String(decompressedByte[]).length()  - 260,412
 new String(decompressedByte[]).substring(0, 260409).length() - 260,409

For some other input records, the difference I am seeing is anywhere between 1-3 bytes in length.

I am puzzled by this and wondering if anyone could suggest tips for debugging it further. Could this be related to how the bytes are written in the Linux environment and how they are read in Windows? Thanks for your help.

John

2 Answers


I suspect the default encoding is different between the two systems.

// on the linux box   
byte [] blob = str.getBytes("UTF-8");

// in your code 
String str = new String(blob, "UTF-8");

Or at the very least, find out what the default encoding on the Linux box is (normally UTF-8), skip step 1, and pass that encoding explicitly when you build the String.

A really good explanation of what could be happening here is on Joel on Software.
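To illustrate why the decoded length depends on the charset, here is a minimal sketch. The class name `CharsetLengthDemo`, the helper `decodedLength`, and the sample text are hypothetical, and ISO-8859-1 only stands in for a single-byte Windows default encoding. It assumes the text was encoded as UTF-8 on the Linux side, which the question does not state outright. If the stored length counts characters, a few multi-byte characters would explain a gap of 1-3 between the byte count and the expected length.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLengthDemo {
    // Decodes the same bytes with the given charset and returns the String length.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // Hypothetical sample text: 7 characters, two of them non-ASCII.
        String original = "caf\u00e9 \u20ac5";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println("chars in original:   " + original.length());
        System.out.println("bytes in UTF-8:      " + utf8.length);
        // Decoding with the charset used for encoding restores the character count.
        System.out.println("decoded as UTF-8:    " + decodedLength(utf8, StandardCharsets.UTF_8));
        // A single-byte charset yields one char per byte, so the String comes out longer.
        System.out.println("decoded as Latin-1:  " + decodedLength(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

With a single-byte charset the String length always equals the byte count, which matches the 260,412 versus 260,409 numbers in the question.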

Gareth Davis
  • Yup, this was it, it was encoded differently. Thanks for your answer. Using new String(byte[], "UTF-8") fixed the problem. I will read that article tonight - looks like there is a ton of good information in there. Too bad I can't vote up this answer since I don't have enough reputation yet. – John Jan 06 '11 at 15:39
  • You can upvote on your own question (I think); you can also click that big tick to mark this answer as the accepted answer. Joel's article is total gold and basically required reading. – Gareth Davis Jan 06 '11 at 15:43
  • Be aware that there are plenty of byte sequences which cannot decode into a character using the UTF-8 character encoding. You'd be better off using US-ASCII which will do a direct 1-to-1 mapping. Or... don't use String as a holder for bytes – dty Jan 06 '11 at 15:52
  • Agreed, UTF-8 may not be the most suitable encoding on every occasion, but if you are converting a String to an array of bytes I strongly recommend not using US-ASCII unless you are certain that the String doesn't contain any characters above the 128 mark. Note that for such a string the UTF-8 output is identical anyhow. – Gareth Davis Jan 06 '11 at 16:01
  • Depends whether the original data was a String or not. If you've started with byte[] (e.g. JPG, or whatever), don't go near a String, or, if you must, use ISO-8859-1 (I didn't really mean US-ASCII above - you're right, that's a 7-bit charset). If you've started with a String and are trying to go via a byte[], then use UTF-8. – dty Jan 06 '11 at 16:18
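The point raised in the comments above, that arbitrary bytes do not survive a trip through a UTF-8 String while ISO-8859-1 maps every byte to exactly one char, can be checked with a small sketch. The class `ByteRoundTrip`, the helper `survives`, and the sample bytes are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteRoundTrip {
    // Returns true if bytes -> String -> bytes gives back the original bytes.
    static boolean survives(byte[] data, Charset charset) {
        byte[] roundTripped = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, roundTripped);
    }

    public static void main(String[] args) {
        // Hypothetical binary data (not text); 0xFF is never valid in UTF-8.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};

        // ISO-8859-1 maps each byte value to one char, so nothing is lost.
        System.out.println("ISO-8859-1 lossless: " + survives(binary, StandardCharsets.ISO_8859_1));
        // UTF-8 replaces malformed sequences with U+FFFD, so the bytes change.
        System.out.println("UTF-8 lossless:      " + survives(binary, StandardCharsets.UTF_8));
    }
}
```

This only matters when the original data was binary. For data that started life as a String, encoding and decoding with the same charset on both sides is enough.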

A String is not a general holder for bytes. Your Db2/Linux environment and your Windows environment almost certainly have different default character encodings, so converting between bytes and characters gives different results on each side.
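As a rough sketch of keeping bytes as bytes until the last step, the following compresses and decompresses with an explicit charset on both sides. It assumes "JDK compression" means `java.util.zip.Deflater`/`Inflater` and that UTF-8 is the agreed encoding; neither is confirmed by the question. The class `BlobRoundTrip` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class BlobRoundTrip {
    // Writer side (assumed): encode with an explicit charset, then deflate.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reader side: inflate, then decode with the same explicit charset.
    static String decompress(byte[] blob) {
        Inflater inflater = new Inflater();
        inflater.setInput(blob);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported compressed data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("blob is not valid deflate data", e);
        } finally {
            inflater.end();
        }
        // Never rely on the platform default here; it differs between machines.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "na\u00efve r\u00e9sum\u00e9"; // hypothetical sample text
        String restored = decompress(compress(original));
        System.out.println("round trip equal: " + restored.equals(original));
        System.out.println("chars: " + restored.length()
                + ", UTF-8 bytes: " + original.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Because both sides name the charset, the result does not depend on which machine runs the code.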

dty