UTF-8 String class for Java

Question

I need to hold lots of string objects in memory (hundreds of MB), and I want to hold them in UTF-8 format, since in most cases that will require half the memory the default implementation uses.
The default String class requires 60 bytes for a 12-character string (see http://blog.griddynamics.com/2010/01/java-tricks-reducing-memory-consumption.html).
Most of my Strings are 10-20 characters long.
Is there an open source library that offers a wrapper for such strings?
I know how to convert a String to a UTF-8 byte array, but I'm looking for a wrapper class that provides all the needed utility functions (hashCode, equals, toString, fromString, etc.).

http://docs.oracle.com/javase/tutorial/i18n/text/string.html — tckmn, Jan 09 '13 at 14:05
Java stores all strings internally in UTF-16, so your 12-character strings are 24 bytes internally. Not counting the obligatory object overhead, where does that 60-byte figure come from? — fge, Jan 09 '13 at 14:06
...minimum 24 bytes, as UTF encodings are variable length (granted, you'd have to use some seriously exotic characters to exceed 24 bytes in the OP's example) — Anders R. Bystrup, Jan 09 '13 at 14:23
Define "lots". Are you talking megabytes or gigabytes? And how big are your strings? Unless you're talking gigabytes of long strings, you won't find the savings you're expecting (I've been there). Depending on your application, canonicalization might be a better choice. — parsifal, Jan 09 '13 at 14:32
There was a `UseCompressedStrings` JVM option in some Sun JVM versions but I believe [it was removed](http://stackoverflow.com/questions/8833385) in Java 7. It might be available if you're on an earlier version. — McDowell, Jan 09 '13 at 14:42
Your ability to save memory will depend upon how static the data is. The 60-byte figure comes from overhead due to manipulating the strings, and waste from the inability to clean up. The String class was optimized to allow for efficient "substring" manipulation. This waste is inherent in wanting these methods. You can save this overhead by carefully restricting the operations on your desired new string class. But you need to be clear on what you need. — AgilePro, Jan 09 '13 at 15:26
to Burnt Too Many Times: Isn't the 60-byte figure true if I do `String test = new String("123456789012");`? — Avner Levy, Jan 10 '13 at 16:13
@fge - In the standard implementation there is a double object overhead for String -- a String object header and then a `char[]` object header. This can be optimized down, but Sun wasn't doing it last I saw (though some IBM JVMs were, to considerable advantage). — Hot Licks, Jan 10 '13 at 21:46
@McDowell - I don't think the UseCompressedStrings option was ever really implemented. — Hot Licks, Jan 10 '13 at 21:47

Grooveek · Accepted Answer · 2013-01-09T15:39:53.913

2

Apache Avro has a `Utf8` wrapper class which implements CharSequence, but I don't know the memory consumption of such objects.

Hadoop has the Text class, which has much the kind of interface you're looking for.

edited Jan 09 '13 at 15:39

answered Jan 09 '13 at 15:13


did you mean to make both links the same? – AgilePro Jan 09 '13 at 15:27

score 0 · Answer 2 · answered Jan 10 '13 at 21:48

If you want a distinct object for each string and you want them as compact as possible, then use byte arrays. For ASCII text that will be 1 byte per char vs. 2, and you won't have the overhead of the String header (which adds probably 32 bytes per object).

But of course you wouldn't be able to use any String methods on these without first converting to String.
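
To make the byte-array approach a bit more usable, here is a minimal sketch of a wrapper with the operations the question asks for. The class name `Utf8String` and its methods are hypothetical, not from any library, and it assumes Java 7+ for `StandardCharsets`. Note that each wrapper instance adds its own object header on top of the array, so it gives back some of the savings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical immutable wrapper around UTF-8 encoded bytes (illustration only). */
final class Utf8String implements Comparable<Utf8String> {
    private final byte[] bytes;

    private Utf8String(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Encodes the given String as UTF-8 and wraps the result. */
    static Utf8String fromString(String s) {
        return new Utf8String(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Number of encoded bytes, which may exceed the number of chars. */
    int byteLength() {
        return bytes.length;
    }

    /** Decodes back to a regular String; this allocates a new object each call. */
    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8String && Arrays.equals(bytes, ((Utf8String) o).bytes);
    }

    /** Not cached, to avoid an extra int field per instance. */
    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    /** Unsigned byte-wise comparison of the encoded forms. */
    @Override
    public int compareTo(Utf8String other) {
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int d = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return bytes.length - other.bytes.length;
    }
}
```

Usage would look like `Utf8String key = Utf8String.fromString("123456789012");`, with `key.toString()` called only when a real String is needed.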

But if you really want to save space, store the strings back-to-back in a few larger arrays, with "dope vectors" to locate the individual strings.
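
A rough sketch of that back-to-back layout, assuming an append-only store and using an `int[]` of offsets as the "dope vector". The class `PackedStrings` and its methods are hypothetical names for illustration. A single Java array holds at most about 2 GB, so a real version would spread the data over several such stores.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical append-only store: all strings share one byte[] plus an offsets table. */
final class PackedStrings {
    private byte[] data = new byte[64];
    // offsets[i] is the start of string i; offsets[count] is the end of the used data.
    private int[] offsets = new int[16];
    private int count = 0;

    /** Appends the UTF-8 encoding of s and returns its index. */
    int add(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        int start = offsets[count];
        int end = start + encoded.length;
        if (end > data.length) {
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        if (count + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        System.arraycopy(encoded, 0, data, start, encoded.length);
        offsets[++count] = end;
        return count - 1;
    }

    /** Decodes string number index into a new String. */
    String get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int start = offsets[index];
        return new String(data, start, offsets[index + 1] - start, StandardCharsets.UTF_8);
    }

    int size() {
        return count;
    }
}
```

With this layout the per-string cost is the encoded bytes plus one 4-byte offset, and callers keep an `int` index instead of an object reference.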
