I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory, as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB of memory, the memory overhead associated with Java `String`s is causing us to run out of space. Note that each `String` has a 38-byte overhead associated with it, which gives up to a 5MB overhead across the dictionary.
At first sight, one option is to substitute `char[]` for `String` (or even `byte[]`, as UTF-8 is more compact in this case). But again, the memory overhead is an issue: each Java array has a 32-byte overhead.
One alternative to `ArrayList<String>` etc. is to create a class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single `byte[]`, and then stores offsets into that huge string. Each offset would take up 4 bytes, giving a much more space-efficient solution.
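To make the idea concrete, here is a minimal sketch of such a class (the name `CompactStringStore` and its interface are my own invention, not from any library): all words are UTF-8 encoded into one `byte[]`, with an `int[]` of start offsets, so the only per-word cost is the 4-byte offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch: stores many strings as one UTF-8 byte[] plus an int[] of offsets. */
final class CompactStringStore {
    private final byte[] data;   // all words concatenated, UTF-8 encoded
    private final int[] offsets; // offsets[i] = start of word i; offsets[n] = data.length

    CompactStringStore(List<String> words) {
        offsets = new int[words.size() + 1];
        byte[][] encoded = new byte[words.size()][];
        int total = 0, i = 0;
        for (String w : words) {
            encoded[i] = w.getBytes(StandardCharsets.UTF_8);
            offsets[i] = total;
            total += encoded[i].length;
            i++;
        }
        offsets[i] = total;
        data = new byte[total];
        for (int j = 0; j < encoded.length; j++) {
            System.arraycopy(encoded[j], 0, data, offsets[j], encoded[j].length);
        }
    }

    int size() {
        return offsets.length - 1;
    }

    /** Decodes word i back into a String on demand. */
    String get(int i) {
        int start = offsets[i];
        return new String(data, start, offsets[i + 1] - start, StandardCharsets.UTF_8);
    }
}
```

Note that only two objects exist regardless of dictionary size, so the 32-byte array overhead is paid twice rather than 120,000 times; if the words are stored in sorted order, lookups can binary-search by decoding the midpoint word on each probe.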
My questions are: a) are there any other solutions to the problem with similarly low overheads,* and b) is any such solution available off the shelf? Searching through the Guava, Trove and PCJ collection libraries yields nothing.
*I know one can get the overhead down below 4 bytes, but there are diminishing returns.
NB: "Support for Compressed Strings being Dropped in HotSpot JVM?" suggests that the JVM option `-XX:+UseCompressedStrings` isn't going to help here.