
I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB memory, the memory overhead associated with Java Strings is causing us to run out of space. Note that each String has a 38 byte overhead associated with it, which gives up to a 5MB overhead.

At first sight, one option is to substitute char[] for String. (Or even byte[], as UTF-8 is more compact in this case.) But again, the memory overhead is an issue: each Java array has a 32 byte overhead.

One alternative to ArrayList<String>, etc. is to create a class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single byte[], and then stores offsets into that huge string. Each offset would take up 4 bytes, giving a much more space-efficient solution.
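As a rough illustration of that idea, here is a minimal sketch (the class name `CompactStringPool` is made up for this example): all words are concatenated into one UTF-8 `byte[]`, and an `int[]` of start offsets, with a sentinel at the end, recovers each word's position and length.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of a compact string pool: one big byte[] of UTF-8 data plus
// an int[] of offsets. Word i occupies bytes offsets[i]..offsets[i+1].
class CompactStringPool {
    private final byte[] data;
    private final int[] offsets; // length = word count + 1 (sentinel)

    CompactStringPool(List<String> words) {
        offsets = new int[words.size() + 1];
        byte[][] encoded = new byte[words.size()][];
        int total = 0;
        for (int i = 0; i < words.size(); i++) {
            encoded[i] = words.get(i).getBytes(StandardCharsets.UTF_8);
            offsets[i] = total;
            total += encoded[i].length;
        }
        offsets[words.size()] = total; // sentinel gives the last word's length
        data = new byte[total];
        for (int i = 0; i < encoded.length; i++) {
            System.arraycopy(encoded[i], 0, data, offsets[i], encoded[i].length);
        }
    }

    int size() {
        return offsets.length - 1;
    }

    String get(int i) {
        // Decode word i on demand; only the offsets and raw bytes stay resident.
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                          StandardCharsets.UTF_8);
    }
}
```

For 120,000 words this costs one array header per structure rather than per word, at the price of materialising a fresh `String` on each `get`.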

My questions are a) are there any other solutions to the problem with similarly low overheads* and b) is any solution available off-the-shelf? Searching through the Guava, Trove and PCJ collection libraries yields nothing.

*I know one can get the overhead down below 4 bytes, but there are diminishing returns.

NB. Support for Compressed Strings being Dropped in HotSpot JVM? suggests that the JVM option -XX:+UseCompressedStrings isn't going to help here.

  • An array can only have 2^31-1 = 2.1G entries, maybe too small for you? – maraca Jun 29 '15 at 17:45
  • 1
    No... a word typically takes up ~10 bytes, so the whole structure will fit in ~1MB. (~1.5MB inc. overhead.) – Mohan Jun 29 '15 at 17:48
  • Do you really need to keep all the strings in memory? Probably you can keep some index and effectively load the necessary part from the file? What's your original task? How do you use these strings? – Tagir Valeev Jun 29 '15 at 17:48
  • Have you thought about using an in-memory database, such as H2? – MadConan Jun 29 '15 at 17:49
  • Btw, compressed strings may come back in OpenJDK 9 or 10: see [this JEP](http://openjdk.java.net/jeps/8054307). – Tagir Valeev Jun 29 '15 at 17:50
  • @Tagir Valeev: yes, I really do. I've measured. In fact, the time to process each string is a fair bit less than the time of a L2 cache miss. – Mohan Jun 29 '15 at 17:51
  • If space and speed of retrieval are really important, you should consider a Directed Acyclic Word Graph (DAWG) data structure to store your words. This data structure essentially keeps all the common prefixes and suffixes for your lexicon and stores them just once where they can be referenced by all the words that use them. – Michael Krause Jun 29 '15 at 17:53
  • For readers, the question is targeted to Android devices. Suggestions like using Java 8 or 9 should not be considered. – Luiggi Mendoza Jun 29 '15 at 17:54
  • Could you elaborate more about how you are going to use this structure? E.g. search for closest word with hamming distance or just see if a word exists in the dictionary? – maraca Jun 29 '15 at 17:55
  • @MadConan: Having read all of http://www.h2database.com/html/performance.html, I can't find any data on the per-string memory overhead. But the b-trees mentioned seem likely to come with substantial space overheads of their own. Databases seem too heavyweight for this case. – Mohan Jun 29 '15 at 17:56
  • @Mohan have you considered using a [trie](https://en.wikipedia.org/wiki/Trie) rather than lots of `String`s? – Luiggi Mendoza Jun 29 '15 at 17:58
  • @Luiggi Mendoza: a single pointer on a 64-bit machine (8 bytes) takes up almost as much space as a typical English word (~10 bytes). So AFAICS the overheads would drown the savings. Plus 100,000 words is small enough that about 40% of the tail of a string doesn't overlap with the earlier parts. – Mohan Jun 29 '15 at 18:02
  • @Michael Krause: sorry, missed yr comment -- but see my last comment on pointer size. – Mohan Jun 29 '15 at 18:03
  • @Mohan: As you mentioned, the size of pointers on a 64 bit architecture would add overhead. But, if you're careful about how you build the data structure, you could load the structure in such a way that it is all within a contiguous block of memory and those references to other nodes could be relative offsets within the memory block which could be smaller than the 8 bytes you would need if the nodes are allowed to exist anywhere in memory. Another benefit of using a Trie or a DAWG is it opens up the opportunity to do fuzzy word matching. – Michael Krause Jun 29 '15 at 18:15
  • If you take an array then you also have to have a delimiter or also save the length of the word and not only the index. With the latter approach the array could be compressed: e.g. the word woman already has the words man, an, a already in it, so if they are concatenated wisely so that they are overlapping you could save quite a bit... – maraca Jun 29 '15 at 18:18
  • @maraca: comparing the offsets of words n and n+1 gives you the length, if you care about saving 1 byte/word. – Mohan Jun 29 '15 at 18:41
  • As a general comment: it's not so much that I'm trying to save space relative to the array. This just seems like such a standard problem that I'm _very_ surprised there aren't off-the-shelf solutions (whether array or something else)... – Mohan Jun 29 '15 at 18:43
  • I guess your solution is the optimal one. An `ArrayList` is nothing but an array of pointers; now you are just managing your own pointers. – ZhongYu Jun 29 '15 at 19:32
  • I think this is a duplicate of http://stackoverflow.com/questions/3228075/hashset-of-strings-taking-up-too-much-memory-suggestions/. See also http://stackoverflow.com/questions/518936/the-best-way-to-store-and-access-120-000-words-in-java – Andy Thomas Jun 29 '15 at 19:38

1 Answer


I had to develop a word dictionary for a class project. We ended up using a trie as the data structure. I'm not sure of the size difference between an ArrayList and a trie, but the performance is a lot better.

Here are some resources that could be helpful.

https://en.wikipedia.org/wiki/Trie

https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
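For concreteness, here is a minimal trie sketch (class and method names are illustrative, not from the project mentioned above). Note that a map-per-node trie like this trades memory for lookup speed, so it addresses the performance side rather than the memory overhead raised in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: each node maps a character to a child node,
// and a flag marks nodes where a complete word ends.
class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

Lookup cost is proportional to the word's length rather than the dictionary's size, and shared prefixes are stored once, which is why tries (and the DAWGs mentioned in the comments) are common for spellchecking.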
