
I would like to make a "compressed array"/"compressed vector" class (details below) that allows random data access in more or less constant time.

"more or less constant time" means that although element access time isn't constant, it shouldn't keep increasing when I get closer to certain point of the array. I.e. container shouldn't do significantly more calculations (like "decompress everything once again to get last element", and "do almost nothing to get the first") to get one element. Can be probably achieved by splitting array into chunks of compressed data. I.e. accessing one element should take "averageTime" +- some deviation. I could say that I want best-case access time and worst-case access time to be relatively close to average access time.

What are my options (suitable algorithms/already available containers - if there are any)?

Container details:

  1. The container acts as a linear array of elements of a single type (such as std::vector).
  2. Once the container is initialized, the data is constant and never changes. The container needs to provide read-only access.
  3. The container should behave like an array/std::vector, i.e. values are accessed via operator[], there is a .size(), etc.
  4. It would be nice if I could make it a template class.
  5. Access to the data should be more or less constant-time. I don't need the same access time for every element, but I shouldn't have to decompress everything to get the last element.

Usage example:
Binary search on data.

Data details:
1. The data consists of structs made up mostly of floats and a few ints. There are more floats than ints. No strings.
2. It is unlikely that there are many identical elements in the array, so simply indexing the data won't be possible.
3. The size of one element is less than 100 bytes.
4. Total data size per container is between a few kilobytes and a few megabytes.
5. The data is not sparse - it is a contiguous block of elements, all of them are assigned, there are no "empty slots".

The goal of compression is to reduce the amount of RAM the block takes compared to its uncompressed representation as an array, while keeping somewhat reasonable read access performance and allowing random access to elements as in an array. I.e. the data should be stored in compressed form internally, and I should be able to access it (read-only) as if it were a std::vector or similar container.
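
To make the intent concrete, here is a rough sketch of the interface I have in mind (the class name and details are just placeholders, not an existing library):

    #include <cstddef>

    // Placeholder sketch of the interface I'm after; not an existing class.
    // Data is stored compressed internally; operator[] should only touch
    // the compressed chunk that contains element i.
    template <typename T>
    class compressed_vector {
    public:
        // Build from an existing uncompressed array; read-only afterwards.
        compressed_vector(const T* data, std::size_t count);

        std::size_t size() const;

        // Returns element i by value (or via a proxy).
        T operator[](std::size_t i) const;
    };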

Ideas/Opinions?

SigTerm
  • What is "more or less" constant time? Either it is constant or it is not. Otherwise interesting question. Are you sure you can't do what you want with the many existing container classes? – ereOn Aug 06 '10 at 13:21
  • where does the "compressed" part enter into it? You never explain that part. Could you just use a vector of pointers to gzipped blobs, or something like that? Or do you mean compressed as in you have a sparse dataset so a naive vector would have a lot of empty slots? – jalf Aug 06 '10 at 13:28
  • Also you say that elements are only floats and ints, and that one element never exceeds 100 bytes. Unless you work on some 800-bit architecture, you can pretty much omit the last requirement. – ereOn Aug 06 '10 at 13:34
  • @ereOn: "What is "more or less" constant time ?" Updated the question, see explanation. – SigTerm Aug 06 '10 at 13:37
  • @jalf: updated the question, see #5 in "data details". – SigTerm Aug 06 '10 at 13:39
  • I assume that by "more or less constant time", you mean what is formally referred to as "amortized constant time" (see http://en.wikipedia.org/wiki/Amortized_analysis) -- is that right? – Martin B Aug 06 '10 at 13:42
  • @Martin B.: I'm not sure if it describes what I want. I could say that I want best-case access time and worst-case access time to be relatively close to average access time. If you simply compress everything into one block, best-case/worst-case will be too far from average. – SigTerm Aug 06 '10 at 13:48
  • If the typical size of an entry is around 100 bytes, it should make sense to compress each element individually, but not compress the array as a whole. In other words, each array element would be a compressed representation of your struct instead of the uncompressed struct itself. The same coding table should probably be used for all array elements. – Martin B Aug 06 '10 at 13:49
  • You've indicated that there are very few identical elements; how well do you expect this data to (losslessly) compress? – meager Aug 06 '10 at 13:56
  • @Martin B: "it should make sense to compress each element individually". Makes sense, but using which algorithm? As I remember, zlib (for example) starts actually compressing things when data grows larger than 100 or so bytes. Also one element can be as small as 28 bytes. – SigTerm Aug 06 '10 at 13:57
  • Do the numbers in the structs tend to exhibit any useful properties when treated as a sequence? For example, if they were in ascending order I would suggest using delta-encoding (storing the difference between each number and the previous one, rather than storing the number itself) and then using variable-byte encoding on the deltas so that they took up less space. Lookup would be linear-time, but you could improve that by having every Mth number (for some reasonably small M) encoded normally rather than as a delta. I'm not sure if this is what you want, though. – David Aug 06 '10 at 13:59
  • @meagar: On data types where I would like to try that, there will be many elements that have 23..42% of identical bytes, but almost no absolutely identical elements. I cannot estimate how well it could compress. – SigTerm Aug 06 '10 at 14:03
  • @David: This is a very interesting idea, but I would prefer generic container. Besides, data may be generated by external program, and in this case, while there will be a pattern in how data changes, it may be difficult to find it out in a program. – SigTerm Aug 06 '10 at 14:10
  • @SigTerm Did a double-take when I read the total size of the container is up to "a few megabytes". Why are you worried about compressing this in the first place? – meagar Aug 06 '10 at 14:13
  • @SigTerm: Huffman coding or (patents permitting) arithmetic coding is what I would start with. Zlib likely only starts compressing at 100 bytes because it needs a certain amount of space to store its coding table, and the break-even point probably occurs around 100 bytes. In your case, you would be using the same coding table for all array elements, so that concern doesn't apply. Edit: See Cubbi's and Heinrich's answers for more detail on how this would work. – Martin B Aug 06 '10 at 14:19
  • @meager: "Why are you worried about compressing this in the first place?" Two reasons: 1. There will be many such blocks, data in them contains a bit of redundant information that still cannot be easily discarded, indexed, and so on. It just "asks" to be compressed. 2. I've been thinking about this problem (from purely theoretical point) for a while, so I would like to find out how this can be done. – SigTerm Aug 06 '10 at 14:25

5 Answers


I take it that you want an array whose elements are not stored vanilla, but compressed, to minimize memory usage.

Concerning compression, you have no exceptional insight about the structure of your data, so you're fine with some kind of standard entropy encoding. Ideally, you would like to run GZIP on your whole array and be done with it, but that would lose O(1) access, which is crucial to you.

A solution is to use Huffman coding together with an index table.

Huffman coding works by replacing each input symbol (for instance, an ASCII byte) with another symbol of variable bit length, depending on its frequency of occurrence in the whole stream. For instance, the character 'E' appears very often, so it gets a short bit sequence, while 'W' is rare and gets a long bit sequence.

E -> 0b10
W -> 0b11110

Now, compress your whole array with this method. Unfortunately, since the output symbols have variable length, you can no longer index your data as before: item number 15 is no longer at stream[15*sizeof(item)].

Fortunately, this problem can be solved by using an additional index table index that stores where each item starts in the compressed stream. In other words, the compressed data for item 15 can be found at stream[index[15]]; the index table accumulates the variable output lengths.

So, to get item 15, you simply start decompressing the bytes at stream[index[15]]. This works because Huffman coding doesn't do anything fancy to the output, it just concatenates the new code words, and you can start decoding inside the stream without having to decode all previous items.

Of course, the index table adds some overhead; you may want to tweak the granularity so that compressed data + index table is still smaller than original data.
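
A minimal sketch of how the pieces fit together (HuffmanTable and decode_item are hypothetical stand-ins for a real entropy coder; the index stores bit offsets because code words are not byte-aligned):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for a real Huffman implementation:
    struct HuffmanTable { /* shared code table for all items */ };
    template <typename T>
    T decode_item(const std::vector<std::uint8_t>& stream,
                  std::size_t bit_offset,
                  const HuffmanTable& table);   // decodes a single item

    template <typename T>
    struct compressed_array {
        std::vector<std::uint8_t> stream;  // concatenated code words
        std::vector<std::size_t>  index;   // index[i] = bit offset of item i
        HuffmanTable              table;

        std::size_t size() const { return index.size(); }

        // Decoding can start right at item i's offset, because the stream
        // is just a concatenation of code words.
        T operator[](std::size_t i) const {
            return decode_item<T>(stream, index[i], table);
        }
    };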

Heinrich Apfelmus
  • For modification (of the elements themselves, not the length of the vector), the index table could be a Fenwick tree. This would allow recomputing the index on the fly with minimal changes. – Matthieu M. Aug 06 '10 at 15:23

Are you coding for an embedded system and/or do you have hundreds or thousands of these containers? If not, while I think this is an interesting theoretical question (+1), I suspect that the slowdown as a result of doing the decompression will be non-trivial and that it would be better to just use a std::vector.

Next, are you sure that the data you're storing is sufficiently redundant that smaller blocks of it will actually be compressible? Have you tried saving off blocks of different sizes (powers of 2 perhaps) and running them through gzip as an exercise? It may be that any extra data needed to help the decompression algorithm (depending on approach) would reduce the space benefits of doing this sort of compressed container.
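
For example, a quick way to check with zlib how well a block of your structs compresses (assuming the element type is trivially copyable so its bytes can be fed to the compressor directly):

    #include <vector>
    #include <zlib.h>

    // Rough compressibility check: ratio of compressed size to raw size for
    // one block of elements. Assumes Item is trivially copyable.
    template <typename Item>
    double compression_ratio(const std::vector<Item>& block) {
        const uLong src_len = static_cast<uLong>(block.size() * sizeof(Item));
        std::vector<Bytef> dest(compressBound(src_len));
        uLongf dest_len = static_cast<uLongf>(dest.size());
        if (compress(dest.data(), &dest_len,
                     reinterpret_cast<const Bytef*>(block.data()),
                     src_len) != Z_OK)
            return 1.0;                        // treat failure as "no gain"
        return static_cast<double>(dest_len) / static_cast<double>(src_len);
    }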

If you decide that it's still reasonable to do the compression, then there are at least a couple possibilities, none pre-written though. You could compress each individual element, storing a pointer to the compressed data chunk. Then index access is still constant, just needing to decompress the actual data. Possibly using a proxy object would make doing the actual data decompression easier and more transparent (and maybe even allow you to use std::vector as the underlying container).

Alternately, std::deque stores its data in chunks already, so you could use a similar approach here. For example, use std::vector<compressed_data_chunk> as your underlying container, where each chunk holds, say, 10 items compressed together. Then you can still directly index the chunk you need, decompress it, and return the item from the decompressed data. If you want, your containing object (that holds the vector) could even cache the most recently decompressed chunk or two for added performance on consecutive access (although this wouldn't help binary search very much at all).
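
A rough sketch of that chunked approach, using zlib purely as an example codec and assuming T is trivially copyable; the chunk size and the one-chunk cache are arbitrary choices here:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>
    #include <zlib.h>

    // Sketch of a "vector of compressed chunks". Error handling is omitted.
    template <typename T, std::size_t ChunkItems = 10>
    class chunked_compressed_vector {
    public:
        explicit chunked_compressed_vector(const std::vector<T>& data)
            : size_(data.size()) {
            for (std::size_t pos = 0; pos < data.size(); pos += ChunkItems) {
                const std::size_t n = std::min(ChunkItems, data.size() - pos);
                const uLong src_len = static_cast<uLong>(n * sizeof(T));
                std::vector<Bytef> buf(compressBound(src_len));
                uLongf dst_len = static_cast<uLongf>(buf.size());
                compress(buf.data(), &dst_len,
                         reinterpret_cast<const Bytef*>(data.data() + pos), src_len);
                buf.resize(dst_len);
                chunks_.push_back(std::move(buf));
            }
        }

        std::size_t size() const { return size_; }

        T operator[](std::size_t i) const {
            const std::size_t chunk = i / ChunkItems;
            if (chunk != cached_chunk_) {            // decompress only on a cache miss
                const std::size_t first = chunk * ChunkItems;
                const std::size_t n = std::min(ChunkItems, size_ - first);
                cache_.resize(n);
                uLongf dst_len = static_cast<uLongf>(n * sizeof(T));
                uncompress(reinterpret_cast<Bytef*>(cache_.data()), &dst_len,
                           chunks_[chunk].data(),
                           static_cast<uLong>(chunks_[chunk].size()));
                cached_chunk_ = chunk;
            }
            return cache_[i % ChunkItems];
        }

    private:
        std::size_t size_;
        std::vector<std::vector<Bytef>> chunks_;
        mutable std::vector<T> cache_;               // most recently decompressed chunk
        mutable std::size_t cached_chunk_ = static_cast<std::size_t>(-1);
    };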

Mark B
  • but... binary search hits a very few elements very frequently. Keeping the key values of these few items uncompressed might make the decompression penalty almost go away without significantly increasing the total size. – Ben Voigt Sep 13 '10 at 03:11

I've been thinking about this for a while now. From a theoretical point of view I identified 2 possibilities:

  • Flyweight, because repetition can be lessened with this pattern.
  • Serialization (compression is some form of serialization)

The first is purely object-oriented and, I think, fits well in general; it doesn't have the disadvantage of messing up pointers, for example.

The second seems better adapted here, although it does have a slight disadvantage in general: pointer invalidation plus issues with pointer encoding/decoding, virtual tables, etc. Notably, it doesn't work if the items refer to each other using pointers instead of indices.

I have seen a few "Huffman coding" solutions; however, this means that for each structure one needs to provide a compressing algorithm. It's not easy to generalize.

So I'd rather go the other way and use a compression library like zlib, picking a fast algorithm like LZO, for example.

  • B* tree (or a variant) with a large number of items per node (since the data doesn't move), say 1001. Each node contains a compressed representation of the array of items. Indices are not compressed.
  • Possibly: a cache_view to access the container while storing the last 5 (or so) decompressed nodes or something (a rough sketch follows after the remarks below). Another variant is to implement reference counting and keep the data uncompressed as long as someone has a handle to one of the items in the node.

Some remarks:

  • if you use a large number of items/keys per node, you have near-constant access time; for example, with 1001 it means that there are only 2 levels of indirection as long as you store less than a million items, 3 levels of indirection for a billion, etc.
  • you can build a readable/writable container with such a structure. I would make it so that I only recompress once I am done writing the node.
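
A bare-bones outline of that cache_view idea (compressed_node's contents and decompress_node are hypothetical placeholders for whatever codec is used):

    #include <cstddef>
    #include <deque>
    #include <utility>
    #include <vector>

    // Hypothetical placeholders for a compressed B*-tree node and the
    // routine (zlib/LZO/...) that expands it back into items.
    struct compressed_node { std::vector<unsigned char> bytes; };
    template <typename T>
    std::vector<T> decompress_node(const compressed_node& node);

    // cache_view: keeps the last N decompressed nodes around so that runs of
    // nearby accesses (or repeated binary-search probes) don't decompress
    // the same node over and over.
    template <typename T, std::size_t N = 5>
    class cache_view {
    public:
        explicit cache_view(const std::vector<compressed_node>& nodes)
            : nodes_(nodes) {}

        const std::vector<T>& node(std::size_t idx) const {
            for (const auto& entry : cache_)      // linear scan; N is tiny
                if (entry.first == idx)
                    return entry.second;
            cache_.emplace_front(idx, decompress_node<T>(nodes_[idx]));
            if (cache_.size() > N)
                cache_.pop_back();                // evict the oldest node
            return cache_.front().second;
        }

    private:
        const std::vector<compressed_node>& nodes_;
        mutable std::deque<std::pair<std::size_t, std::vector<T>>> cache_;
    };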
Matthieu M.

Okay, to the best of my understanding, what you want is some kind of accessor template. Basically, create a template adapter that takes one of your element types as its argument and accesses it internally via whatever you like: a pointer, an index into your blob, etc. Make the adapter pointer-like:

const T *operator->() const;

etc. since it's easier to create a pointer adapter than it is a reference adapter (though see vector if you want to know how to write one of those). Notice, I made this accessor constant as per your guidelines. Then, pre-compute your offsets when the blob is loaded / compressed and populate the vector with your templated adapter class. Does this make sense? If you need more details, I will be happy to provide.

As for the compression algorithm, I suggest you simply do a frequency analysis of bytes in your blob and then run your uncompressed blob through a hard-coded Huffman encoding (as was more or less suggested earlier), capturing the offset of each element and storing it in your proxy adapters, which in turn are the elements of your array. Indeed, you could make this all part of some compression class that compresses and generates elements that can be copy-back-inserted into your vector from the beginning. Again, reply if you need sample code.
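
A bare-bones version of such an adapter (decode_at and the blob layout are hypothetical; a real one would probably cache the decoded value so operator-> can hand out a pointer):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical decoder: expands one element starting at a precomputed
    // offset in the compressed blob, using the shared hard-coded Huffman table.
    template <typename T>
    T decode_at(const std::vector<std::uint8_t>& blob, std::size_t offset);

    // Pointer-like adapter; these are what the std::vector actually holds.
    template <typename T>
    class compressed_ptr {
    public:
        compressed_ptr(const std::vector<std::uint8_t>* blob, std::size_t offset)
            : blob_(blob), offset_(offset) {}

        // Decodes on access; a real implementation could cache the decoded T
        // and have operator-> return its address.
        T operator*() const { return decode_at<T>(*blob_, offset_); }

    private:
        const std::vector<std::uint8_t>* blob_;
        std::size_t offset_;
    };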

TimeHorse

Can some of the answers to "What is the best compression algorithm that allows random reads/writes in a file?" be adapted to your in-memory data?

David Cary