
I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on disk).

  • All update operations are cons (head appends).
  • All read accesses start at the head and traverse the list in order until they find a pre-determined pair.
  • A cache would be great, since most read accesses will hit the same data over and over.

So, this is basically a "disk-persisted-lazy-cacheable-List" ™

Any ideas on how to get one before I start to roll out my own?


Addendum: yes, mongodb, or any other non-embeddable resource, is overkill. If you are interested in a specific use-case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.

Hugo Sereno Ferreira
  • If you end up rolling your own, you'll probably want to implement something page-based. The requirement for head-appends makes things interesting, because files are appendable only at the end, and presumably you wouldn't want to read through the whole file to get the latest values. – Chris Shain Jan 30 '12 at 00:50
  • So just to be clear, you're looking for a Scala-based solution, and *not* an OS-based solution? Dealing with paging and swapping between disk and memory is typically viewed as an Operating System service. – Dan Burton Jan 30 '12 at 02:03
  • "All read accesses start in the head" and "a very, very long list of pairs"...are you sure you want O(n) lookup? – Nicholas White Jan 30 '12 at 07:02
  • @Chris Shain: the file should store the list in reverse order, so prepending to the list is appending to the file – Nicholas White Jan 30 '12 at 07:08
  • I don't want random access to the collection, so I am expecting O(1) to the head, and O(n) for the first n elements. – Hugo Sereno Ferreira Jan 30 '12 at 13:16
  • @RexKerr It really shouldn't matter... My specific use case is `X` is a `Long`, and `Y` anything serializable. – Hugo Sereno Ferreira Feb 02 '12 at 09:37
  • @HugoSFerreira - It matters a great deal if you can keep the hash codes for the items in memory. In your case, that's a "maybe". Could you afford 8 bytes per pair in memory? – Rex Kerr Feb 02 '12 at 16:33
  • @RexKerr, in the specific case I'm intending to use this, yes. So let's assume such. – Hugo Sereno Ferreira Feb 02 '12 at 22:40
  • @DanBurton: this is a database task, and databases do not rely on swapping because it is too generic and hence inefficient. They move pages in and out with a specially-optimized layout, quite different from in-memory storage. – Blaisorblade Feb 03 '12 at 22:54

4 Answers


The easiest way to do something like this is to extend `Traversable`. You only have to define `foreach`, and you have full control over the traversal, so you can do things like open and close the file.

You can also extend `Iterable`, which requires defining `iterator` and, of course, returning some sort of `Iterator`. In this case, you'd probably create an `Iterator` for the disk data, but it's going to be much harder to control things like open files (see the sketch after the example below).

Here's one example of such a `Traversable`, written by Josh Suereth:

class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      // apply f to each line until the end of the file
      def loop(): Unit = in.readLine match {
        case null => ()
        case line => f(line); loop()
      }
      loop()
    } finally {
      in.close() // the file is closed even if f throws
    }
  }
}
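
For contrast, here is a minimal sketch (mine, not from the original answer) of the `Iterable` variant. It illustrates exactly the control you lose compared to `foreach`: the reader can only be closed when the iterator is fully consumed, so a caller that abandons the iterator early leaks the file handle:

class FileLinesIterable(file: java.io.File) extends Iterable[String] {
  override def iterator: Iterator[String] = new Iterator[String] {
    private val in = new java.io.BufferedReader(new java.io.FileReader(file))
    private var nextLine = in.readLine()

    def hasNext: Boolean = nextLine != null

    def next(): String = {
      val line = nextLine
      nextLine = in.readLine()
      if (nextLine == null) in.close() // closed only if fully consumed
      line
    }
  }
}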
Daniel C. Sobral
  • Interesting, but that would only solve the reading part (and without any kind of cache). I was looking for something transparent with regard to the `cons` operation and cache management. – Hugo Sereno Ferreira Jan 30 '12 at 02:38
  • @HugoSFerreira It's not a solution to your problem, it's just an example of how `Traversable` can be extended to handle off-memory collections. – Daniel C. Sobral Jan 30 '12 at 12:41

You write:

mongodb, or any other non-embeddable resource, is overkill

Do you know that there are embeddable database engines, including some really small ones? If you do, I'm not sure why your exact requirements would rule them out.

Are you sure that Hibernate + an embeddable DB (say SQLite) would not be enough? Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
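
To make the fit concrete, here is a minimal sketch (mine, not part of the original answer) of the question's access pattern on embedded SQLite. Table and column names are made up for illustration, and the JDBC URL assumes the xerial sqlite-jdbc driver is on the classpath:

import java.sql.DriverManager

object TimelineDbDemo extends App {
  val conn = DriverManager.getConnection("jdbc:sqlite:timeline.db")
  conn.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS timeline (id INTEGER PRIMARY KEY AUTOINCREMENT, x INTEGER, y BLOB)")

  // cons == INSERT: newer pairs get higher ids
  val ins = conn.prepareStatement("INSERT INTO timeline (x, y) VALUES (?, ?)")
  ins.setLong(1, 42L)
  ins.setBytes(2, Array[Byte](1, 2, 3))
  ins.executeUpdate()

  // head-first traversal == ORDER BY id DESC: scan newest to oldest,
  // stopping as soon as the pre-determined pair is found
  val rs = conn.createStatement().executeQuery(
    "SELECT x, y FROM timeline ORDER BY id DESC")
  while (rs.next()) {
    val x = rs.getLong("x")
    val y = rs.getBytes("y")
    // ... test (x, y) and break out once matched
  }
  conn.close()
}

The embedded DB also gives you page management and caching for free, which is most of the work a hand-rolled solution would have to reimplement.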

If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write `[T <: Serializable]`. If not, you could write `[T: MySerializable]` after creating your own "type-class" `MySerializable[T]` (like for instance `Ordering[T]` in the Scala standard library).
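
A minimal sketch of what such a type class could look like (all names are hypothetical, following the `Ordering[T]` analogy):

// Hypothetical type class for pluggable serialization
trait MySerializable[T] {
  def toBytes(value: T): Array[Byte]
  def fromBytes(bytes: Array[Byte]): T
}

object MySerializable {
  // example instance for Long, matching the question's X component
  implicit val longSerializable: MySerializable[Long] = new MySerializable[Long] {
    def toBytes(value: Long): Array[Byte] =
      java.nio.ByteBuffer.allocate(8).putLong(value).array()
    def fromBytes(bytes: Array[Byte]): Long =
      java.nio.ByteBuffer.wrap(bytes).getLong
  }
}

object RecordIO {
  // generic in T via a context bound, as suggested above
  def write[T: MySerializable](out: java.io.OutputStream, value: T): Unit =
    out.write(implicitly[MySerializable[T]].toBytes(value))
}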

However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement because it suggests the use of Java serialization for this, but I guess you can relax that to "anything serializable with my framework of choice". Java serialization is extremely inefficient in time and space and is not designed to serialize a single object; instead, it gives you back a file complete with special headers. I would suggest using a different serialization framework; have a look here for a comparison.

Additional reasons not to go down the road of a custom implementation

In addition, it sounds like you would be reading the file essentially backward, and that's quite a bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.

Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.

Blaisorblade

This Java library may contain what you need. It aims to store entries in memory more efficiently than standard Java collections.

http://code.google.com/p/vanilla-java/wiki/HugeCollections

Rich
  • Fast-forward 5 years, this project is now developed in https://github.com/OpenHFT/Chronicle-Queue and https://github.com/OpenHFT/Chronicle-Map – leventov Mar 18 '17 at 23:27
  • Thanks for adding the comment. Funnily enough my current project makes use of ChronicleMap for storing MD5 sums of data to detect duplicates! – Rich Mar 19 '17 at 07:48

If you don't want to step up to one of the embeddable DBs, how about a stack in memory-mapped files?

  • A stack seems to meet your desired access characteristics (push a bunch of data, then frequently iterate over the most recently pushed data).
  • You can use Java's `MappedByteBuffer` directly from Scala. You get to address the file as if it were memory, without actually loading the whole file into memory (see the sketch after this list).
  • You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OS's file cache until the OS saw fit to flush them (or you flushed them manually) back to disk.
  • You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem, since it will still be in memory. (Though if you're reading data that you've written over hours/days across pages, then it might be a problem.)
  • A file addressed in this way is limited in size to 2 GB even on a 64-bit JVM, but you can use multiple files to overcome this limitation.
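
A minimal sketch of the idea (mine, not part of the original answer): each record carries a trailing size field so the stack can be walked from the newest entry backwards. All names and the fixed capacity are illustrative; there is no growth handling, no multi-file support, and single-threaded use is assumed:

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

class MappedStack(path: String, capacityBytes: Int = 1 << 20) {
  private val channel = new RandomAccessFile(path, "rw").getChannel
  private val buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes)

  // push: append one (Long, Array[Byte]) record, with a trailing size field
  // so the stack can later be walked backwards from the newest entry
  def push(x: Long, y: Array[Byte]): Unit = {
    buf.putLong(x)            // 8 bytes
    buf.putInt(y.length)      // 4 bytes
    buf.put(y)                // y.length bytes
    buf.putInt(16 + y.length) // trailer: total record size, 4 bytes
  }

  // iterate from the most recently pushed record towards the oldest
  def foreachNewestFirst(f: (Long, Array[Byte]) => Unit): Unit = {
    var pos = buf.position()
    while (pos > 0) {
      val recSize = buf.getInt(pos - 4) // read the trailer
      val start = pos - recSize
      val x = buf.getLong(start)
      val len = buf.getInt(start + 8)
      val y = new Array[Byte](len)
      val view = buf.duplicate() // independent cursor for the bulk read
      view.position(start + 12)
      view.get(y)
      f(x, y)
      pos = start
    }
  }
}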
framed
  • The caching point is good, and I had thought about it, but I didn't figure out how to convert an array of objects to an array of bytes in an efficient way. You cannot easily use a memory-mapped file as your heap, unlike in C (where it's also not that trivial, because you still need to implement a memory allocator). – Blaisorblade Feb 09 '12 at 23:29