I'd like a class that provides an interface to some analysis results, where the source of each result is determined at runtime. The very first time a result is calculated, it must come from the original data file. This calculation takes a long time, so I'd like to store the result in a result file for quick access in the future (I plan to use msgpack for this). But if the result is still in memory, I'd like to use that version so I don't need to keep reading the result file.

The program loops through many of these classes, so there is potentially a lot of memory usage. For this reason, I think it would be best to have a global cache of fixed size (or time limit). Unfortunately, which results will be needed isn't known in advance, so I can't handle this with some preprocessing step. I need the ability to get a new result from the original file if needed.

So I imagine the class will look something like:

class MyClass(object):
    def __init__(self, orig_file, result_file, cache):
        self.orig_file = orig_file
        self.result_file = result_file
        self.cache = cache

    def get_result(self, name):
        if name in self.cache:
            return self.cache[name]

        elif name in self.result_file:
            return self.result_file[name]

        else:  # compute for first time
            return self.orig_file.calculate(name)
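
A minimal sketch of that lookup order (memory, then result file, then the original data), assuming a few things not stated above: `LRUCache`, `Analysis` and `fake_calculate` are hypothetical names, the standard-library `json` module stands in for msgpack, and the shared cache is keyed by result file path plus result name so that instances with different files do not collide.

```python
import json
import os
import tempfile
from collections import OrderedDict


class LRUCache:
    """Fixed-size in-memory cache that evicts the least recently used entry."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._data = OrderedDict()

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # drop the least recently used entry


class Analysis:
    def __init__(self, calculate, result_path, cache):
        self.calculate = calculate      # hypothetical stand-in for orig_file.calculate
        self.result_path = result_path  # JSON here; msgpack would work the same way
        self.cache = cache              # one cache object shared by all instances

    def _load_results(self):
        if os.path.exists(self.result_path):
            with open(self.result_path) as fh:
                return json.load(fh)
        return {}

    def get_result(self, name):
        key = (self.result_path, name)
        if key in self.cache:                    # 1. still in memory
            return self.cache[key]
        stored = self._load_results()            # 2. saved by an earlier run
        if name not in stored:
            stored[name] = self.calculate(name)  # 3. slow path, first time only
            with open(self.result_path, "w") as fh:
                json.dump(stored, fh)
        self.cache[key] = stored[name]
        return stored[name]


calls = []

def fake_calculate(name):
    """Hypothetical expensive calculation; records how often it runs."""
    calls.append(name)
    return len(name)

path = os.path.join(tempfile.mkdtemp(), "results.json")
shared = LRUCache(max_items=2)
a = Analysis(fake_calculate, path, shared)
a.get_result("mean")                              # computed and written to disk
a.get_result("mean")                              # served from memory
b = Analysis(fake_calculate, path, LRUCache(2))   # fresh cache, same result file
from_disk = b.get_result("mean")                  # read from the result file
```

Rewriting the whole result file on every new result keeps the sketch short; a real implementation would probably append or use a keyed store instead.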

Initial searches returned so many potential routes that I became overwhelmed. I'm considering some kind of persistent memoization like this, but I need the persistence to be specific to a class instance. Then I found this persistent lazy caching dictionary, which looks very nice (though I don't need lazy evaluation), but it's not clear how to work the original file into this.

Are there any tools which go some way towards answering this question? And is it reasonable to attempt to achieve this with decorators? (e.g. decorate methods that would do this source search)

David Hall
  • "The program loops through many of these classes" - Are you saying that there will (potentially) be many instances of the class and each one will need access to the *same* data? – wwii Mar 12 '16 at 06:29
  • There are many instances of the class, but each one will access a different original file. – David Hall Mar 12 '16 at 06:33
  • My suggestion: don't bother doing your own in-memory caching, just store the data in your result files, and rely on your OS's file caching to provide fast access to that data. – PM 2Ring Mar 12 '16 at 07:44

1 Answer

Have you looked at joblib?

It basically provides these three features:

  1. transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
  2. easy simple parallel computing
  3. logging and tracing of the execution

I ran across it while searching for number 2 and am considering using it, but I recalled that it also offers memoization and disk caching.
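
As a rough illustration of the disk-caching feature, here is a minimal sketch using `joblib.Memory`. The function `slow_analysis`, its arguments and the temporary cache directory are made up for the example; a real project would point `Memory` at a persistent directory.

```python
import tempfile

from joblib import Memory

# Hypothetical cache location; use a persistent path to keep results between runs.
cache_dir = tempfile.mkdtemp()
memory = Memory(cache_dir, verbose=0)

calls = []  # records how often the function body actually runs


@memory.cache
def slow_analysis(name, scale):
    """Stand-in for an expensive calculation on the original data."""
    calls.append(name)
    return [scale * i for i in range(5)]


first = slow_analysis("spectrum", 2)   # computed, then written to cache_dir
second = slow_analysis("spectrum", 2)  # same arguments: loaded from disk
```

Note that this caches on disk only; it does not by itself give the bounded in-memory layer asked about in the question.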

bastula
  • Thanks Aditya. This does look like a useful package, but in the end I wrote my own decorator that made use of `h5py`. I'm a big fan of HDF5. :-) – David Hall Jul 21 '16 at 14:55
  • Fair enough. Thanks for accepting the answer. Btw, thanks for the great work on Topas. – bastula Jul 22 '16 at 02:16