
I'm using an object database (ZODB) to store complex relationships between many objects, but I am running into performance issues. As a result I started to construct indexes to speed up object retrieval and insertion. Here is my story, and I hope you can help.

Initially, when I added an object to the database I would insert it into a branch dedicated to that object type. To prevent multiple objects representing the same entity, I added a method that iterated over the existing objects in the branch to find duplicates. This worked at first, but as the database grew, the time it took to load every object into memory and check its attributes grew with the size of the branch and quickly became unacceptable.

To solve that issue I started to create indexes based on the attributes of the object, so that when an object was added it was saved in the type branch as well as in an attribute-value index branch. For example, say I was saving a person object with attributes firstName = 'John' and lastName = 'Smith': the object would be appended to the person type branch and would also be appended to lists within the attribute index branch under the keys 'John' and 'Smith'.

This saved a lot of time with duplicate checking, since the new object could be analysed and only the set of objects that intersect across the attribute indexes needed to be checked.
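To make that concrete, the index branch looks roughly like this in simplified form (the class and attribute names are just for illustration; in the real database the trees hang off branches of the ZODB root):

    # Simplified sketch: one tree per indexed attribute, mapping an
    # attribute value to the ids of person objects in the type branch.
    from BTrees.OOBTree import OOBTree, OOTreeSet, intersection

    class AttributeIndex:
        def __init__(self, attributes=('firstName', 'lastName')):
            self.trees = {attr: OOBTree() for attr in attributes}

        def index(self, person_id, person):
            for attr, tree in self.trees.items():
                value = getattr(person, attr)
                if value not in tree:
                    tree[value] = OOTreeSet()
                tree[value].insert(person_id)

        def duplicate_candidates(self, person):
            # Only ids appearing under *every* attribute value need to be
            # loaded and compared, instead of the whole type branch.
            result = None
            for attr, tree in self.trees.items():
                ids = tree.get(getattr(person, attr), OOTreeSet())
                result = ids if result is None else intersection(result, ids)
            return result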

However, I quickly ran into another issue when updating objects. The indexes need to be updated to reflect the fact that they may no longer be accurate. This requires either remembering the old values so that they can be accessed directly and the object removed, or iterating over all values of an attribute type in order to find and then remove the object. Either way, performance is quickly beginning to degrade again and I can't figure out a way to solve it.
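In simplified form the update path looks something like this (again illustrative, building on the AttributeIndex sketch above):

    # Without the old values I have to scan every value bucket of every
    # attribute tree just to find and remove the stale entries.
    def unindex(index, person_id, old_values=None):
        for attr, tree in index.trees.items():
            if old_values is not None:
                # Fast path: the old value was remembered somewhere.
                ids = tree.get(old_values[attr])
                if ids is not None and person_id in ids:
                    ids.remove(person_id)
            else:
                # Slow path: walk every value bucket looking for the id.
                for ids in tree.values():
                    if person_id in ids:
                        ids.remove(person_id)

    def update(index, person_id, person, old_values=None):
        unindex(index, person_id, old_values)
        index.index(person_id, person)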

Have you had this kind of issue before? What did you do to solve it, or is this just something I have to deal with when using OODBMSs?

Thanks in advance for the help.

Evrim
  • """However, I quickly ran into another issue with regards to dealing when updating objects. The indexes would need to updated to reflect the fact that they may not be accurate any more. This requires either remembering old values so that they could be directly accessed and the object removed or iterating over all values of an attribute type in order to find then remove the object""" This claim appears weird. The ZODB is ACID compatible and transaction. So you should be able to implement a storage being consistent with the indexes... – Andreas Jung Jul 12 '11 at 17:22
  • @Blackmoon: So as I've been doing more research on this, it seems that I'm trying to re-invent the wheel... repoze.catalog does a lot of what I'm trying to do. ZODB is ACID compliant and the safety of my data was never in question; it's that since ZODB doesn't have any native indexing facilities, indexing becomes an application concern that has to be handled in Python. My own attempt at indexing was slowing things down, but it seems that repoze.catalog may be able to help with this. I'll report back after I try it out. – Evrim Jul 12 '11 at 18:52

2 Answers


Yes, repoze.catalog is nice, and well documented.

In short: don't make indexing part of your site structure!

  1. Look at using a container/item hierarchy to store and traverse content item objects; plan to be able to traverse content either (a) by path (graph edges look like a filesystem) or (b) by identifying singleton containers at some distinct location.

  2. Identify your content using either RFC 4122 UUIDs (uuid.UUID type) or 64-bit integers.

  3. Use a central catalog for indexing (e.g. repoze.catalog); the catalog should live at a known location relative to the root application object of your ZODB. Your catalog will likely index attributes of objects and return record ids (usually integers) on query. Your job is to map those integer ids (perhaps indirecting via UUIDs) to some physical traversal path in the database where you are storing content; a sketch follows this list. It helps if you use zope.location and zope.container for common interfaces for traversal of your object graph from root/application downward.

  4. Use zope.lifecycleevent handlers to index content and keep things fresh.
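A minimal sketch of steps 2 and 3 using repoze.catalog's documented API (the app_root mapping, the Person class, and the path bookkeeping are illustrative assumptions, not a prescription):

    from persistent import Persistent
    from repoze.catalog.catalog import Catalog
    from repoze.catalog.indexes.field import CatalogFieldIndex
    from repoze.catalog.document import DocumentMap
    from repoze.catalog.query import Eq

    class Person(Persistent):
        def __init__(self, firstName, lastName):
            self.firstName = firstName
            self.lastName = lastName

    def setup_catalog(app_root):
        # The catalog lives at a known location relative to the app root.
        catalog = Catalog()
        catalog['firstName'] = CatalogFieldIndex('firstName')
        catalog['lastName'] = CatalogFieldIndex('lastName')
        app_root['catalog'] = catalog
        app_root['docmap'] = DocumentMap()   # docid <-> traversal path
        return catalog

    def add_person(app_root, path, person):
        docid = app_root['docmap'].add(path)         # integer record id
        app_root['catalog'].index_doc(docid, person)
        return docid

    def find_duplicates(app_root, person):
        numdocs, docids = app_root['catalog'].query(
            Eq('firstName', person.firstName) & Eq('lastName', person.lastName))
        # Resolve record ids back to traversal paths via the document map.
        return [app_root['docmap'].address_for_docid(d) for d in docids]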

The problem -- generalized

ZODB is too flexible: it is just a persistent object graph with transactions, but this leaves room for you to sink or swim in your own data-structures and interfaces.

The solution -- generalized

Usually, just picking pre-existing idioms from the community around the ZODB will work: zope.lifecycleevent handlers, "containerish" traversal using zope.container and zope.location, and something like repoze.catalog.

More particular

Only when you have exhausted the generalized idioms and know why they won't work should you try to build your own indexes using the various flavors of BTrees in ZODB. I actually do this more than I care to admit, but usually with good cause.
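For instance, a hand-rolled field index usually pairs a forward index (value -> docids) with a reverse index (docid -> value), so that unindexing never has to scan; zope.index's FieldIndex is essentially the production-quality version of this pattern. A sketch (names illustrative):

    from persistent import Persistent
    from BTrees.OOBTree import OOBTree
    from BTrees.IOBTree import IOBTree
    from BTrees.IFBTree import IFTreeSet

    class FieldIndex(Persistent):
        def __init__(self):
            self._forward = OOBTree()   # value -> IFTreeSet of docids
            self._reverse = IOBTree()   # docid -> value

        def index_doc(self, docid, value):
            self.unindex_doc(docid)     # makes re-indexing safe
            if value not in self._forward:
                self._forward[value] = IFTreeSet()
            self._forward[value].insert(docid)
            self._reverse[docid] = value

        def unindex_doc(self, docid):
            old = self._reverse.pop(docid, None)
            if old is not None:
                docids = self._forward.get(old)
                if docids is not None and docid in docids:
                    docids.remove(docid)
                    if not docids:
                        del self._forward[old]

        def apply(self, value):
            return self._forward.get(value, IFTreeSet())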

In all cases, keep your indexes (search, discovery) and site (traversal and storage) structure distinct.

The idioms for the problem domain

  • Master ZODB BTrees: you likely want:

    • To store content objects as subclasses of Persistent in containers that are subclasses of OOBTree providing container interfaces (see below).
    • To store BTrees for your catalog or global indexes, or to use packages like repoze.catalog and zope.index to abstract that detail away (hint: catalog solutions typically store indexes as OIBTrees that yield integer record ids for search results; you then typically have some sort of document-mapper utility that translates those record ids into something resolvable in your application, such as a UUID (provided you can traverse the graph to the UUID) or a path (the way the Zope2 catalog does)).
  • IMHO, don't bother working with intids and key references and the like (these are less idiomatic and more difficult if you don't need them). Just use a Catalog and DocumentMap from repoze.catalog to get results in integer-to-UUID or integer-to-path form, and then figure out how to get your object. Note that you likely want some utility/singleton whose job is to retrieve your object given an id or UUID returned from a search.

  • Use zope.lifecycleevent or a similar package that provides synchronous event callback (handler) registrations. These handlers are what you should call whenever an atomic edit is made to your object (likely once per transaction, but not inside the transaction machinery); see the glue sketch after this list.

  • Learn the Zope Component Architecture; it's not an absolute requirement, but it is surely helpful, even if just to understand the zope.interface interfaces of upstream packages like zope.container.

  • Understand how Zope2 (ZCatalog) does this: a catalog fronts multiple indexes of various sorts; each index searches for a query using its own specialized data structures and returns a sequence of integer record ids. The catalog merges these across indexes with set intersections and returns a lazy mapping of "brain" objects containing metadata stubs (each brain has a getObject() method to fetch the actual content object). Getting actual objects from a catalog search relies on the Zope2 idiom of using paths from the root application object to identify the location of the cataloged item.
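A glue sketch tying the idioms together (illustrative only; catalog, docmap and the path_for callable are assumed to exist as in the earlier catalog sketch):

    import zope.event
    from zope.lifecycleevent import ObjectModifiedEvent
    from zope.container.btree import BTreeContainer

    class People(BTreeContainer):
        """A container branch for Person content: traversal/storage only."""

    def make_reindex_handler(catalog, docmap, path_for):
        # path_for is an assumed callable mapping a content object to its
        # traversal path; catalog and docmap are as in the earlier sketch.
        def handler(event):
            if isinstance(event, ObjectModifiedEvent):
                obj = event.object
                docid = docmap.docid_for_address(path_for(obj))
                if docid is not None:
                    catalog.reindex_doc(docid, obj)
        return handler

    # Registration via the plain zope.event subscriber list; a full Zope
    # Component Architecture setup would register the handler for
    # IObjectModifiedEvent instead:
    #   zope.event.subscribers.append(make_reindex_handler(catalog, docmap, path_for))
    #
    # After any atomic edit, notify so the indexes stay consistent:
    #   zope.event.notify(ObjectModifiedEvent(person))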

sdupton
  • this is amazing. Thanks! It's given me a lot to think about and try. I'll report back with my progress. – Evrim Jul 13 '11 at 20:37

Think about using an attribute hash (something like Java's hashCode()) and then using the 32-bit hash value as the key. Python has a built-in hash function, but I am not really familiar with it.
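For example, something along these lines (illustrative; the attribute names come from the question, and a stable digest is used because Python's built-in hash() of strings is randomized per process on modern versions):

    import hashlib

    def attribute_key(person):
        # Derive a stable 32-bit key from the identifying attributes.
        digest = hashlib.sha1(
            ('%s|%s' % (person.firstName, person.lastName)).encode('utf-8')
        ).digest()
        return int.from_bytes(digest[:4], 'big')   # like a Java-style hashCode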

CAB
  • You can use any immutable Python object as a key within a BTree, and speed isn't really affected by that. What seems to be slowing things down is my attempt at indexing, which requires multiple write operations in order to work. As I mentioned above to @Blackmoon, I've just discovered repoze.catalog and will give that a try and report back any progress. – Evrim Jul 12 '11 at 18:54