
Currently I have a situation which makes me scared. I have about 20k rows in the DB, which isn't even 1% of the data I will have in the next 3 months. Each row represents an object (let's call it Object1) with some data. I also have a table with stats for each Object1, let's call it Object1Stats, which lives in MongoDB. There is an Object1Stats record for each day, so to get, for example, the total stats, I have to sum every Object1Stats for an Object1.

The problem is: I often need this data precalculated, for example to display it to the user with the ability to sort the Object1 collection by stats. I could load it and sort it in code, but with, say, 5 million Object1 rows that would be too expensive.

So I came up with the idea of precalculating the stats every hour (Object1Stats is updated twice an hour) for each Object1. But this process makes me worry about the time it will take: I have to take each Object1, send a query to MongoDB to sum its Object1Stats, and then issue an SQL UPDATE for that Object1. Repeat this AT LEAST 3 million times.

I have two bottlenecks here: the calculation of the sum (MapReduce) in MongoDB and the SQL UPDATE queries in PostgreSQL. I can't really speed up the MapReduce right now (I assume it will perform well enough), but I'm thinking about the SQL updates.

Any thoughts or suggestions? I'm open to anything, even suggestions to use a different DB or approach.

Also, I can't just append new stats data for an object, because the last day's stats can change often, and previous days' stats can change too.

dt0xff
  • There is one rule of thumb in data modelling: "don't store aggregates". If you *do have to store them*, make sure they can be updated incrementally (eg by triggers on the *detail* table(s)) – joop Jun 12 '15 at 13:42

1 Answer


Some ideas on the PostgreSQL end:

Use COPY to load the fresh data into a temporary table, then update the objects with a single query; it's faster than issuing each update separately. See also this answer. (If your driver allows it, then besides the COPY and multi-valued INSERT options there is also the option to pipeline the updates.)
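
A minimal sketch of that approach, assuming hypothetical table and column names (an object1 table with an id primary key and a stats_total column):

```sql
BEGIN;

-- Staging table for the freshly computed sums; dropped automatically at commit.
CREATE TEMP TABLE stats_staging (
    object1_id  bigint PRIMARY KEY,
    stats_total bigint NOT NULL
) ON COMMIT DROP;

-- Bulk-load the precalculated sums, e.g. a CSV produced from the MongoDB aggregation.
-- (Use psql's \copy instead if the file lives on the client machine.)
COPY stats_staging (object1_id, stats_total) FROM '/tmp/object1_stats.csv' WITH (FORMAT csv);

-- One set-based UPDATE instead of millions of single-row UPDATEs.
UPDATE object1 o
SET    stats_total = s.stats_total
FROM   stats_staging s
WHERE  o.id = s.object1_id;

COMMIT;
```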

Keep the often-updated part of the object (the stats) in a separate table.
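
For example (the schema is hypothetical), the wide, rarely-changing object row is never rewritten by the hourly job; only the narrow stats row is:

```sql
CREATE TABLE object1 (
    id   bigint PRIMARY KEY,
    name text
    -- ... other rarely-changing columns
);

CREATE TABLE object1_stats (
    object1_id  bigint PRIMARY KEY REFERENCES object1 (id),
    stats_total bigint NOT NULL
);
```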

If you are sure that all the objects get updated every time, then you might want to load the updated stats with COPY into a new table and then switch the tables (DROP TABLE stats; ALTER TABLE new_stats RENAME TO stats).
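
A sketch of that swap, using the same hypothetical column names as above (note that DROP TABLE takes a brief exclusive lock and will fail if other tables reference stats via foreign keys):

```sql
BEGIN;

-- Build the replacement off to the side...
CREATE TABLE new_stats (LIKE stats INCLUDING ALL);
COPY new_stats (object1_id, stats_total) FROM '/tmp/object1_stats.csv' WITH (FORMAT csv);

-- ...then swap it in atomically.
DROP TABLE stats;
ALTER TABLE new_stats RENAME TO stats;

COMMIT;
```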

If, on the other hand, you update the stats in well-defined batches (e.g. first the stats of objects 1..99999, then the stats of objects 100000..199999, and so on), then you might partition the stats table according to these batches.
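
A sketch of such a layout using declarative partitioning (PostgreSQL 10+; on older versions the same idea is done with table inheritance); the names and ranges are illustrative:

```sql
CREATE TABLE stats (
    object1_id  bigint NOT NULL,
    stats_total bigint NOT NULL
) PARTITION BY RANGE (object1_id);

-- One partition per update batch, so each batch only rewrites its own partition.
CREATE TABLE stats_p0 PARTITION OF stats FOR VALUES FROM (1)      TO (100000);
CREATE TABLE stats_p1 PARTITION OF stats FOR VALUES FROM (100000) TO (200000);
-- ... and so on
```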

Another angle would be to load the stats directly from MongoDB, on demand, using a foreign data wrapper. You might want to access the stats through a stored procedure which caches them in a local table; updating the stats is then tantamount to truncating the cache. The downside of this approach is that PostgreSQL will issue a separate MongoDB request for every stat it fetches, so if your queries need to touch a lot of stats this approach might be worse than the hourly batch update.
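
Here is roughly what that could look like with the mongo_fdw extension (option names vary between FDW versions, and all object names here are assumptions, so treat this as a sketch):

```sql
CREATE EXTENSION mongo_fdw;

CREATE SERVER mongo_server
    FOREIGN DATA WRAPPER mongo_fdw
    OPTIONS (address '127.0.0.1', port '27017');

CREATE USER MAPPING FOR CURRENT_USER SERVER mongo_server;

-- Map the MongoDB collection holding the per-day stats (names are hypothetical).
CREATE FOREIGN TABLE object1_stats_remote (
    object1_id bigint,
    day        date,
    value      bigint
) SERVER mongo_server
  OPTIONS (database 'mydb', collection 'object1stats');

-- On-demand aggregation for one object:
SELECT sum(value) FROM object1_stats_remote WHERE object1_id = 42;
```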

Yet another way is to create a MongoDB "river": a driver which pushes the stat changes into PostgreSQL as soon as they occur in MongoDB. That way you only pay for what you use, updating only those PostgreSQL objects that have actually changed in MongoDB, and the load is spread more evenly. IMO this is the preferred way, but I personally don't know how hard it is to write such a "river" driver.
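
On the PostgreSQL side such a driver boils down to one upsert per change it observes; a minimal sketch (INSERT ... ON CONFLICT requires PostgreSQL 9.5+, and the table and placeholders are whatever your driver binds):

```sql
-- Executed by the "river" process for every changed Object1Stats document.
INSERT INTO object1_stats (object1_id, stats_total)
VALUES ($1, $2)
ON CONFLICT (object1_id)
DO UPDATE SET stats_total = EXCLUDED.stats_total;
```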

P.S. Here's a blog post about using NOTIFY to update Elasticsearch: http://evol-monkey.blogspot.ru/2014/08/postgresql-and-elasticsearch.html
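
The same NOTIFY pattern could announce stats changes to an external worker; a minimal sketch (channel, function, and table names are illustrative):

```sql
CREATE OR REPLACE FUNCTION notify_stats_change() RETURNS trigger AS $$
BEGIN
    -- Announce the changed object's id on the 'stats_changed' channel.
    PERFORM pg_notify('stats_changed', NEW.object1_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_change_notify
AFTER INSERT OR UPDATE ON object1_stats
FOR EACH ROW EXECUTE PROCEDURE notify_stats_change();

-- A worker holding an open connection would run:  LISTEN stats_changed;
```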

ArtemGr