
I have a collection of ~500M documents. Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever the document is returned by a query. After a few months of running the system in production, I discover that only 5% of the documents have a counter greater than 0 (zero). Meaning, 95% of the documents are never used.

My question is: Is there an efficient way to arrange these documents to speed up query execution time, based on the fact that 95% of the documents are not used?

What is the best practice in this case?

If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?

Guy Hudara
  • Can you tell us more about the data? Ideally, you might even be able to add appropriate indices on existing columns right now. – Tim Biegeleisen Jul 26 '20 at 09:00
  • Of course indexes are good in this situation; without indexes, MongoDB must perform a collection scan, which may be much slower. – White Link Jul 26 '20 at 09:10
  • I am aware of indexes and how to use them. My question is how to use the fact that more than 95% of my data is actually "garbage" to speed up the result. (No, I cannot delete the "garbage", since I don't know what is "garbage" and what is not.) Only after a few months of real production work will I know which documents are used. And sure, it may happen that an unused document becomes used. – Guy Hudara Jul 26 '20 at 12:08

1 Answer


~500M documents: that is quite a solid figure, good job if that's true. So here is how I see the solution to the problem:

  1. If you are willing to re-write/re-factor and rebuild the app's database, you could use the versioning pattern.

What does it look like?

Imagine you have two collections (or even two databases, if you are using a microservice architecture):

Relevant docs / Irrelevant docs.

Basically, you could run find only on the relevant-docs collection (which stores the 5% of your docs that are actually used) and, only if nothing is found there, fall back to Irrelevant.find(). This pattern also lets you keep old/historical data and manage it via a TTL index or a capped collection.
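As a rough illustration, the fallback lookup could look like the sketch below. The collection names and document shape are made up for the example, and the collections are mocked as in-memory Maps; with a real driver these would be calls like db.relevant_docs.findOne(...) instead:

```javascript
// Sketch of the two-collection fallback pattern (collections mocked as Maps).
// Hypothetical data: "doc1" is hot (counter > 0), "doc2" is cold.
const relevantDocs = new Map([["doc1", { _id: "doc1", counter: 42 }]]);
const irrelevantDocs = new Map([["doc2", { _id: "doc2", counter: 0 }]]);

function findDoc(id) {
  // 1. Hit the small "hot" collection first (the ~5% of useful docs).
  const hot = relevantDocs.get(id);
  if (hot) return hot;
  // 2. Only on a miss, fall back to the large "cold" collection.
  const cold = irrelevantDocs.get(id);
  if (cold) {
    // Optionally promote the document so future reads stay on the fast path
    // (this covers the "an unused document becomes used" case).
    relevantDocs.set(id, cold);
    irrelevantDocs.delete(id);
  }
  return cold ?? null;
}
```

The promotion step is optional; without it you would instead move documents between the collections with a periodic batch job based on the counters.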

You could also add some Redis magic to it (which uses precisely the same logic); take a look: prototype

This article can also be helpful (as many others, like this SO question)

But don't try to replace Mongo with Redis, team them up instead.
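The Redis tie-in above is essentially a cache-aside lookup in front of MongoDB. A minimal sketch, with both stores mocked as Maps (in production the cache would be a Redis client using GET/SET with an expiry, and the store would be your collection):

```javascript
// Cache-aside sketch: check the cache first, fall back to the database,
// then populate the cache on a miss. All names here are illustrative.
const cache = new Map();                          // stands in for Redis
const mongo = new Map([["q1", { _id: "q1" }]]);   // stands in for the collection

function cachedFind(id) {
  if (cache.has(id)) return cache.get(id); // fast path: cache hit
  const doc = mongo.get(id) ?? null;       // slow path: database lookup
  if (doc) cache.set(id, doc);             // with Redis: SET ... EX <ttl>
  return doc;
}
```

The TTL (the EX argument in a real Redis SET) is what keeps the cache from accumulating the 95% of documents that are rarely requested.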

  2. Using indexes and .explain()

If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?

Yes, it will help with your problem. To see it in action, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on a single field, measure the performance by querying only that field.
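Since only ~5% of documents would ever have consumed: true, a partial index is worth considering here: it only stores entries for documents matching a filter, so it stays small instead of covering all 500M rows. Below is a sketch of the index specification and a matching query as plain objects; the collection name (docs) and field names are assumptions for the example:

```javascript
// Index spec with a partialFilterExpression: only documents where
// consumed is true get an index entry, keeping the index tiny.
const indexSpec = {
  key: { consumed: 1 },
  partialFilterExpression: { consumed: true },
};

// The query must include the filter condition, otherwise the planner
// cannot use the partial index (it would miss the excluded documents).
const query = { consumed: true };
```

In the shell this would be roughly db.docs.createIndex({ consumed: 1 }, { partialFilterExpression: { consumed: true } }), and you can confirm the index is picked up via .explain().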

The result should look like this: Index usage

If your index is being used (and actually speeds things up), Compass will show it.

To measure the performance of the queries (with and without indexing), use Explain tab.

Actually, all of this can be done without Compass itself, via .explain() and .createIndex() queries. But Compass gives better visuals for the process, so it's better to use it, especially since it has become completely free for everyone.

AlexZeDim