32

It seems like I run into lots of situations where the appropriate way to build my data is to split it into two documents. Let's say it was for a chain of stores and you were saving which stores each customer had visited. Stores and Customers need to be independent pieces of data because they interact with plenty of other things, but we do need to relate them.

So the easy answer is to store the user's ID in the store document, or the store's ID in the user's document. Oftentimes, though, you want to access one or two other pieces of data for display purposes, because IDs aren't useful on their own. Like maybe the customer name, or the store name.

  1. Do you typically store a duplicate of the entire document? Or just store the pieces of data you need? Maybe depends on the size of the doc vs how much of it you need.
  2. How do you handle the fact that you have duplicate data? Do you go hunt down data when it changes? Update the data at some interval when it's loaded? Only duplicate when you can afford stale data?

Would appreciate your input and/or links to any kind of 'best practices' or at least well-reasoned discussion of these topics.

Cœur
Jim

3 Answers

31

There are basically two scenarios: fresh and stale.

Fresh data

Storing duplicate data is easy. Maintaining the duplicate data is the hard part. So the easiest thing to do is to avoid maintenance, by simply not storing any duplicate data to begin with. This is mainly useful if you need fresh data. Only store the references, and query the collections when you need to retrieve information.

In this scenario, you'll have some overhead due to the extra queries. The alternative is to track all locations of duplicate data, and update all instances on each update. This also involves overhead, especially in N-to-M relations like the one you mentioned. So either way, you will have some overhead, if you require fresh data. You can't have the best of both worlds.
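
To make the reference-only option concrete, here is a minimal sketch in plain Python. The dictionaries are hypothetical in-memory stand-ins for the Stores and Customers collections, and `store_ids` and `visited_store_names` are names invented for illustration, not part of any MongoDB API.

```python
# Hypothetical in-memory stand-ins for the Stores and Customers collections.
stores = {
    "s1": {"_id": "s1", "name": "Safeway"},
    "s2": {"_id": "s2", "name": "Walmart"},
}
customers = {
    # The customer holds only references (ids), never a copy of store data.
    "c1": {"_id": "c1", "name": "Testy Tester", "store_ids": ["s1", "s2"]},
}


def visited_store_names(customer_id):
    """Resolve the customer's store references at read time (the extra lookups)."""
    customer = customers[customer_id]
    return [stores[store_id]["name"] for store_id in customer["store_ids"]]


# A rename touches exactly one document, and the next read already sees it.
stores["s2"]["name"] = "Walmart Supercenter"
names = visited_store_names("c1")
print(names)
```

The cost is one lookup per referenced store on every read; the benefit is that there is nothing to keep in sync.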

Stale data

If you can afford to have stale data, things get a lot easier. Duplicate data avoids the query overhead, but you still don't want to maintain it by hand. So you don't store duplicate data actively; you generate it.

In this scenario you'll also want to store only the references between documents. Then use a periodic map-reduce job to generate the duplicate data. You can then query the single map-reduce result, rather than separate collections. This way you avoid the query overhead, but you also don't have to hunt down data changes.
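
As a rough illustration of the periodic job, the sketch below rebuilds a combined read-only result from the reference-only documents. It is plain Python standing in for a map-reduce job, not MongoDB's map-reduce API; the data and the name `rebuild_customer_stores_view` are assumptions made for the example.

```python
# Hypothetical source collections: customers hold only store references.
stores = {"s1": {"name": "Safeway"}, "s2": {"name": "Walmart"}}
customers = {"c1": {"name": "Testy Tester", "store_ids": ["s1", "s2"]}}


def rebuild_customer_stores_view():
    """Batch job: join each customer with its stores into one read-only result."""
    return {
        customer_id: {
            "name": customer["name"],
            "stores": [
                {"_id": store_id, "name": stores[store_id]["name"]}
                for store_id in customer["store_ids"]
            ],
        }
        for customer_id, customer in customers.items()
    }


view = rebuild_customer_stores_view()          # scheduled run, e.g. nightly
stores["s2"]["name"] = "Walmart Supercenter"   # a change made after the run
stale_name = view["c1"]["stores"][1]["name"]   # readers still see the old name

view = rebuild_customer_stores_view()          # next scheduled run
fresh_name = view["c1"]["stores"][1]["name"]   # now the rename is visible
```

Readers hit only `view`, so there is a single lookup per customer, and nobody has to track where copies live: the next run simply overwrites them.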

Summary

Only store references to other documents. If you can afford stale data, use periodic map-reduce jobs to generate duplicate data. Avoid maintaining duplicate data; it's complex and error-prone.

Niels van der Rest
  • 1
    Ok, in general this makes sense to me. The only thing I'm not entirely clear on is with the map-reduce result situation you described, it seems to assume that ALL data requires the same freshness. In the example here, the User data has to be fresh but the user's store names data can be stale. So I wouldn't want to read the user data with the store data from a periodic map-reduce, because the user data can't be stale. Does that force me entirely into the 'fresh' scenario then? – Jim Oct 20 '10 at 14:48
  • 1
    @Jim: If part of the data, in this case the visited store names, can be stale, you can use [Gates VP's solution](http://stackoverflow.com/questions/3956756/document-databses-redundant-data-references-etc-mongodb-specifically/3961368#3961368). Just remember to update the Customer documents as well when you update a store name in Stores. – Niels van der Rest Oct 20 '10 at 14:58
  • @NielsvanderRest can you explain more about that map-reduce? – babak faghihian Nov 19 '15 at 06:53
  • I'm not familiar with the 'stale' and 'fresh' data terminology. What do these terms mean? – Hatshepsut Oct 05 '16 at 17:33
16

The answer here really depends on how current you need your data to be.

@Niels has a good summary here, but I think it's fair to note that you can "cheat".

Let's say that you want to display the Stores used by a User. The obvious problem here is that you can't "embed" the Store inside the User b/c the Store is too important on its own. But what you can do is embed some Store data in the User.

Just use the stuff you want for display like "Store Name". So your User object would look like this:

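{
  _id : ObjectId(),
  name : "Testy Tester",
  // Only the fields needed for display are copied from each Store;
  // each embedded _id is the referenced Store's own _id.
  stores : [
    { _id : ObjectId(), name : "Safeway" },
    { _id : ObjectId(), name : "Walmart" },
    { _id : ObjectId(), name : "Best Buy" }
  ]
}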

This way you can display the typical "grid" view, but require a link to get more data about the store.
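
The price of this "cheat" is that a store rename has to reach the embedded copies as well. A minimal sketch of that fan-out, using plain Python lists and dicts as hypothetical stand-ins for the collections (`rename_store` and the sample data are invented for the example):

```python
# Hypothetical data: the Store keeps its full record, users embed a display subset.
stores = {"s1": {"_id": "s1", "name": "Safeway", "address": "1 Main St"}}
users = [
    {"_id": "u1", "name": "Testy Tester",
     "stores": [{"_id": "s1", "name": "Safeway"}]},
    {"_id": "u2", "name": "Other User", "stores": []},
]


def rename_store(store_id, new_name):
    """Update the Store itself, then every embedded copy of its name."""
    stores[store_id]["name"] = new_name
    touched = 0
    for user in users:
        for embedded in user["stores"]:
            if embedded["_id"] == store_id:
                embedded["name"] = new_name
                touched += 1
    return touched  # number of embedded copies that were rewritten


touched = rename_store("s1", "Safeway Market")
```

If store names rarely change, this second pass runs rarely, which is why the approach suits display fields like a name better than fast-changing data.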

Gates VP
  • 5
    +1 This is a good approach when the data is periodically generated off of existing data. If you manually insert the extra data, you'll have to update it manually as well. Of course, this isn't a problem for things that are unlikely to change, such as store names. – Niels van der Rest Oct 18 '10 at 20:46
2

To answer your direct questions:

  1. No duplicates.
  2. No duplicates.

;)

The only duplicates you should ever have are "simple" values like weights (which may happen to be the same, but aren't any more efficient in either time or space to store separately), and ids referencing another object (which are duplicate values, but much smaller and more manageable than the duplicate object data they replace).

Now, to answer your scenario: what you want is a Many-to-Many relationship. The usual solution here is to make a third "through" or "bridge" table/collection, probably called StoreUsers:

StoreUsers
----------
storeuser_id
store_id
user_id

You add a record to this for each link between stores and users, whether it's for a different store, a different user, or a bunch of users in one store. You can then look this up independently, for either the Store, or the User. MongoDB advocates this approach too; it's not RDBMS-specific.
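
A small sketch of how such a bridge collection can be queried from either side. It uses a plain Python list as a hypothetical stand-in for StoreUsers; the field names follow the layout above, while the helper names and sample ids are assumptions.

```python
# Hypothetical StoreUsers records: one per (store, user) link.
store_users = [
    {"storeuser_id": 1, "store_id": "s1", "user_id": "u1"},
    {"storeuser_id": 2, "store_id": "s2", "user_id": "u1"},
    {"storeuser_id": 3, "store_id": "s1", "user_id": "u2"},
]


def stores_for_user(user_id):
    """All store ids linked to one user."""
    return [link["store_id"] for link in store_users if link["user_id"] == user_id]


def users_for_store(store_id):
    """All user ids linked to one store."""
    return [link["user_id"] for link in store_users if link["store_id"] == store_id]


print(stores_for_user("u1"))
print(users_for_store("s1"))
```

Neither the Store nor the User document changes when a link is added; the trade-off is that displaying names still needs a second lookup by id.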

Lee
  • 3
    Wait a minute! What is the difference between this and RDBMS then? – Vaibhav Apr 12 '16 at 20:47
  • 8
    Working on a big project with Mongo, I find this answer very disappointing if it's considered the right approach. Mongo is way slower than any relational database as soon as you use references. I'm trying to fix this by finding the best way to duplicate data, and all I see is people telling me to do as in a relational database ... So what is MongoDB for? – Laurent May 14 '16 at 08:39