
I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:

db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })

However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":

The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system, generate values within a single second, ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering of values, because client drivers generate ObjectId values, not the mongod process.

I then thought about querying with a timestamp:

db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })

However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.

Is there any sort of ranged query that would provide me with more stability?

  • why would you think that ObjectId() doesn't give you stability? – Asya Kamsky Jan 10 '14 at 09:01
  • As in my question, according to the docs the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order". –  Jan 10 '14 at 19:43

5 Answers


It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:

 db.tweets.find().limit(50).sort({"_id":-1});

This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:

 db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});

This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.

There is absolutely no need to worry about whether the _id value strictly corresponds to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first. You might even notice Twitter frequently displays tweets out of order; it's just not that critical.

If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
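The loop behind this scheme can be sketched in plain JavaScript (a simulation over an in-memory array rather than a live collection; `fetchPage` and the numeric ids are illustrative stand-ins for real ObjectIds):

```javascript
// Keyset ("range") pagination sketch: sort descending by id, then use the
// smallest id from the previous page as the upper bound for the next one.
function fetchPage(docs, pageSize, beforeId) {
  return docs
    .filter(d => beforeId === undefined || d.id < beforeId) // _id < <last _id>
    .sort((a, b) => b.id - a.id)                            // sort({_id: -1})
    .slice(0, pageSize);                                    // limit(pageSize)
}

const docs = Array.from({ length: 7 }, (_, i) => ({ id: i + 1 }));
const page1 = fetchPage(docs, 3);                             // ids 7, 6, 5
const page2 = fetchPage(docs, 3, page1[page1.length - 1].id); // ids 4, 3, 2
```

Documents inserted between the two calls never shift existing pages, because each page is anchored to a fixed _id value rather than a positional offset.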

Asya Kamsky
  • My main concern with using ObjectIDs for the range query was that they would not represent a strict order. If two documents are created using different MongoDB processes (I don't know when this would happen; perhaps on a sharded database), that would mean the middle two fragments of the ObjectID ("a 3-byte machine identifier, a 2-byte process id") would be different. I'm worried this sort of thing would throw off the ordering (and thus the range query). –  Jan 10 '14 at 19:55
  • yes, they do represent a strict order. ObjectId() will be sorted deterministically and consistently the same way. Think of it this way: if you are sorting "1,a,1" and "1,b,1" it doesn't matter that 1 is the same in both, because a and b differ, so those two will *always* be ordered the same relative to each other. – Asya Kamsky Jan 10 '14 at 22:24
  • Playing devil's advocate here. What if we were sorting "1,a,2" and "1,b,1"? The latter was created first (assuming the middle fragment is the process ID and the last fragment is the counter), but this is not how they will be sorted. I guess this is the extremely far edge case you mentioned? I only ask because it's good to understand! –  Jan 10 '14 at 23:04
  • My concern was that whole tweet documents would be missed out because of this edge case, but I think I realise now they will be included by the range query. The only real potential issue is the sorting might be incorrect. Right? –  Jan 10 '14 at 23:09
  • it's not possible to miss a document this way. And define "incorrect" sort order - if two tweets arrive within milliseconds of each other, is there really a "correct" order, and more importantly: "Does it matter?" – Asya Kamsky Jan 10 '14 at 23:10
  • Makes perfect sense now. Thank you very much! –  Jan 10 '14 at 23:11
  • This method is vulnerable to missing documents created within the same second (or within a few seconds with time drift) but it should be good enough. – Daniel Williams Jan 10 '14 at 23:42
  • No, it is *not* vulnerable to missing any documents, except the ones that are created *after* the query is executed, but that's normal, as things that are in the future will always be missed. Note there is no time criteria in my query, I'm getting all the tweets starting with the most recent one, regardless of the timestamp. – Asya Kamsky Jan 10 '14 at 23:49
  • That is the point. The next page will be read after the first query has executed, so you will miss tweets if you are very concerned about keeping a correct, stable order in your pages. Also, if you move to the previous page you may see tweets inserted between already-seen tweets, which may or may not be acceptable (it is usually very poor UX, though). – Daniel Williams Jan 11 '14 at 00:31
  • @AsyaKamsky: the reason I've discarded sorting by `_id` and suggested an aggregate field instead was my concern for the order tweets get imported and added to the collection. E.g. if this done in batches (polling every `t` etc.), how can you guarantee *any* relationship between `_id` and the tweet timestamp? Granted, we're talking about inconsistencies not exceeding `t`, but even that no longer qualifies as precise sorting. – Oleg Jan 11 '14 at 01:13
  • @o.v. let's not lose perspective here - we are sorting tweets, not auction bids or financial trades... – Asya Kamsky Jan 11 '14 at 09:16
  • Quick question: `<50th _id>` - are you using this as some kind of placeholder for the 50th _id, or is it a special querying parameter? I couldn't find any other examples querying like that on Google; it isn't even in the Mongo docs for range querying. – Zion Nov 07 '15 at 04:31
  • It's where you would put the value of the last _id you got from the previous batch - so a placeholder, not special syntax. Generally < > means "fill in appropriate value here". – Asya Kamsky Nov 08 '15 at 15:53
  • how can I use a different sort field, or even two fields to sort? – Sibelius Seraphini Apr 18 '16 at 00:30
  • you should probably start a different question if your problem is substantially different from this one. If it's extremely similar then just use the field you do have in place of _id in the examples. – Asya Kamsky Apr 18 '16 at 12:34

Wouldn't a tweet's "actual" timestamp (i.e. the time tweeted, and the criteria you want it sorted by) be different from its "insertion" timestamp (i.e. the time it was added to your local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in the correct order), you wouldn't be able to rely on just $natural or ObjectID for sorting logic.

Mongo docs suggest skip and limit for paging:

db.tweets.find({created: {$lt: maxDate}}).
          sort({created: -1, username: 1}).
          skip(50).limit(50); //second page

There is, however, a performance concern when using skip:

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.

This happens because skip does not fit into the MapReduce model and is not an operation that would scale well: you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method, as it applies a similar constraint "from the other end"; however, with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.

An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:

db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
          sort({created: -1, username: 1}).
          limit(50); //next page

Performing a find condition like this can be easily parallelized. But how do you deal with pages other than the next one? You don't know the begin date for page number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider a compound index on timestamp + an account tie-breaker (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:

db.pages.
find({pagenum: 3})
> {pagenum: 3, begin: "01-01-2014@BillGates", end: "03-01-2014@big_ben_clock"}

db.tweets.
find({_sortdate: {$lt: "03-01-2014@big_ben_clock", $gt: "01-01-2014@BillGates"}}).
sort({_sortdate: -1}).
limit(50) //third page

Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like

{
  _id: ...,
  created: ...,    //to be used in markup
  user: ...,    //also to be used in markup
  _sortdate: "01-01-2014@BillGates" //sorting only, use date AND time
}
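One way the `_sortdate` value could be generated at insert time (a sketch; `sortKey` is a hypothetical helper, and it uses ISO-8601 date-and-time strings rather than the `DD-MM-YYYY` shown in the examples above, so that plain string comparison matches chronological order):

```javascript
// Build a lexicographically sortable "timestamp@username" key.
// ISO-8601 timestamps sort correctly as strings, and appending the
// username breaks ties between tweets created at the same instant.
function sortKey(created, username) {
  return created.toISOString() + '@' + username;
}

const a = sortKey(new Date('2014-01-01T00:00:00Z'), 'BillGates');
const b = sortKey(new Date('2014-01-03T00:00:00Z'), 'big_ben_clock');
// a < b as strings, so a range query { _sortdate: { $gt: a, $lt: b } }
// selects everything tweeted between the two.
```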
Oleg
  • I thought it was bad practice to use `skip`? http://docs.mongodb.org/manual/reference/method/cursor.skip/ –  Jan 09 '14 at 10:33
  • @OliverJosephAsh: it is expensive, but it seems like the only way to achieve pagination (the example is even in the docs). I can't think of a way to use range based paging in your example (i.e. a query for "next 10" results is trivial, but how would you calculate a correct range for jumping to page, say, 55?). I'll expand my answer shortly. – Oleg Jan 09 '14 at 20:48
  • Jumping to a page isn't something I require so much of. I just need to be able to say "give me the next 50 documents". Sorry if I confused things by calling it pagination. –  Jan 09 '14 at 22:54
  • @OliverJosephAsh: I hope you're not using one of those "infinite scroll" pages :) Anyway I've updated my answer with a few alternatives, power to you – Oleg Jan 10 '14 at 00:15
  • It is absolutely fine to use skip (just not for skipping 70% of your collection or anything like that). @o.v. I have no idea what you mean by bringing MapReduce into this - it has *nothing* to do with querying. Unique indexes are also completely inappropriate for this. – Asya Kamsky Jan 10 '14 at 09:08
  • @AsyaKamsky What about something like a "tweets" collection which contains tweets from one user? My query could be for tweets from a specific or a selection of users. In that case it's quite likely I will be skipping a large proportion of the collection. –  Jan 10 '14 at 10:17
  • @AsyaKamsky: please correct and/or elaborate: 1) In terms of performance, `skip` is never a preferred option over selecting documents by range. 2) The engine leverages MapReduce pattern with `find` being effectively syntactic sugar for "map on criteria x", and `limit` - "reduce to an array of length not exceeding y". This is unlike `skip`, which cannot be performed as a reduce operation since it cannot be sensibly performed on partial collections. 3) An aggregate field would satisfy the sorting requirement perfectly, wouldn't it make sense to set it up as an index for performance? – Oleg Jan 10 '14 at 11:52
  • @OliverJosephAsh: if you're talking about `db.tweets.find({user: 'oliver'}).skip(9000)`, you're not actually skipping non-oliver users. I speculate that performance *will* be affected on larger collections as filtering takes longer, because `skip` will have to wait for that to complete, which is the real bottleneck. A good analogy is the sql anti-pattern where a developer selects `*` from the db and then performs sorting/filtering/limiting with server-side code - such a query takes longer to complete, transfer to and get processed by another process, and would use much more memory. – Oleg Jan 10 '14 at 12:22
  • skip is a poor choice for paging except under some specific conditions. It is essentially like using LIMIT X,Y in MySQL, and you can find quite a few articles about why that is a poor choice. – Daniel Williams Jan 10 '14 at 19:57
  • @OliverJosephAsh you will not be using skip() to skip other users' tweets, you will have a query which says "this user's tweets only" and on *that* set of results you do your sort, limit or sort, skip, limit. – Asya Kamsky Jan 10 '14 at 22:27
  • @o.v. you are completely wrong when you conflate MapReduce (which runs in javascript thread outside of mongodb engine) with find/query which runs inside the server which is all C++ code. The DB engine does not ever leverage MapReduce for anything (only explicit calls to JS will do that) and find is NOT syntactic sugar for map/reduce - that would be some other key/value stores, not MongoDB. I would really recommend that you remove your answer as it is not based on how MongoDB works. – Asya Kamsky Jan 10 '14 at 22:32
  • @AsyaKamsky: Just to be clear, I'm **not** talking about [`mapReduce` method](http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/) *which runs in javascript thread outside of mongodb engine*, I'm referring to [`MapReduce design patterns`](http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/) and mongo leveraging them in its implementation (i.e. grepping pattern for `find` method). If this isn't a case of homonymous terms getting confused, I'd really like to be pointed to an authoritative overview of engine mechanics and correct my understanding of the subject – Oleg Jan 11 '14 at 00:59
  • There is no such framework or design pattern in mongodb engine - mongodb querying is patterned after RDBMS querying, along with B-tree indexes and ability to specify multiple filters, sort, limit and skip which all use indexes if available. If not they do what RDMBS do - table scan. – Asya Kamsky Jan 11 '14 at 09:15
  • @o.v. What are other kosher ways to deal with the situation ? – Anand May 26 '17 at 06:56

The following approach will work even if multiple documents are inserted/updated in the same millisecond, even from multiple clients (which generate the ObjectIds). For simplicity, the following queries project only _id and lastModifiedDate.

  1. For the first page, fetch the results sorted by lastModifiedDate (descending), _id (ascending):

    db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)

Note down the _id and lastModifiedDate of the last record fetched on this page (call them loid and lmd).

  2. For the second page, include a query condition to search for (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):

    db.product.find({$or:[{"lastModifiedDate":{$lt:lmd}},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)

Repeat the same for subsequent pages.
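Building that second-page filter in application code might look like this (a minimal sketch; `nextPageFilter` is a hypothetical helper, and `lmd`/`loid` are taken from the last document of the previous page):

```javascript
// Tie-break pagination on (lastModifiedDate DESC, _id ASC): the next page
// contains strictly older documents, OR documents with the same timestamp
// but a larger _id than the last one already seen.
function nextPageFilter(lmd, loid) {
  return {
    $or: [
      { lastModifiedDate: { $lt: lmd } },
      { $and: [{ lastModifiedDate: lmd }, { _id: { $gt: loid } }] }
    ]
  };
}

const filter = nextPageFilter(5, 'abc'); // pass to find() with the same sort
```

Because (lastModifiedDate, _id) is unique per document, this compound boundary cannot skip or duplicate documents the way a bare timestamp boundary can.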

Poorna

ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.

Update:

To query the previous second of ObjectIds you will need to construct an ObjectID manually.

See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/

Try using this expression to do it from a mongos.

{ _id : 
  {
      $lt : ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16)+"ffffffffffffffff")
  }

}

The 'f's at the end are there to max out the possible random bits that are not associated with the timestamp, since you are doing a less-than query.

I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
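The hex string passed to `ObjectId()` can be assembled in plain JavaScript (a sketch; `maxObjectIdHexBefore` is a hypothetical helper - the first 8 hex digits are the second-resolution timestamp, and the trailing sixteen `f`s max out the machine, process, and counter bytes):

```javascript
// Build a 24-char ObjectId hex string representing "everything created
// strictly before the given Unix timestamp (in seconds)".
function maxObjectIdHexBefore(unixSeconds) {
  const ts = Math.floor(unixSeconds - 1); // back off one second
  return ts.toString(16).padStart(8, '0') + 'ffffffffffffffff';
}

// e.g. for the second after timestamp 0x52ccb2ff:
const hex = maxObjectIdHexBefore(0x52ccb2ff + 1);
// 8 timestamp digits + 16 'f's = 24 hex chars = 12 bytes
```

The `padStart` keeps the timestamp portion exactly 8 digits, which matters for dates before 2004 or in tests with small timestamps.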

Daniel Williams
  • Interesting, thank you. How would you query with an ID, minus one second? –  Jan 06 '14 at 22:59
  • It might be easier to just use the date as I am already storing that? If I did do it as you demonstrated above, I imagine I would need to substitute the input date for the timestamp of the ObjectID I want it to be less than? –  Jan 06 '14 at 23:30
  • Yes you would need to substitute the date with the timestamp of the ObjectId. Using the stored date you have would be adequate if there are very few subsecond tweets or else the ObjectId is your best bet. – Daniel Williams Jan 06 '14 at 23:37
  • I don't see any difference between using a stored date or using the timestamp from an ObjectID. Surely they both could introduce "subsecond weirdness"? –  Jan 08 '14 at 12:43
  • If you aren't also sorting your results by ObjectId then there should be subsecond weirdness using just the date constraint. Also using date won't let you paginate if you have enough tweets created in one second to exceed your results per page. It's an edge case. – Daniel Williams Jan 08 '14 at 21:10
  • I'm afraid I don't follow what you're saying. I still don't see any difference between a stored date and a timestamp extracted from an ObjectId. If you could provide any more detail, that would be appreciated. Otherwise I know now that the only way to achieve paging without these edge cases is to implement my own ID generation system. –  Jan 09 '14 at 00:43
  • Why would there be any "sub-second weidness"? ObjectId() are well ordered. – Asya Kamsky Jan 10 '14 at 09:08
  • Can you explain the difference between using a stored date versus a timestamp extracted from an ObjectID for the query? –  Jan 10 '14 at 19:51
  • There are a few differences. Your sharding configuration might be an issue if you have a large cluster. When querying on _id mongo will cheat and use internal optimizations and you must always query on your shard key to be efficient so if you query by _id and shard by _id you will be ok. (Sharding by _id isn't usually a good idea though). However the sort matters and _id is already indexed so you will need to add another index for your stored timestamp (which is really redundant if you use ObjectIDs). – Daniel Williams Jan 10 '14 at 19:54
  • time stamp will have greater likelihood of collision - many tweets will be created on the same millisecond, ObjectId() uses other properties to ensure its uniqueness. Since they are unique and well ordered, you can sort by them deterministically, which is not the case for timestamps that are not unique. – Asya Kamsky Jan 10 '14 at 22:26
  • I would really recommend this approach and adjust the timestamp differential possibly to as much as 30 seconds instead of 1 second to ensure your pages will always be fully formed and you won't see any weirdness. If you are worried about time drift on your machines you should also look into it since there are many good linux services to handle it and mongo will misbehave if the time drift is radical enough or changes too quickly. – Daniel Williams Jan 10 '14 at 22:36
  • I don't understand why you would include second or 30 seconds in the query - why wouldn't you just start with most recent and display in reverse order? I don't see how pages could not be "fully formed" unless you simply didn't have as many as 50 tweets in which case you don't have to worry about pagination. – Asya Kamsky Jan 10 '14 at 23:12
  • ObjectIds are not a guaranteed order. If your machines have a time drift they may be significantly out of order since the most significant bits are a second resolution timestamp. Restricting by some fixed amount of time in the past is to guarantee pages are fully formed since in well maintained systems a time drift greater than 30 seconds is unheard of. – Daniel Williams Jan 10 '14 at 23:38
  • it's irrelevant if the exact time represented by ObjectId is "incorrect" - if a tweet is displayed as being several seconds later than it actually was, it doesn't matter. The order will be preserved. – Asya Kamsky Jan 10 '14 at 23:50
  • The point is that the order is unstable during a small (or sometimes large) time window. If this were a sorted set you would see [a,b] then maybe [a,b,c] then maybe [a,d,b,f,c] over time. ObjectIds do not guarantee an order like an auto increment. They only try to guarantee uniqueness in a distributed system. – Daniel Williams Jan 10 '14 at 23:55
  • Once object ids are created they are fixed, immutable, represent some point in time and are well ordered - so the sort order is deterministic and repeatable. If it's not 100% identical to actual tweet order (by seconds or even less) it's not really important in the given use case. If you are concerned that an ObjectId() that's "out of order" will come from a system that has clock skew then the correct solution to that would be to use ntpd or equivalent to keep clocks in sync. Using a timestamp to any granularity won't help if you have clock skew anyway, you will be prone to the same "issue". – Asya Kamsky Jan 12 '14 at 08:55
  • They are not well ordered, and that is the point. Within a single second, the ordering is unstable. Across multiple seconds they may also be unstable due to time drift and latency. It says exactly this in the official documentation.... – Daniel Williams Jan 13 '14 at 11:19

I have built pagination using the MongoDB _id this way:

// import { ObjectId } from 'mongodb'
let sortOrder = -1;
let query = {}; // find() takes a filter document, not an array
if (prev) {
    sortOrder = 1;
    query = { title: 'findTitle', _id: { $gt: ObjectId('_idValue') } };
}

if (next) {
    sortOrder = -1;
    query = { title: 'findTitle', _id: { $lt: ObjectId('_idValue') } };
}

// when paging backwards (prev), reverse the results before display,
// since they come back in ascending _id order
db.collection.find(query).limit(10).sort({ _id: sortOrder });