6

I am running a batch operation over all rows in a database. This involves selecting every single model and doing something to it. It makes sense to split this into chunks and do it chunk by chunk.

I'm currently using Paginator, because it's convenient. This means that I need an ordering on the values so that they can be paged through in order. This generates SQL statements with ORDER BY and LIMIT/OFFSET clauses, and for each chunk I think Postgres may be sorting the whole table (although I can't claim any knowledge of the internals). All I know is that the database is at around 50% CPU, and I think that's too high just to be doing selects.
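
Roughly, the current loop looks like this (a minimal sketch rather than my actual code; `SomeModel`, the page size, and `process()` are placeholders):

from django.core.paginator import Paginator

# Paginator-based chunking: each page issues a query roughly like
#   SELECT ... FROM some_model ORDER BY id LIMIT <page_size> OFFSET <n>;
paginator = Paginator(SomeModel.objects.order_by('id'), 1000)
for page_number in paginator.page_range:
    for obj in paginator.page(page_number).object_list:
        process(obj)  # placeholder for the per-row work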

What's the best way to iterate over the entire table in an RDBMS/CPU-friendly way?

Assuming that the contents of the database don't change during the batch operation.

Charles
Joe

2 Answers

6

From your description you don't actually care about the sort order of the rows you process. If you have primary keys in your tables (which I expect!), this crude method of partitioning would be much faster:

SELECT * FROM tbl WHERE id BETWEEN 0    AND 1000;
SELECT * FROM tbl WHERE id BETWEEN 1001 AND 2000;
...

This performs the same no matter how far into the table the chunk lies, and (almost) the same for any size of table. Retrieve the min and max of your primary key and partition accordingly:

SELECT min(id), max(id) from tbl; -- then divide in suitable chunks

As opposed to:

SELECT * FROM tbl ORDER BY id LIMIT 1000;
SELECT * FROM tbl ORDER BY id LIMIT 1000 OFFSET 1000;
...

This is generally slower because all rows have to be sorted, and performance degrades further with higher offsets and bigger tables.

Erwin Brandstetter
  • This assumes that the records are returned in the same order without a `sort` clause. Is this correct? Also, if I have a default sorting in my `Meta` class can I somehow remove it for the query? – Joe Jan 03 '12 at 12:41
  • @Joe: Basically you get the same records, but unsorted. If there are gaps in your ID-space, the number of records returned can be less than expected for each call, while with LIMIT / OFFSET you get a fixed number of sorted rows (except for the last call per table). I don't know how to handle the `Meta` class, but you *need* to order your rows for LIMIT / OFFSET. – Erwin Brandstetter Jan 03 '12 at 14:09
  • Erwin, I'm really sorry I didn't read your answer correctly. Are you sure this is faster? the `between` clause can surely only work if either the ids are already sorted, or it performs a whole table scan each time? – Joe Jan 03 '12 at 15:27
  • @Joe: There is no natural order in a table. Of course, if the physical order of rows matches the index, this may speed up the operation as fewer blocks have to be read. You may be interested in [`CLUSTER`](http://www.postgresql.org/docs/current/interactive/sql-cluster.html). And no, my query will result in an *index scan*, if `id` is indexed, but only if the chunk you read is a small portion of the table. The query planner decides which will be faster: index or seq scan. Just test with [`EXPLAIN ANALYZE`](http://www.postgresql.org/docs/current/interactive/sql-explain.html) and see for yourself. – Erwin Brandstetter Jan 03 '12 at 15:39
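
Following up on the `EXPLAIN ANALYZE` suggestion in the last comment: you can also check the plan from the Django side. A minimal sketch, assuming Django 2.1+ (which added `QuerySet.explain()`); `SomeModel` and the id range are placeholders:

qs = SomeModel.objects.order_by().filter(id__range=(0, 1000))
print(qs.query)                   # the SQL Django will send for one chunk
print(qs.explain(analyze=True))   # EXPLAIN ANALYZE output from PostgreSQL
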
5

The following code implements Erwin's answer above (using `BETWEEN`) for an arbitrary Django QuerySet. It defaults to assuming 'id' is a suitable field to use for the `BETWEEN` clause.

from django.db import models


def chunked_queryset(qs, batch_size, index='id'):
    """
    Yields a queryset split into batches of maximum size 'batch_size'.
    Any ordering on the queryset is discarded.
    """
    qs = qs.order_by()  # clear ordering
    min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
    min_id, max_id = min_max['min'], min_max['max']
    if min_id is None:
        return  # empty queryset, nothing to yield
    for i in range(min_id, max_id + 1, batch_size):
        # Equivalent to WHERE id BETWEEN i AND i + batch_size - 1
        filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
        yield qs.filter(**filter_args)

It would be used like this:

for chunk in chunked_queryset(SomeModel.objects.all(), 20):
    # `chunk` is a queryset
    for item in chunk:
        # `item` is a SomeModel instance
        pass

You could also change the interface so that you don't need the extra nested loop, but can just do `for item in chunked_queryset(qs)`:

from django.db import models


def chunked_queryset(qs, batch_size, index='id'):
    """
    Yields individual model instances, fetched from the database in batches.
    Any ordering on the queryset is discarded.
    """
    qs = qs.order_by()  # clear ordering
    min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
    min_id, max_id = min_max['min'], min_max['max']
    if min_id is None:
        return  # empty queryset, nothing to yield
    for i in range(min_id, max_id + 1, batch_size):
        # Equivalent to WHERE id BETWEEN i AND i + batch_size - 1
        filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
        for item in qs.filter(**filter_args):
            yield item
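
With this version, usage is simply (again, `SomeModel` is a placeholder):

for item in chunked_queryset(SomeModel.objects.all(), 20):
    # `item` is a SomeModel instance
    pass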
spookylukey