Django design pattern for web analytics screens that take a really long time to calculate

Question

I have an "analytics dashboard" screen, visible to my Django web application's users, that takes a really long time to calculate. It's one of those screens that goes through every single transaction in the database for a user and gives them metrics on it.

I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed; it gives averages across transactions).

The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?

score 14 · Accepted Answer · answered Jul 31 '11 at 12:11

What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.

Offline Processing

There are two widely-used approaches to work which needs to happen outside the request-response cycle:

Cron jobs (often made easier via a custom management command)
Celery

In relative terms, cron is simpler to set up, and Celery is more powerful and flexible. That said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to set up.

Cron

Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your Django project: your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.

If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to set up a proper Django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.

Another aspect to consider with cron jobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cron job like this every hour. You can roll your own locking mechanism to prevent this; just be aware that what starts out as a short cron job might not stay that way when your data grows or your RDBMS misbehaves.
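As one possible shape for such a lock, here is a minimal sketch using an advisory file lock from the standard library. It assumes a Unix-like host (the fcntl module is not available on Windows); single_instance and the lock path are hypothetical names, not part of Django or cron.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this run got the lock, False if another run holds it."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    finally:
        handle.close()
```

A management command could wrap its body in `with single_instance("/tmp/averages.lock") as acquired:` and return immediately when `acquired` is false, so an overlapping run exits instead of piling up behind the previous one.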

In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regard to who is actually using the system. This is where Celery can help.

Celery

…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:

CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True

Voilà, no need for an AMQP broker.
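The settings above use the setting names of the Celery releases available when this was written. As a hedged sketch for newer releases (Celery 4 and later), Redis is usually given as a URL for both the broker and the result backend. The fragment assumes the Celery app loads Django settings with app.config_from_object("django.conf:settings", namespace="CELERY"), and the host, port and database number are placeholders.

```python
# settings.py (sketch; assumes namespace="CELERY" when the app loads settings)
CELERY_BROKER_URL = "redis://localhost:6379/0"      # where queued tasks live
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"  # where task results are stored
```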

Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.

If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:

User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
- If yes, use that.
- If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication that the chart is stale, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.

You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
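The flow above can be sketched without Django or Celery to show its moving parts. Everything here is a stand-in: TTLCache imitates a cache backend with expiry, and load_latest and enqueue_recalc are hypothetical callables for "query the latest stored averages" and "queue the background task". The pending marker is an addition of this sketch, not part of the flow as described; it keeps repeated page loads from queueing the same recalculation over and over.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache backend with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return default
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, self._clock() + timeout)

    def add(self, key, value, timeout):
        """Store only if the key is absent; return True if it was stored."""
        if self.get(key) is not None:
            return False
        self.set(key, value, timeout)
        return True


def averages_for_page(user_id, cache, load_latest, enqueue_recalc):
    """Return (averages, is_stale) for the dashboard view."""
    fresh = cache.get("averages_%s" % user_id)
    if fresh is not None:
        return fresh, False
    # Cache miss: queue at most one recalculation per user per minute.
    if cache.add("averages_pending_%s" % user_id, True, 60):
        enqueue_recalc(user_id)
    # Meanwhile, serve whatever was last stored and flag it as stale.
    return load_latest(user_id), True
```

In a Django project the same guard can be written with the cache framework's own add(), which likewise stores a key only when it is not already present.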

Caching

The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend. I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.

Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:

# mytasks.py
import pickle

from celery.task import task  # older Celery API; newer releases provide celery.shared_task
from django.core.cache import cache

from myapp.models import TransactionAverage  # assumed location of the model


@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series
    averages = TransactionAverage.objects.filter(user=user_id, ...)
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)


# views.py
import pickle

from django.core.cache import cache
from django.shortcuts import render

from myapp.models import TransactionAverage  # assumed location of the model
from mytasks import calculate_stuff


def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process.
        ctx['stale_chart'] = True  # display a warning, if you like

    ctx['averages'] = averages
    # ... do your other work ...
    # a view must return the response, not just build it
    return render(request, 'my_template.html', ctx)

Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.

score 3 · Answer 2 · answered Jul 31 '11 at 03:47

The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as things are updated, so that when the user opens the dashboard you calculate nothing and just display the already calculated values.

There are various ways to do that, but the generic concept is to trigger a calculation function in the background whenever something the calculation depends on changes.

For triggering such calculations in the background I usually use Celery. Suppose the user adds an item foo in the view view_foo: we call a Celery task update_foo_count, which runs in the background and updates the foo count. Alternatively, you can have a periodic Celery task that runs, say, every 10 minutes and checks whether a recalculation needs to be done; the recalculate flag can be set wherever the user updates data.
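Both variants can be illustrated in plain Python. This is only a sketch under stated assumptions: the module-level dict and set stand in for a per-user summary table and a "needs recalculation" flag in the database, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    """Running totals, so reading the average never scans all transactions."""
    count: int = 0
    total: float = 0.0

    def add_transaction(self, amount):
        self.count += 1
        self.total += amount

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


stats_by_user = {}   # stand-in for a per-user summary table
dirty_users = set()  # stand-in for a "needs recalculation" flag


def record_transaction(user_id, amount):
    """Variant 1: update the stored aggregate as each transaction arrives."""
    stats_by_user.setdefault(user_id, UserStats()).add_transaction(amount)


def mark_dirty(user_id):
    """Variant 2: only note that this user's numbers are out of date."""
    dirty_users.add(user_id)


def recalculate_dirty(recalculate):
    """Periodic job: recompute flagged users, then clear the flags."""
    done = sorted(dirty_users)
    for user_id in done:
        recalculate(user_id)
    dirty_users.clear()
    return done
```

The first variant suits metrics that can be updated incrementally, such as counts, sums and averages. The second fits metrics that need a full recomputation, while still skipping users whose data has not changed.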

score 1 · Answer 3 · answered Jul 31 '11 at 02:18

You need to have a look at Django’s cache framework.

Dominik

I need to fix/avoid even the initial 20-30 second load time. So I need a backend caching design pattern. – MikeN Jul 31 '11 at 02:21

score 0 · Answer 4 · answered Jul 31 '11 at 10:45

If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.

Max Peterson
