4

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.

The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.

As I say, I'm wondering whether I'm doing anything wrong (e.g. are the batches I commit too big? Are there too many foreign keys?), or whether I should just move away from Django models for this task and use the DB API directly. Any ideas or suggestions?

Here are the basic models I'm dealing with when importing data (I've removed some of the fields from the original code for the sake of simplicity):

class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release) 

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))

And here's the bit of code that seems to take ages to complete:

@transaction.commit_manually
def add_translations(translation_data, lp_translation):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                            sourcepackage=lp_translation['sourcepackage'],
                            template_name=lp_translation['template_name'],
                            translation_domain=\
                                lp_translation['translation_domain'],
                            release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )

            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                                    template=template,
                                    language=language,
                                    release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

            transaction.commit()
David Planella
  • Have a look at this recent answer, which may have some useful generic tips. http://stackoverflow.com/questions/9407442/optimise-postgresql-for-fast-testing/9407940#comment11914305_9407940 – Craig Ringer Feb 24 '12 at 03:11
  • Second part of the question was split out to a [follow-up question here](http://stackoverflow.com/q/9447506/939860) – Erwin Brandstetter Mar 05 '12 at 01:36

1 Answer

5

Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. And inserting / updating one row at a time is, again, a lot slower than processing everything at once.
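
To illustrate the difference (hypothetical table and values), even a single multi-row INSERT avoids most of the per-row round trips and overhead; COPY, below, goes further still:

INSERT INTO tbl (id, col_a) VALUES
  (1, 10)
, (2, 20)
, (3, 30);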

If the import files are available locally on the server, you can use COPY. Otherwise you can use the meta-command \copy of the standard client psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.

If you just want to add new rows to a table:

COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
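
If the file is only available on the client machine, the psql meta-command mentioned above takes the same shape; the path is then local to where psql runs:

\copy tbl FROM 'local/path/to/file' CSV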

Or if you want to INSERT / UPDATE some rows:

First off: allot enough RAM for temp_buffers (at least temporarily, if you can spare it) so the temp table does not have to be written to disk. Be aware that this setting must be changed before any temporary tables are accessed in the session.

SET LOCAL temp_buffers='128MB';
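
Note that SET LOCAL only takes effect inside a transaction block, so the whole import would run as a single transaction, roughly along these lines:

BEGIN;
SET LOCAL temp_buffers = '128MB';
-- CREATE TEMP TABLE, COPY and INSERT / UPDATE as described below
COMMIT;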

The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess; just do a test run and measure it:

SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));

Create the temporary table:

CREATE TEMP TABLE tmp_x (id int, col_a int, col_b text);

Or, to just duplicate the structure of an existing table:

CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;

Copy values (should take seconds, not hours):

COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);

From there, INSERT / UPDATE with plain old SQL. Since you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:

ANALYZE tmp_x;
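
For example, if later statements look up rows by id (as in the UPDATE below), an index there might help (the index name is arbitrary):

CREATE INDEX tmp_x_id_idx ON tmp_x (id);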

For instance, to update existing rows, matched by id:

UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
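
For the asker's case of filling in the foreign keys, the natural shape is an INSERT ... SELECT with joins. A rough sketch only: it assumes tmp_x holds the flattened JSON columns and guesses Django's default <app>_<model> table names, so all identifiers below would need adapting:

INSERT INTO app_translation (template_id, language_id, release_id, translated)
SELECT te.id, la.id, te.release_id, x.translated
FROM   tmp_x x
JOIN   app_template te ON te.sourcepackage = x.sourcepackage
                      AND te.template_name = x.template_name
                      AND te.translation_domain = x.translation_domain
JOIN   app_language la ON la.code = x.language;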

Finally, drop the temporary table:

DROP TABLE tmp_x;

Or have it dropped automatically at the end of the session.
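
Temporary tables are dropped automatically when the session ends. If the whole import runs in a single transaction anyway, you can declare that up front with ON COMMIT DROP:

CREATE TEMP TABLE tmp_x (id int, col_a int, col_b text) ON COMMIT DROP;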

Erwin Brandstetter
  • Thanks, `COPY` seems definitely interesting! Regarding inserting one row at a time, I would have thought `@transaction.commit_manually` would take care of batching the inserts when calling `transaction.commit()`. In any case, the source JSON files are indeed available to the server, and I can easily convert them to CSV, so I should be able to use it. However, after reading the `COPY` documentation, I'm not sure how I can use `COPY` and plain SQL to replicate the functionality of what I'm doing in the loop, in particular adding the foreign key value for 'template' to the 'translation' table. – David Planella Feb 24 '12 at 06:15
  • I could think of: 1) importing all the data into a temporary table with `COPY`, 2) using a separate SQL query to copy the non-foreign-key data from the temp table to the main 'translation' table and, while at it, looking up the 'template' foreign key from the 'sourcepackage', 'template_name' and 'translation_domain' values in the temp table and writing the appropriate 'template' reference to the main table. I'm still not sure how to write the SQL query in 2), but does that sound like a sensible approach? – David Planella Feb 24 '12 at 06:18
  • Yes, it certainly does. Should be possible in an SQL statement with JOINs. I added some more information to optimize the performance. – Erwin Brandstetter Feb 24 '12 at 07:00
  • I'm still not sure how to exactly write the SQL query with the JOINs, but that's probably subject for a separate question. I'm accepting the answer because it pointed me in the right direction and contains very useful info. Thanks! – David Planella Feb 24 '12 at 08:10