
I've just started using the luigi library. I am regularly scraping a website and inserting any new records into a Postgres database. As I'm trying to rewrite parts of my scripts to use luigi, it's not clear to me how the "marker table" is supposed to be used.

Workflow:

  1. Scrape data
  2. Query DB to check if new data differs from old data.
  3. If so, store the new data in the same table.

However, using luigi's postgres.CopyToTable, if the table already exists, no new data will be inserted. I guess I should be using the inserted column in the table_updates table to figure out what new data should be inserted, but it's unclear to me what that process looks like and I can't find any clear examples online.

durrrutti

1 Answer


You don't have to worry much about the marker table: it's an internal table luigi uses to track which tasks have already run successfully. To do so, luigi uses the update_id property of your task. If you don't define one, luigi falls back to the task_id, which is built from the task family name, the values of the first three parameters, and a hash of all the parameters.
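As a rough illustration of that bookkeeping, here is a minimal simulation of a marker table. It uses sqlite3 so that it runs without a Postgres server, and it is not luigi's actual implementation. The `table_updates` name and `inserted` column come from the question; the other columns and the helper names (`already_done`, `mark_done`, `run_once`) are assumptions made for this sketch.

```python
import sqlite3

# Stand-in for the marker table; sqlite3 is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_updates ("
    " update_id TEXT PRIMARY KEY,"
    " target_table TEXT,"
    " inserted TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)


def already_done(update_id):
    """Return True if a row for this update_id exists in the marker table."""
    row = conn.execute(
        "SELECT 1 FROM table_updates WHERE update_id = ?", (update_id,)
    ).fetchone()
    return row is not None


def mark_done(update_id, target_table):
    """Record a successful run, as a task would after copying its rows."""
    conn.execute(
        "INSERT INTO table_updates (update_id, target_table) VALUES (?, ?)",
        (update_id, target_table),
    )
    conn.commit()


def run_once(update_id, target_table):
    """Run the (hypothetical) copy step only if no marker exists yet."""
    if already_done(update_id):
        return "skipped"
    # ... the actual copy into target_table would happen here ...
    mark_done(update_id, target_table)
    return "ran"
```

Calling `run_once` twice with the same update_id skips the second run, which is why a task whose update_id never changes will not insert new data.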

The key here is to override the update_id property of your task and return a custom string that you know will be unique for each run of your task. Usually you should build it from the significant parameters of your task, something like:

@property
def update_id(self):
    # str.join takes a single iterable of strings, so convert each parameter first
    return ":".join(str(p) for p in (self.param1, self.param2, self.param3))

By significant I mean parameters that change the output of your task. I imagine parameters like the website URL or ID, and the scraping date. Parameters like the hostname, port, username or password of your database will be the same for all of these tasks, so they shouldn't be considered significant.
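To make that concrete, here is a small sketch of a helper that builds such a string from two significant values. The names `build_update_id`, `site_id` and `scrape_date` are hypothetical and only serve the illustration; your own task will have different parameters.

```python
import datetime


def build_update_id(site_id, scrape_date):
    """Join the significant parameter values into one marker-table key.

    Connection settings (host, port, user, password) are deliberately
    left out because they do not change what the task produces.
    """
    return ":".join([str(site_id), scrape_date.isoformat()])
```

Inside a task, the update_id property could then return something like `build_update_id(self.site_id, self.date)`, assuming the task declares parameters with those names. A new scraping date yields a new update_id, so the task is no longer treated as already done.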

Note that without details about your tables and the data you're trying to save, it's pretty hard to say how you should build that update_id string, so please be careful.

matagus