4

I have recently started playing around with Luigi, and I would like to find out how to use it to continuously append new data into an existing target file.

Imagine I am pinging an api every minute to retrieve new data. Because a Task only runs if the Target is not already present, a naive approach would be to parameterize the output file by the current datetime. Here's a bare bones example:

import luigi
import datetime

class data_download(luigi.Task):
    date = luigi.DateParameter(default = datetime.datetime.now()) 

    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("data_test_%s.json" % self.date.strftime("%Y-%m-%d_%H:%M"))

    def run(self):
        data = download_data()
        with self.output().open('w') as out_file:
            out_file.write(data + '\n')

if __name__ == '__main__':
    luigi.run()

If I schedule this task to run every minute, it will execute because the target file of the current time does not exist yet. But it creates 60 files a minute. What I'd like to do instead, is make sure that all the new data ends up in the same file eventually. What would be a scalable approach to accomplish that? Any ideas, suggestions are welcome!

mtoto
  • 21,499
  • 2
  • 49
  • 64
  • 1
    I don't think this use case is the right for Luigi. As it is designed to handle (large) batch processes. Not a continuos update of files. You could also implement this, as a single task that runs continuously for the time you need (i.e. N hours) and it writes each result of the api to the target file. – mfcabrera Mar 28 '17 at 16:04
  • Yeah, that's what I have come to realize. I will just implement a cronjob that writes to a new file each day and will use Luigi further downstream for processing. Thanks for the suggestion! – mtoto Mar 28 '17 at 19:23
  • 1
    Another idea is to just use a separate marker file to indicate the process is done. You can then delete this file as part of a cleanup task. – MattMcKnight Mar 31 '17 at 16:26
  • You can override the complete method to check for the last timestamp of the last line fl the file. – hirolau Sep 05 '17 at 20:33

2 Answers2

1

You cannot. As the doc for LocalTarget says:

Parameters: mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas w will open the FileSystemTarget in write mode. Subclasses can implement additional options.

I.e. only r or w modes are allowed. Additional options such as a require an extension of the LocalTarget class; despite it breaks the desired idempotency on Luigi task executions.

frb
  • 3,592
  • 2
  • 17
  • 48
0
def output(self):
        return luigi.LocalTarget("data_test_%s.json" % self.date.strftime("%Y-%m-%d_%H:%M"))

It's not the 'luigi way', but it does the job. In the end those targets are just file objects.

Nilesh Deokar
  • 2,777
  • 27
  • 47
Smilo
  • 16