I am working on building my first Luigi pipeline, and I am currently testing tasks individually before building my dependencies. During testing, I am using a version of the following main method to build a task:

if __name__ == "__main__":

    headers = dict()
    headers["Content-Type"] = "application/json"
    headers["Accept"] = "application/json"

    luigi.build([CSVValidator(jsonfile='/sample_input/sample_csv.json',
                              docfile=None,
                              error_limit=2,
                              order_fields=3,
                              output_file='validation_is_us.txt',
                              header=headers)])

    luigi.run()

This is what my csv_validator looks like:

import json
import time

import luigi
import requests


class CSVValidator(luigi.Task):
    jsonfile = luigi.Parameter()
    docfile = luigi.Parameter()
    error_limit = luigi.Parameter()
    order_fields = luigi.Parameter()
    output_file = luigi.Parameter()
    header = luigi.DictParameter()

    def output(self):
        return luigi.LocalTarget(self.output_file + "/csv_validator_data_%s.txt" % time.time())

    def run(self):
        output_file = self.output().open('w')
        files = {}
        data = {}
        files["jsonfile"] = open(self.jsonfile, 'rb')
        files["docfile"] = open(self.docfile, 'rb')
        data["error_limit"] = self.error_limit
        data["order_fields"] = self.order_fields
        r = requests.post(*****~~~~~*****~~~~~,
                      headers=self.header,
                      data=data, files=files)
        task_response = r.text.encode(encoding="UTF-8")
        print(type(task_response))
        print(task_response)
        jsontaskdata = json.loads(task_response)
        json.dump(jsontaskdata, output_file)
        print("validated")
        output_file.close()

This task, however, is never actually run. Instead, the Luigi central scheduler reports that the task is already complete:

===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 complete ones were encountered:
    - 1 CSVValidator(...)
* 1 ran successfully:
    - 1 Downloader(...)

This progress looks :) because there were no failed tasks or missing dependencies

Other tasks I have created, Downloader for example, do run successfully every time. What defines a complete task here? I don't understand what it means.

Thanks for your time!

Matthieu Brucher
Meg Anderson

1 Answer

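The Target object returned by a task's output() method determines whether the task is complete. By default, Task.complete() returns True when every Target returned by output() reports that it exists, and Luigi does not call run() on a task that is already complete.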

What exists() checks depends on the Target: it may be whether a certain output file is already present, or some other condition, including the availability of external resources. For example, in luigi.contrib.esindex the target performs its check against a remote Elasticsearch cluster, and a successful check tells you that the task (CopyToIndex) is complete.
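
To make the rule concrete, here is a minimal sketch of the completeness check using only the standard library. It is a simplified model, not Luigi's actual code: FileTarget, SketchTask and build are hypothetical stand-ins for luigi.LocalTarget, luigi.Task and luigi.build, and the file name is made up for illustration.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for luigi.LocalTarget: exists when the file is on disk."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class SketchTask:
    """Hypothetical stand-in for luigi.Task using the default completeness rule."""

    def __init__(self, output_path):
        self.output_path = output_path

    def output(self):
        return FileTarget(self.output_path)

    def complete(self):
        # Complete means: the output target already exists.
        return self.output().exists()

    def run(self):
        with open(self.output_path, "w") as handle:
            handle.write("validated\n")


def build(task):
    """Call run() only if the task is not complete; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


workdir = tempfile.mkdtemp()
task = SketchTask(os.path.join(workdir, "validation.txt"))
first_run = build(task)   # no output yet, so run() is called
second_run = build(task)  # output exists now, so the task is skipped
print(first_run, second_run)
```

In Luigi itself, the same check can be made by calling complete() on a task instance, or exists() on the target returned by its output() method, before building it.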

You might also want to look at this answer: https://stackoverflow.com/a/34638943/4125622

And this discussion: https://github.com/spotify/luigi/issues/595

Phil