13

My initial files are in AWS S3. Could someone point me how I need to setup this in a Luigi Task?

I reviewed the documentation and found luigi.S3 but is not clear for me what to do with that, then I searched in the web and only get links from mortar-luigi and implementation in top of luigi.

UPDATE

After following the example provided for @matagus (I created the ~/.boto file as suggested too):

# coding: utf-8

import luigi

from luigi.s3 import S3Target, S3Client

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/19170205.txt')

class ProcessS3File(luigi.Task):

    def requieres(self):
        return MyS3File()

    def output(self):
        return luigi.LocalTarget('/tmp/resultado.txt')

    def run(self):
        result = None

        for input in self.input():
           print("Doing something ...")
           with input.open('r') as f:
               for line in f:
                   result = 'This is a line'

        if result:
            out_file = self.output().open('w')
            out_file.write(result)

When I execute it nothing happens

DEBUG: Checking if ProcessS3File() is complete
INFO: Informed scheduler that task   ProcessS3File()   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) running   ProcessS3File()
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) done      ProcessS3File()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   ProcessS3File()   has status   DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) was stopped. Shutting down Keep-Alive thread

As you can see, the message Doing something... never prints. What is wrong?

nanounanue
  • 6,692
  • 7
  • 36
  • 65

1 Answers1

20

The key here is to define an External Task that has no inputs and which outputs are those files you already have in living in S3. Luigi docs mention this in Requiring another Task:

Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class

So, basically you end up with something like this:

import luigi

from luigi.s3 import S3Target

from somewhere import do_something_with


class MyS3File(luigi.ExternalTask):

    def output(self):
        return luigi.S3Target('s3://my-bucket/path/to/file')

class ProcessS3File(luigi.Task):

    def requires(self):
        return MyS3File()

    def output(self):
        return luigi.S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        result = None
        # this will return a file stream that reads the file from your aws s3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)

        # and the you 
        out_file = self.output().open('w')
        # it'd better to serialize this result before writing it to a file, but this is a pretty simple example
        out_file.write(result)

UPDATE:

Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/boto (look for other possible config file locations here):

[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
matagus
  • 5,430
  • 2
  • 24
  • 37
  • There are some problems with your code, please, Could you fix them? (e.g. the `return` in the first `output` method is wrong should be `return S3Target(...` – nanounanue Oct 31 '15 at 15:13
  • Another issue. In which part I should provide my `aws credentials`? – nanounanue Oct 31 '15 at 15:13
  • Done updating my answer. Hope it helps. – matagus Oct 31 '15 at 15:31
  • Should I use `S3Client` or something similar? – nanounanue Oct 31 '15 at 16:03
  • It's better if you use Luigi Targets. Tasks consume Targets that were created by some other task and they usually also output targets. And that way you can delegate to Luigi all retries and dependencies handling logic, which is why it was created. – matagus Oct 31 '15 at 16:23
  • I just open another [question](https://stackoverflow.com/questions/33579409/s3-file-to-local-using-luigi-raises-unicodedecodeerror) about an `UnicodeDecodeError` Do you have any idea? – nanounanue Nov 07 '15 at 05:20
  • Doesn't this solution read the entire content of the input file into memory before uploading it to S3? (or not?) If so, it isn't suitable for large datasets. – DavidJ Aug 08 '16 at 18:03
  • 1
    @DavidJ Luigi uses multipart upload with [64Mb chunks](https://github.com/spotify/luigi/blob/master/luigi/s3.py#L222) by default but you may specify another size by passing `part_size` parameter to `S3Target`. So the answer is yes, this is suitable for large files. – matagus Aug 08 '16 at 19:12
  • @matagus, I am getting following error "FileNotFoundError: [Errno 2] No such file or directory: 's3://PATH_TO_S3BUCKET_WITH_FILENAME", but when reading it read properly from S3BUCKET, can you help me with this? – Ganesh Deshvini Oct 18 '16 at 11:40