
I have multiple text files in an S3 bucket which I read and process, so I defined a PartitionedDataSet in the Kedro data catalog, which looks like this:

raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    sep: "\t"
    comment: "#"

In addition, I implemented this solution to get all secrets from the credentials file via environment variables, including the AWS secret keys.

When I run things locally with kedro run, everything works just fine, but when I build a Docker image (using kedro-docker) and run the pipeline in the Docker environment with kedro docker run, providing all environment variables via the --docker-args option, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 724, in main
    cli_collection()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kedro/kedro_cli.py", line 230, in run
    pipeline_name=pipeline,
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 767, in run
    raise exc
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 759, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
    self._run(pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, catalog, self._is_async, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 213, in run_node
    node = _run_node_sequential(node, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in _run_node_sequential
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in <dictcomp>
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 392, in load
    result = func()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", line 213, in load
    return self._load()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/partitioned_data_set.py", line 240, in _load
    raise DataSetError("No partitions found in `{}`".format(self._path))
kedro.io.core.DataSetError: No partitions found in `s3://reads/raw`

Note: the pipeline works just fine in the Docker environment if I move the files to a local directory, define the PartitionedDataSet on that local path, build the Docker image and provide the environment variables through --docker-args.
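
(A quick way to check whether the S3 listing itself works inside the container, independently of Kedro, is the sketch below. It assumes the same environment variables are available in the container and reuses the path from the catalog entry; listing the prefix is roughly what PartitionedDataSet does before raising "No partitions found".)

import fsspec

# s3fs (used by fsspec for the "s3" protocol) picks up AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION from the environment.
fs = fsspec.filesystem("s3")
print(fs.ls("s3://reads/raw"))  # an empty list or an error points at credentials/region rather than Kedro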

mendo
  • Kedro uses the `fsspec` library to read the files from the location you specify in S3; for some reason `fsspec` fails to find any data in the path that you configured. Can you confirm the following: a) that the path is constructed correctly and the bucket name and common key prefix are valid, b) that the location does indeed have some files in it, c) that you haven't configured `filename_suffix` for your partitioned dataset in the catalog, d) that the keys you pass as the environment variables have enough permissions to read the data from your S3 bucket? – Dmitry Deryabin Sep 23 '20 at 13:07
  • Also, would be great if you could post the `kedro docker` command where you pass the environment variables (with the keys truncated, obviously) – Dmitry Deryabin Sep 23 '20 at 13:10
  • @DmitryDeryabin Thank you for your reply. a) Yes, I would say they are, because for the dataset defined in the catalog I obtain the dictionary with partition ids located in the designated S3 folder using context.io.load(), and when I run the pipeline outside Docker it executes normally, with the data being loaded and processed. This also answers b) (but I double-checked anyway and the files are there) and d). I have done all this outside Docker. For c), the dataset is defined as in the question above, using only: type, path, dataset and those two load_args. – mendo Sep 23 '20 at 13:31
  • and this is the `kedro docker` command I have used `kedro docker run --docker-args="--env AWS_ACCESS_KEY_ID=XXXXXXX --env AWS_SECRET_ACCESS_KEY=XXXXXXX --env USER=XXXXXXX --env PASSWORD=XXXXXX --env SERVERNAME=XXXXXX --env PORT=XXX --env NAME=XXXX"` – mendo Sep 23 '20 at 13:34
  • @DmitryDeryabin I found the problem: the `AWS_DEFAULT_REGION` environment variable was missing from the `kedro docker run` command. – mendo Sep 24 '20 at 06:30
  • Oh, that's great @mendo! Glad that it's resolved now. It's still kinda weird that `AWS_DEFAULT_REGION` affects how `fsspec` lists your bucket since s3 is a global service not tied to a region... – Dmitry Deryabin Sep 24 '20 at 14:06
  • I am not an AWS expert, but it could be how my AWS access account was setup. – mendo Sep 25 '20 at 06:15

1 Answer


The solution (at least in my case) was to provide the AWS_DEFAULT_REGION environment variable in the kedro docker run command.
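
Based on the command shown in the comments above (values are placeholders), the fixed invocation would look roughly like this; the region value is just an example, use whichever region your account/bucket is set up with:

kedro docker run --docker-args="--env AWS_ACCESS_KEY_ID=XXXXXXX --env AWS_SECRET_ACCESS_KEY=XXXXXXX --env AWS_DEFAULT_REGION=eu-west-1 --env USER=XXXXXXX --env PASSWORD=XXXXXX --env SERVERNAME=XXXXXX --env PORT=XXX --env NAME=XXXX"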

mendo