I have a Dataflow pipeline built with Apache Beam on Python 3.7 where I process a file and then have to delete it. The file comes from a Google Cloud Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work, because the google-cloud-storage API is not installed in the Google Dataflow Python 3.7 environment. Do you know how I could delete this file inside my pipeline without using that API? I've seen Apache Beam modules like https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.io.filesystem.html, but I have no idea how to use them and haven't found a tutorial or example for this module.
Felipe Sierra
You should be able to include the GCS Python library when you run the pipeline by following https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. – Brent Worden Jul 17 '20 at 01:55
1 Answer
I don't think you can delete the file from within the running Dataflow job itself. You have to delete it after the Dataflow job has completed. I normally recommend some kind of orchestration, such as Apache Airflow or Google Cloud Composer.
You can make a DAG in Airflow with two tasks:
- a "Custom DAG Workflow" task that runs the Dataflow job, and
- a "Custom Python Code" task that deletes the file once the job finishes.
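The two tasks above could be wired up roughly like this. All names (DAG id, task ids, bucket path, the placeholder callable standing in for the Dataflow launch) are hypothetical; in a real DAG the first task would use one of the Dataflow operators from the Google provider package rather than a stub:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from apache_beam.io.filesystems import FileSystems


def delete_input_file(**_):
    # Placeholder path; Beam's FileSystems deletes the GCS object
    # without needing google-cloud-storage installed.
    FileSystems.delete(["gs://my-bucket/input/file.csv"])


with DAG(
    dag_id="dataflow_then_cleanup",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Stub for the "Custom DAG Workflow" task; in practice this would be
    # a Dataflow operator that launches the Beam pipeline and waits.
    run_dataflow_job = PythonOperator(
        task_id="run_dataflow_job",
        python_callable=lambda **_: None,
    )

    # "Custom Python Code" task: delete the input file afterwards.
    cleanup = PythonOperator(
        task_id="delete_input_file",
        python_callable=delete_input_file,
    )

    run_dataflow_job >> cleanup
```

Because the cleanup task only runs after the Dataflow task succeeds, the file is never deleted while the job is still reading it.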
bigbounty