
I have a Dataflow pipeline built with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Cloud Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work, because the google-cloud-storage API is not installed in the Google Dataflow Python 3.7 environment. Do you guys know how I could delete this file inside my Dataflow pipeline without using this API? I've seen Apache Beam modules like https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.io.filesystem.html, but I have no idea how to use them and haven't found a tutorial or example for this module.
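For context, this is roughly the usage I'm hoping that module supports; a sketch only (the DoFn and path are my own placeholders, and I don't know if this is correct):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class DeleteFileFn(beam.DoFn):
    """Delete the given path once an element reaches this step."""

    def process(self, path):
        # FileSystems resolves the gs:// scheme through the GCS
        # filesystem that ships with the Beam SDK, so no extra
        # google-cloud-storage dependency should be needed.
        if FileSystems.exists(path):
            FileSystems.delete([path])  # takes a list of paths
        yield path
```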

Felipe Sierra
  • You should be able to include the GCS Python library when you run the pipeline by following https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. – Brent Worden Jul 17 '20 at 01:55
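A minimal sketch of that suggestion (project, bucket, and file names are placeholders): with --requirements_file, Dataflow pip-installs the listed packages, such as google-cloud-storage, on every worker.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# requirements.txt, shipped alongside the pipeline, would list:
#   google-cloud-storage
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder project id
    temp_location='gs://my-bucket/tmp',    # placeholder bucket
    requirements_file='requirements.txt',  # installed on each worker
)
```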

1 Answer


I don't think you can delete the file from within the running Dataflow job itself; you have to delete it after the Dataflow job has completed. For that I normally recommend some kind of orchestration, such as Apache Airflow or Google Cloud Composer.

You can make a DAG in Airflow as follows:

[DAG diagram: a "Custom DAG Workflow" task followed by a "Custom Python Code" task]

Here,

"Custom DAG Workflow" runs the Dataflow job.
"Custom Python Code" contains the Python code that deletes the file (a sketch of such a DAG follows the reference below).

Refer to https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
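A minimal sketch of such a DAG, assuming Airflow 2 with the Google provider package installed; the operator choice, bucket, paths, and project are placeholders, not something prescribed by the example repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)
from google.cloud import storage

BUCKET = 'my-bucket'          # placeholder bucket name
FILE_PATH = 'input/file.csv'  # placeholder object path


def delete_input_file():
    # Runs inside the Composer/Airflow environment, where
    # google-cloud-storage can be installed without touching the
    # Dataflow workers.
    storage.Client().bucket(BUCKET).blob(FILE_PATH).delete()


with DAG('dataflow_then_cleanup',
         start_date=datetime(2020, 7, 1),
         schedule_interval=None) as dag:

    # "Custom DAG Workflow": launch the Beam pipeline on Dataflow.
    run_dataflow = DataflowCreatePythonJobOperator(
        task_id='custom_dag_workflow',
        py_file='gs://my-bucket/pipelines/my_pipeline.py',  # placeholder
        job_name='process-then-clean',
        location='us-central1',
    )

    # "Custom Python Code": delete the input file once the job is done.
    cleanup = PythonOperator(
        task_id='custom_python_code',
        python_callable=delete_input_file,
    )

    run_dataflow >> cleanup
```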

bigbounty