
I have a pipeline that takes one GCS file as input and generates two GCS output files: one contains error info and the other contains normal info.
I also have a Cloud Function with a GCS trigger on the two output files.
I want to process the normal info file only when the error info file is 0 bytes.
So the error info file must be generated before the normal info file, so that its size can be checked.

Currently I use two TextIO.Write transforms to generate the two files, but I cannot control which one is written first.
In the Cloud Function triggered by the normal info file, I check the size of the error info file with retries.
However, Cloud Functions has a timeout limit of 540 s, so I cannot keep retrying until the error info file is generated.
How can I handle this in Cloud Dataflow?
Can I programmatically make the error info file be generated before the normal info file?

ender1986

1 Answer


You can accomplish sequencing like this by using side inputs. For example:

import apache_beam as beam

error_pcoll = ...
good_data_pcoll = ...

error_write_result = error_pcoll | beam.io.WriteToText(...)
(good_data_pcoll
 | beam.Map(
       # This lambda simply emits what it was given.
       lambda element, blocking_side: element,
       # This side input isn't used,
       # but will force error_write_result to be computed first.
       blocking_side=beam.pvalue.AsIterable(error_write_result))
 | beam.io.WriteToText(...))

The Wait PTransform encapsulates this pattern.
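Since the question uses the Java SDK, here is a minimal Java sketch of the same ordering with Wait.on. Note that a plain TextIO.write() returns PDone, which Wait.on cannot consume; this sketch assumes the withOutputFilenames() variant, which returns a WriteFilesResult whose filenames PCollection can serve as the wait signal. The bucket paths and input PCollections are placeholders.

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> errors = ...;   // placeholder: error records
PCollection<String> normal = ...;   // placeholder: normal records

// withOutputFilenames() makes the write return a WriteFilesResult
// instead of PDone, so the write completion can be observed.
WriteFilesResult<Void> errorWrite =
    errors.apply(
        TextIO.write()
            .to("gs://my-bucket/errors")  // placeholder path
            .withOutputFilenames());

normal
    // Block until the error file has been fully written...
    .apply(Wait.on(errorWrite.getPerDestinationOutputFilenames()))
    // ...then write the normal info file.
    .apply(TextIO.write().to("gs://my-bucket/normal"));  // placeholder path
```

With this ordering, by the time the normal info file triggers the Cloud Function, the error info file already exists and its size can be checked without retrying.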

robertwb
  • I read the doc of the Wait PTransform. The sample code in the doc ("Wait.on(firstWriteResults)") needs a PCollection parameter, but the result of TextIO.Write is PDone. – ender1986 Feb 17 '21 at 01:46
  • Oh, I forgot that Java's Write isn't this flexible. You could try using writeCustomType: https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/TextIO.html#writeCustomType-- – robertwb Feb 17 '21 at 07:38