
I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub.

In order to bring the pipeline state up to date, and to create the correct outputs, there is a significant amount of historical event data which must be processed first. This historical data is available via JDBC. In testing, I am able to use the JdbcIO.Read PTransform to read and handle all historical state, but I'd like to initialize my production pipeline using this JDBC event data, and then cleanly transition to reading events from Pub/Sub. This same process may happen again in the future if the pipeline logic is ever altered in a backward incompatible way.

Note that while this historical read is happening, new events continue to arrive in Pub/Sub (and these end up in the database as well), so there needs to be a clean cutover: only historical events read from JDBC, and only newer events read from Pub/Sub.
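For concreteness, here is roughly the shape of the bounded historical read I have working in testing. This is a minimal sketch only: the Event class, JDBC driver, connection URL, query, and column names are placeholders, not my real schema.

```java
import java.io.Serializable;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

/** Placeholder event record; field names are illustrative only. */
class Event implements Serializable {
  String id;
  String payload;
  long timestampMillis;

  Event(String id, String payload, long timestampMillis) {
    this.id = id;
    this.payload = payload;
    this.timestampMillis = timestampMillis;
  }
}

public class HistoricalRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Bounded read of all historical events from the database.
    PCollection<Event> historical = p.apply("ReadHistoricalEvents",
        JdbcIO.<Event>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://db-host/events"))
            .withQuery("SELECT id, payload, event_ts FROM events")
            .withRowMapper(rs -> new Event(
                rs.getString("id"),
                rs.getString("payload"),
                rs.getTimestamp("event_ts").getTime()))
            .withCoder(SerializableCoder.of(Event.class)));

    // ... downstream processing of `historical` goes here ...

    p.run().waitUntilFinish();
  }
}
```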

Some approaches I have considered:

  1. Have a pipeline that reads from both inputs, keeping only JDBC data before a certain cutoff timestamp and only Pub/Sub data after that cutoff. Once the pipeline is caught up, deploy an update that removes the JDBC input.

    I don't think this will work because removal of an I/O transform is not backward compatible. Alternatively, the JDBC part of the pipeline would have to stay there forever, burning CPU cycles for no good reason.

  2. Write a one-time job that populates Pub/Sub with the entirety of the historical data, and then starts the main pipeline reading only from Pub/Sub.

    This seems to use more Pub/Sub resources than necessary, and I think newer data interleaved in the pipeline with much older data will cause watermarks to be advanced too early.

  3. A variation of option #2: stop creating new events until the historical data has been processed, to avoid messing up watermarks.

    This requires downtime.

It seems like it would be a common requirement to backfill historical data into a pipeline, but I haven't been able to find a good approach to this.

Raman

1 Answer


Your first option, reading from a bounded source (filtered to timestamp <= cutoff) and Pub/Sub (filtered to timestamp > cutoff), should work well.

Because JdbcIO.Read is a bounded source, it will read all the data and then "finish", i.e. never produce any more data, advance its watermark to +infinity, and never be invoked again (so there's no concern about it consuming CPU cycles, even if it stays in your graph).
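A minimal sketch of that wiring, reusing the hypothetical Event type and placeholder connection details from the sketch in the question (the subscription name, cutoff value, and parseEvent deserialization step are likewise placeholders):

```java
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Instant;

// Inside your pipeline-construction method, with Pipeline p already created:

// A single cutover point: events at or before it come from JDBC,
// events after it come from Pub/Sub.
final Instant cutoff = Instant.parse("2021-05-01T00:00:00Z");

// Bounded backfill branch. Once exhausted, this source advances its
// watermark to +infinity and is never invoked again.
PCollection<Event> historical = p
    .apply("ReadJdbc", JdbcIO.<Event>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://db-host/events"))
        .withQuery("SELECT id, payload, event_ts FROM events")
        .withRowMapper(rs -> new Event(
            rs.getString("id"),
            rs.getString("payload"),
            rs.getTimestamp("event_ts").getTime()))
        .withCoder(SerializableCoder.of(Event.class)))
    .apply("AtOrBeforeCutoff",
        Filter.by((Event e) -> e.timestampMillis <= cutoff.getMillis()));

// Unbounded live branch: events after the cutoff, read from Pub/Sub.
PCollection<Event> live = p
    .apply("ReadPubsub", PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/events"))
    .apply("ParseEvent", MapElements
        .into(TypeDescriptor.of(Event.class))
        .via((String json) -> parseEvent(json)))  // parseEvent is a placeholder deserializer
    .setCoder(SerializableCoder.of(Event.class))
    .apply("AfterCutoff",
        Filter.by((Event e) -> e.timestampMillis > cutoff.getMillis()));

// Merge both branches into the single event stream the rest of the pipeline consumes.
PCollection<Event> events = PCollectionList.of(historical).and(live)
    .apply("MergeSources", Flatten.pCollections());
```

Everything downstream hangs off the merged `events` collection, so the rest of the pipeline doesn't need to know which source an element came from.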

robertwb
  • When reading from JDBC, all timestamps will by default be -infinity. How are timers being used? – robertwb May 11 '21 at 21:43
  • I'm looking into some issues with Google Dataflow support, but I don't think these are related to the general approach, which seems like it should work. Thanks. – Raman May 17 '21 at 18:36