8

I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but am unsure whether this is currently possible in version 1.6 of the Dataflow SDK.

Looking at the BigQuery JSON API, to create a day-partitioned table one needs to pass in a

"timePartitioning": { "type": "DAY" }

option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.

I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda? Is anyone else having success with creating/writing partitioned tables via Dataflow?

This seems like a similar issue to setting the table expiration time which isn't currently available either.

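For context, pre-creating the day-partitioned table itself looks doable with the BigQuery API Java client, along these lines (the project/dataset/table names and the schema are placeholders, and bigquery is an already-authorized client):

Table table = new Table()
    .setTableReference(new TableReference()
        .setProjectId("my-project")
        .setDatasetId("my_dataset")
        .setTableId("my_table"))
    .setSchema(schema) // a TableSchema with the column definitions
    .setTimePartitioning(new TimePartitioning().setType("DAY"));
bigquery.tables().insert("my-project", "my_dataset", table).execute();

The open question is then how to get Dataflow to write into a specific day's partition of that table.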
Sam McVeety
ptf

6 Answers

8

As Pavan says, it is definitely possible to write to partitioned tables with Dataflow. Are you using the DataflowPipelineRunner in streaming mode or batch mode?

The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:

/**
 * A Joda-time formatter that prints a date in format like {@code "20160101"}.
 * Threadsafe.
 */
private static final DateTimeFormatter FORMATTER =
    DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);

// This code generates a valid BigQuery partition name:
Instant instant = Instant.now(); // any Joda instant in a reasonable time range
String baseTableName = "project:dataset.table"; // a valid BigQuery table name
String partitionName =
    String.format("%s$%s", baseTableName, FORMATTER.print(instant));
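
If it helps, here is a rough sketch of how that table function might be wired up in the 1.x SDK, assuming the elements have already been windowed into daily windows (as discussed in the comments below); rows, schema, and the table name are placeholders. Note that, per the comments, in the 1.6 SDK supplying a table function switches the sink to streaming inserts, which did not yet accept partition decorators:

rows.apply(BigQueryIO.Write
    .to(new SerializableFunction<BoundedWindow, String>() {
      @Override
      public String apply(BoundedWindow window) {
        // Each daily window maps to the matching day partition decorator.
        IntervalWindow day = (IntervalWindow) window;
        return "project:dataset.table$" + FORMATTER.print(day.start());
      }
    })
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));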
Dan Halperin
  • This method is very nice, but it only allows controlling the date stamp with parameters outside of the pipeline. What if we wanted to use timestamps from the data itself to split it by dates and then write into the corresponding tables? – nembleton Jul 02 '16 at 02:08
  • @nembleton: If the elements have timestamps, you can use windowing to map them into daily windows. Modify this code: `PCollection windowedItems = items.apply( Window.into(FixedWindows.of(Duration.standardMinutes(10))));`. Then the TableSpecFun which reads the windows will map elements into the correct days. Code is from the [FixedWindows javadoc](https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/windowing/FixedWindows) – Dan Halperin Jul 02 '16 at 03:00
  • Thanks @DanHalperin, that is pretty much what I am doing, including the windowing, but using `.apply(Window.into(CalendarWindows.days(1)))`. The only problem is that since the data can be in different timezones and we want BigQuery to return the data in the original timezone, we do some funkiness in an earlier PTransform with an `outputWithTimestamp` call. – ptf Jul 03 '16 at 22:17
  • @JulianV.Modesto is right, the 1.6 SDK switches to writing to BigQuery in streaming mode if a table reference function is supplied, which doesn't yet allow table decorators. – ptf Jul 06 '16 at 01:00
  • Using BigQuery's streaming write API, I believe that is correct. – Dan Halperin Jul 06 '16 at 16:30
7

The approach I took (it works in streaming mode, too):

  • Define a custom window for the incoming record
  • Convert the window into the table/partition name

    p.apply(PubsubIO.Read
                .subscription(subscription)
                .withCoder(TableRowJsonCoder.of())
            )
            .apply(Window.into(new TablePartitionWindowFn()) )
            .apply(BigQueryIO.Write
                           .to(new DayPartitionFunc(dataset, table))
                           .withSchema(schema)
                           .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            );
    

The window is set based on the incoming data; its end instant can be ignored, as only the start value is used to determine the partition:

public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

    private IntervalWindow assignWindow(AssignContext context) {
        TableRow source = (TableRow) context.element();
        String dttm_str = (String) source.get("DTTM");

        DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();

        // Only the window start matters for the partition, so a one-second window is enough.
        Instant start_point = Instant.parse(dttm_str, formatter);
        Instant end_point = start_point.withDurationAdded(1000, 1);

        return new IntervalWindow(start_point, end_point);
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }

    @Override
    public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
        return Arrays.asList(assignWindow(c));
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return false;
    }

    @Override
    public IntervalWindow getSideInputWindow(BoundedWindow window) {
        if (window instanceof GlobalWindow) {
            throw new IllegalArgumentException(
                    "Attempted to get side input window for GlobalWindow from non-global WindowFn");
        }
        return null;
    }
}

Setting the table partition dynamically:

public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {

    private final String destination;

    public DayPartitionFunc(String dataset, String table) {
        this.destination = dataset + "." + table + "$";
    }

    @Override
    public String apply(BoundedWindow boundedWindow) {
        // The cast below is safe because TablePartitionWindowFn only produces IntervalWindows.
        String dayString = DateTimeFormat.forPattern("yyyyMMdd")
                                         .withZone(DateTimeZone.UTC)
                                         .print(((IntervalWindow) boundedWindow).start());
        return destination + dayString;
    }
}

Is there a better way of achieving the same outcome?

Evgeny Minkevich
3

I believe it should be possible to use the partition decorator when you are not using streaming. We are actively working on supporting partition decorators through streaming. Please let us know if you are seeing any errors today with non-streaming mode.

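A minimal sketch of what that looks like in batch mode with a fixed partition decorator (the table spec and schema are placeholders; this relies on the write going through a load job rather than streaming inserts):

rows.apply(BigQueryIO.Write
    .to("project:dataset.table$20160701")
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));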
Pavan Edara
  • Hi @Pavan, we are using the BlockingDataflowPipelineRunner and running in batch mode, but the BigQueryIO.Write step fails with `400 Bad Request` and `"Table decorators cannot be used with streaming insert."` Is there a way to not use streaming writes to BigQuery? I thought it would actually do a bulk load. And is there a timeline for streaming mode to be supported? – ptf Jul 06 '16 at 00:07
  • Ah, looks like a table reference function causes it to go into streaming mode :( – ptf Jul 06 '16 at 00:10
  • Hi @Pavan, any timeline when table decorators will be supported during streaming ? – manishpal Jul 09 '16 at 16:49
  • Hopefully by the end of this month – Pavan Edara Jul 09 '16 at 21:52
1

Apache Beam version 2.0 supports sharding BigQuery output tables out of the box.

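For example, a minimal sketch with the 2.x API, routing each element to the day partition of its timestamp (the table name is a placeholder and the element timestamp is assumed to carry the event time):

rows.apply(BigQueryIO.writeTableRows()
    .to((ValueInSingleWindow<TableRow> value) -> {
        // Build the partition decorator from the element's timestamp.
        String day = DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC()
            .print(value.getTimestamp());
        return new TableDestination("project:dataset.events$" + day, null);
    })
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));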
Tobi
0

If you pass the table name in table_name_YYYYMMDD format, then BigQuery will treat it as a sharded table, which can simulate partition table features. Refer to the documentation: https://cloud.google.com/bigquery/docs/partitioned-tables

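As a sketch, with this suffix convention the destination is just a plain table name per day (the base name is a placeholder):

String shardedTable = "project:dataset.events_"
    + DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC().print(Instant.now());

rows.apply(BigQueryIO.Write
    .to(shardedTable)
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

Queries can then address a date range across the shards with wildcard tables (`events_*` plus `_TABLE_SUFFIX`).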
Rajesh Hegde
  • Wrong! BigQuery will treat it as a regular table! The only thing that can make you think BigQuery somehow treats such a table specially is that the BigQuery UI combines such tables under one entry `table_name (NN)`, but beyond that it is only up to the user to know the meaning behind such naming, and the table is NOT partitioned based on its name. – Mikhail Berlyant Mar 31 '18 at 15:26
  • @MikhailBerlyant, yes, it will not be a partitioned table, but it will create a sharded table which can simulate partition table features. This is a last resort till Beam provides an option for passing the partition column as a parameter. – Rajesh Hegde Mar 31 '18 at 18:23
  • So, at least the way you updated the answer makes it not that wrong now :o) – Mikhail Berlyant Mar 31 '18 at 18:25
  • You need to use the $ notation too – Gerard Sep 17 '19 at 07:51
0

I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic: if data already exists in a partition, I can either append to it or overwrite it.

I have written the code in Python. It is a batch-mode write operation into BigQuery.

# Assumes the google-cloud-bigquery client library. projectName, datasetName,
# bqTableName, gcsFileName, schemaList and the various flags are supplied by the
# caller, and tableExists() is a small helper that checks whether the table exists.
from google.cloud import bigquery

client = bigquery.Client(project=projectName)
dataset_ref = client.dataset(datasetName)
table_ref = dataset_ref.table(bqTableName)
job_config = bigquery.LoadJobConfig()
job_config.skip_leading_rows = skipLeadingRows
job_config.source_format = bigquery.SourceFormat.CSV
if tableExists(client, table_ref):
    # Table already exists: overwrite its contents (or append, by changing the disposition).
    job_config.autodetect = autoDetect
    previous_rows = client.get_table(table_ref).num_rows
    #assert previous_rows > 0
    if allowJaggedRows is True:
        job_config.allow_jagged_rows = True
    if allowFieldAddition is True:
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
        ]
    if isPartitioned is True:
        job_config.time_partitioning = bigquery.TimePartitioning(type_="DAY")
    if schemaList is not None:
        job_config.schema = schemaList
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
else:
    # Table does not exist yet: let the load job create it, day-partitioned if requested.
    job_config.autodetect = autoDetect
    job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
    job_config.schema = schemaList
    if isPartitioned is True:
        job_config.time_partitioning = bigquery.TimePartitioning(type_="DAY")

load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
assert load_job.job_type == 'load'
load_job.result()
assert load_job.state == 'DONE'
It works fine.