I'm using the Dataflow SDK 2.x Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I can't find any option in the official documentation to enable a bulk insert mode.

I'm wondering whether it's possible to set a bulk insert mode in a Dataflow pipeline. If so, please let me know what I need to change in the code below.

 .apply(JdbcIO.<KV<Integer, String>>write()
      .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb"))
      .withStatement("insert into Person values(?, ?)")
      .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        public void setParameters(KV<Integer, String> element, PreparedStatement query)
            throws SQLException {
          // Bind the key and value of each element to the two placeholders
          query.setInt(1, element.getKey());
          query.setString(2, element.getValue());
        }
      }));
  • I'm confused: the code you included *reads* data, rather than inserts: you're using JdbcIO.read(). Did you mean to include a different code snippet? If you use JdbcIO.write(), it automatically batches the writes into up to 1000 elements (it can end up being fewer in practice, depending on the structure of your pipeline, the runner, your data arrival rate etc.). – jkff Dec 08 '17 at 17:20
  • Thanks for your response @jkff. Is there any way to update the number of elements to be inserted in batch? – Michael Dec 09 '17 at 08:37
  • Currently no. Is it too much or too little for your needs? – jkff Dec 09 '17 at 17:57
  • It's too little for my requirement. – Michael Dec 11 '17 at 05:28
  • Hmm, you mean that there's a substantial performance gain from using a larger value? I'm curious what value you would suggest and how much faster it makes the whole pipeline end to end? You can try that by just making a copy of JdbcIO and editing it. – jkff Dec 11 '17 at 14:31

1 Answer


EDIT 2018-01-27:

It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
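To make the effect of bundle size concrete, here is a simplified model in plain Java. It is not Beam's actual code: it assumes a writer that queues rows, flushes as soon as 1,000 rows are pending, and flushes whatever is left at the end of each bundle. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BundleFlushModel {
    // Assumed cap, taken from the "up to 1,000 records" figure above.
    static final int MAX_BATCH_SIZE = 1000;

    /** Returns the size of every batch sent when the given bundles are processed in order. */
    static List<Integer> batchSizes(int[] bundleSizes) {
        List<Integer> sent = new ArrayList<>();
        for (int bundleSize : bundleSizes) {
            int pending = 0;
            for (int i = 0; i < bundleSize; i++) {
                pending++;
                if (pending == MAX_BATCH_SIZE) {
                    // Cap reached: flush in the middle of the bundle
                    sent.add(pending);
                    pending = 0;
                }
            }
            if (pending > 0) {
                // End of bundle: flush whatever is left
                sent.add(pending);
            }
        }
        return sent;
    }
}
```

Under this model, three bundles of one element each produce three single-row batches, while one bundle of 2,500 elements produces batches of 1,000, 1,000 and 500 rows. That matches the observation that bundles of size 1 lead to single-row writes.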

Original answer:

I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).

I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.

p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    // Count words in input file(s)
    .apply(new CountWords())
    // Format as text
    .apply(MapElements.via(new FormatAsTextFn()))
    // Make key-value pairs with the first letter as the key
    .apply(ParDo.of(new FirstLetterAsKey()))
    // Group the words by first letter
    .apply(GroupByKey.<String, String> create())
    // Get a PCollection of only the values, discarding the keys
    .apply(ParDo.of(new GetValues()))
    // Write the words to the database
    .apply(JdbcIO.<String> writeIterable()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
            .withPreparedStatementSetter(new WordCountPreparedStatementSetter()));

The difference from the normal write method of JdbcIO is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of a PCollection<RowT>. Each Iterable is written as one batch to the database.
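As a rough illustration of what "written as one batch" means at the JDBC level, here is a minimal sketch in plain JDBC. It is not the code from the fork: the class name, the method name, and the (Integer, String) row shape borrowed from the question's Person table are assumptions made for the example.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

final class IterableBatchSketch {
    /**
     * Queues every row of one group on the statement and sends them together.
     * Returns the number of rows that were queued.
     */
    static int writeGroup(PreparedStatement statement,
                          Iterable<Map.Entry<Integer, String>> group) throws SQLException {
        int queued = 0;
        for (Map.Entry<Integer, String> row : group) {
            statement.setInt(1, row.getKey());
            statement.setString(2, row.getValue());
            statement.addBatch(); // queue the row on the statement, nothing is executed yet
            queued++;
        }
        if (queued > 0) {
            statement.executeBatch(); // submit the whole group in one call
        }
        return queued;
    }
}
```

The statement would come from something like connection.prepareStatement("insert into Person values(?, ?)"). With this shape, the size of each batch is decided entirely by how the upstream grouping splits the data.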

The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java

The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example

(There is also a pull request pending on Apache Beam to include this in the project)

Knut Olav Loite