Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

373 questions
8 votes, 1 answer

Difference between beam.ParDo and beam.Map in the output type?

I am using Apache Beam to run some data transformations, including data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample: I am…
Soliman
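For anyone comparing the two transforms: beam.Map wraps a strictly one-to-one function, while beam.ParDo (and its beam.FlatMap shorthand) treats what the function yields or returns as a collection of output elements. A minimal sketch with hypothetical sample data:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['a b', 'c d'])

    # beam.Map emits exactly one output element per input element,
    # so each output here is a whole list of words.
    as_lists = lines | 'AsLists' >> beam.Map(lambda line: line.split())

    # beam.ParDo / beam.FlatMap may emit zero or more elements per
    # input; each word becomes its own output element.
    as_words = lines | 'AsWords' >> beam.FlatMap(lambda line: line.split())
```

With the same splitting function, the Map output has two list elements while the FlatMap output has four string elements.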
8 votes, 6 answers

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but am unsure whether this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day-partitioned table one needs…
ptf
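The 1.6 Dataflow SDK predates built-in support for this, but recent Beam releases let partitioning configuration pass through at table-creation time. A sketch using the Python SDK's WriteToBigQuery (project, table, and schema names are placeholders):

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    rows = p | beam.Create([{'ts': '2020-01-01T00:00:00', 'value': 1.0}])
    rows | WriteToBigQuery(
        'my_project:my_dataset.my_table',   # placeholder table spec
        schema='ts:TIMESTAMP,value:FLOAT',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        # Forwarded to the table-creation request:
        additional_bq_parameters={'timePartitioning': {'type': 'DAY'}})
```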
6 votes, 3 answers

Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line csv file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot get it to work. def print_each_line(line): print…
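ReadFromText always splits its input on newlines, so records that span lines cannot be recovered downstream. A common workaround, sketched below under the assumption that each file fits in memory, is to match the files and read each one whole:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    whole_files = (
        p
        | fileio.MatchFiles('gs://my-bucket/input/*.csv')  # placeholder pattern
        | fileio.ReadMatches()
        | beam.Map(lambda readable: readable.read_utf8()))  # one element per file
```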
5 votes, 1 answer

How to solve Duplicate values exception when I create PCollectionView&lt;Map&lt;String, String&gt;&gt;

I'm setting up a slowly changing lookup Map in my Apache Beam pipeline, which continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits an exception:…
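This exception typically means the map-valued view saw the same key more than once across trigger firings. A sketch of the usual fix, reducing to a single value per key before building the view, shown with a Python analogue of View.asMap (sample data is hypothetical):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Hypothetical sample: the same key arrives with old and new values.
    updates = p | beam.Create([('k', (1, 'old')), ('k', (2, 'new'))])

    # Map-valued views (Java View.asMap, Python AsDict) reject duplicate
    # keys, so first reduce to one value per key -- here the latest,
    # by the natural (timestamp, value) ordering.
    latest = updates | beam.CombinePerKey(max)

    lookup = beam.pvalue.AsDict(latest)
    (p
     | beam.Create(['k'])
     | beam.Map(lambda key, side: side[key][1], side=lookup))
```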
5 votes, 2 answers

How to specify insertId when streaming insert to BigQuery using Apache Beam

BigQuery supports de-duplication for streaming insert. How can I use this feature using Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency To help ensure data consistency, you can supply insertId for each…
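Beam's BigQueryIO generates its own random insertIds for streaming inserts and, at the time of this question, did not expose a per-row override. For context, the sketch below shows the raw API knob that insertId maps to, using the google-cloud-bigquery client directly (table and rows are placeholders):

```python
from google.cloud import bigquery

# insertId corresponds to the row_ids argument of the raw
# streaming-insert API; BigQuery deduplicates rows sharing an id
# within a short window.
client = bigquery.Client()
errors = client.insert_rows_json(
    'my_project.my_dataset.my_table',       # placeholder table
    [{'user': 'alice', 'value': 1}],
    row_ids=['stable-id-for-this-row'])     # dedup key per row
```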
5 votes, 1 answer

Google Dataflow (Apache Beam) JdbcIO bulk insert into MySQL database

I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk…
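Newer Java SDKs expose JdbcIO.Write#withBatchSize for this; where that isn't available, the usual workaround is a hand-rolled batching DoFn. A sketch of the buffering pattern (the bulk driver call is a placeholder):

```python
import apache_beam as beam

class BulkInsertFn(beam.DoFn):
    """Client-side batching sketch: buffer rows within a bundle and
    flush them with one bulk statement instead of row-by-row."""

    BATCH_SIZE = 1000

    def start_bundle(self):
        self._rows = []

    def process(self, row):
        self._rows.append(row)
        if len(self._rows) >= self.BATCH_SIZE:
            self._flush()

    def finish_bundle(self):
        # Flush whatever remains when the bundle ends.
        self._flush()

    def _flush(self):
        if self._rows:
            # Placeholder for the driver's bulk API, e.g.
            # cursor.executemany(INSERT_SQL, self._rows)
            self._rows = []
```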
5 votes, 1 answer

How can I improve performance of TextIO or AvroIO when reading a very large number of files?

TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files. How can I read…
jkff
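The Java answer here is usually withHintMatchesManyFiles() or TextIO.readAll(). The Python SDK has the same idea: hand the pattern to the pipeline as data so that expansion and reading are themselves distributed. A sketch (the pattern is a placeholder):

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    lines = (
        p
        | beam.Create(['gs://my-bucket/logs/*.txt'])  # placeholder pattern
        | ReadAllFromText())  # reads the matched files in parallel
```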
4 votes, 1 answer

Comparing objects using PAssert containsInAnyOrder in Apache Beam

While writing unit tests for my Beam pipeline using PAssert, the pipeline outputs objects fine, but the test fails during comparison with the following assertion error: java.lang.AssertionError: Decode pubsub…
Zain Qasmi
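A common cause of this failure is a custom element type that never defines equality, since PAssert.containsInAnyOrder compares elements with equals(). The Python testing utilities behave the same way with ==, as this sketch shows:

```python
import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to

# Hypothetical element type: without __eq__ (Java: equals/hashCode),
# two structurally identical instances never compare equal and the
# in-any-order assertion fails.
class Event(object):
    def __init__(self, uid):
        self.uid = uid

    def __eq__(self, other):
        return isinstance(other, Event) and self.uid == other.uid

    def __hash__(self):
        return hash(self.uid)

with beam.Pipeline() as p:
    out = p | beam.Create([Event('a'), Event('b')])
    assert_that(out, equal_to([Event('b'), Event('a')]))  # order-insensitive
```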
4 votes, 1 answer

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into Spanner with SpannerIO. The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs. When I don't use any windows, I get the following error: Exception in thread…
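The no-windows error is the usual complaint about grouping an unbounded PCollection in the global window: the stream must be windowed or triggered before a buffering sink can commit batches. A sketch of the windowing step (topic name is a placeholder); its output would then feed the Spanner write:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    windowed = (
        p
        | beam.io.ReadFromPubSub(topic='projects/p/topics/t')  # placeholder
        | beam.WindowInto(
            window.FixedWindows(10),  # 10-second batches, per the question
            trigger=trigger.AfterWatermark(),
            accumulation_mode=trigger.AccumulationMode.DISCARDING))
```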
4 votes, 2 answers

Read and Write serialized protobuf in Beam

I suppose it should be fairly easy to write a PCollection of serialized protobuf messages into text files and read them back, but I failed to do so after a few attempts. I would appreciate any comments. // definition of proto. syntax =…
greeness
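One likely culprit: serialized protobufs are raw bytes that may themselves contain newline bytes, which corrupts line-oriented text files. A sketch of one workaround, base64-encoding each message onto a single line (MyProto stands in for the generated message class):

```python
import base64
import apache_beam as beam

def to_line(msg):
    # One newline-free ASCII line per message.
    return base64.b64encode(msg.SerializeToString()).decode('ascii')

def from_line(line, proto_cls):
    # Inverse: decode the line and parse the original message.
    return proto_cls.FromString(base64.b64decode(line))

# Writing:  messages | beam.Map(to_line) | beam.io.WriteToText('out')
# Reading:  beam.io.ReadFromText('out*') | beam.Map(from_line, MyProto)
```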
4 votes, 1 answer

HTTP Client in DoFn

I would like to make POST requests through a DoFn for an Apache Beam pipeline running on Dataflow. For that, I have created a client which instantiates an HttpClosableClient configured on a PoolingHttpClientConnectionManager. However, I instantiate a…
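The usual pattern is to create one pooled client per DoFn instance in setup() (Java: @Setup) rather than per element, and close it in teardown(). A Python-flavored sketch with the requests library (the endpoint URL is a placeholder):

```python
import apache_beam as beam
import requests

class PostFn(beam.DoFn):
    """One HTTP session per DoFn instance; connections are pooled
    and reused across elements instead of rebuilt per call."""

    def setup(self):
        self._session = requests.Session()

    def process(self, payload):
        resp = self._session.post('https://api.example.com/ingest',  # placeholder
                                  json=payload, timeout=30)
        resp.raise_for_status()
        yield resp.status_code

    def teardown(self):
        self._session.close()
```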
4 votes, 2 answers

Writing an unbounded collection to GCS

I have seen many questions on the same topic, but I am still having problems writing to GCS. I am reading the topic from Pub/Sub and trying to push it to GCS. I have referred to this link, but couldn't find IOChannelUtils in the latest…
Balu
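IOChannelUtils was removed after the old Dataflow SDKs. A sketch of the current streaming pattern: window the stream, then write with the streaming-aware fileio sink (topic and bucket names are placeholders):

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | beam.io.ReadFromPubSub(topic='projects/p/topics/t')  # placeholder
     | beam.Map(lambda b: b.decode('utf-8'))
     | beam.WindowInto(window.FixedWindows(60))   # one file set per minute
     | fileio.WriteToFiles(
         path='gs://my-bucket/output/',           # placeholder bucket
         sink=lambda dest: fileio.TextSink()))    # plain text lines
```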
4 votes, 1 answer

Group elements in Apache Beam pipeline

I have a pipeline that parses records from AVRO files. I need to split the incoming records into chunks of 500 items in order to call an API that accepts multiple inputs at the same time. Is there a way to do this with the Python SDK?
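The Python SDK ships a transform for exactly this: BatchElements groups a PCollection into lists capped at a target size. A sketch:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    batches = (
        p
        | beam.Create(range(2000))  # stand-in for the parsed AVRO records
        | beam.BatchElements(min_batch_size=500, max_batch_size=500)
        | beam.Map(len))  # each element is now a list of up to 500 items
```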
4 votes, 1 answer

Acknowledge Google Pub/Sub message on Apache Beam

I'm trying to read from Pub/Sub with the following code: Read pubsub =…
njLT
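There is no per-message ack call in the Beam model; the runner acknowledges Pub/Sub messages itself once the bundle that read them has been durably committed. A minimal read sketch (the subscription path is a placeholder):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    # Messages are acked by the runner after a successful checkpoint,
    # not by user code element by element.
    msgs = p | beam.io.ReadFromPubSub(
        subscription='projects/p/subscriptions/s')
```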
3 votes, 1 answer

How to generate GCS files one after another with Google Cloud Dataflow and Java?

I have a pipeline with one GCS file as input that generates two GCS output files. One output file contains error info and the other contains normal info. And I have a Cloud Function with a GCS trigger on the two output files. I want to do something…