Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
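For orientation, a minimal sketch of what Beam I/O looks like with the Java SDK's TextIO; the bucket paths are placeholders, not a prescribed layout:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MinimalBeamIO {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Read lines from a source location and write them to a destination.
    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"))
     .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"));

    p.run().waitUntilFinish();
  }
}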

373 questions
0
votes
1 answer

Issues while using Snappy for tensorflow preprocessing using BeamIO

While using Apache Beam IO for preprocessing data, the snappy library was a good-to-have module for compression, but the file transformation doesn't seem to work, as it cannot find the crc32 compress function in the library! I'm using…
0
votes
1 answer

Apache Beam Dataflow Reading big CSV with splittable=True causing duplicate entries

I used the code snippet below to read CSV files into the pipeline as dicts. class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource): def read_records(self, file_name, range_tracker): self._file = self.open_file(file_name) …
0
votes
2 answers

Read a pickle from another pipeline in Beam?

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to save objects is pickle / dill. The writing works well, writing a number of files, each with a…
Maximilian
  • 4,783
  • 1
  • 31
  • 38
0
votes
0 answers

Apache Hive integration with Apache Beam

I am doing a POC to connect to Apache Hive from an Apache Beam pipeline, and I am getting an exception similar to the one in the SO link below. I changed the version of the JDBC driver to the latest, but I am still facing the issue. As mentioned in the link below, it…
0
votes
1 answer

Apache Beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is the Kafka partition itself (no longer using ZooKeeper from Kafka > 0.9). With this setup, when I restart the job/pipeline,…
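For reference, a hedged sketch of the kind of setup this question is about, assuming a recent Beam Java SDK where KafkaIO.Read exposes commitOffsetsInFinalize(); brokers, topic, and group id below are placeholders:

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaOffsetCommitExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092,broker-2:9092")  // placeholder brokers
        .withTopic("my-topic")                                // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // A consumer group is required so there is somewhere to commit offsets to.
        .withConsumerConfigUpdates(
            ImmutableMap.<String, Object>of("group.id", "my-consumer-group"))
        // Commit consumed offsets back to Kafka when a bundle is finalized, so a
        // restarted pipeline can resume near where the previous run stopped.
        .commitOffsetsInFinalize());

    p.run().waitUntilFinish();
  }
}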
0
votes
1 answer

Dataflow GroupBy -> multiple outputs based on keys

Is there any simple way that I can redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<…>create()) .apply(ParDo.named("Print Bins").of( ... )…
AmirCS
  • 181
  • 1
  • 12
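One possible approach to the "GroupBy -> multiple outputs" question above is to skip the explicit GroupByKey and let FileIO.writeDynamic route elements to per-key files. A sketch assuming Beam Java 2.3+; the input data and output directory are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class WritePerKeyFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(Create.of(
            KV.of("binA", "record-1"),
            KV.of("binA", "record-2"),
            KV.of("binB", "record-3")))
     .apply(FileIO.<String, KV<String, String>>writeDynamic()
        // Route each element to a destination named after its key.
        .by((KV<String, String> kv) -> kv.getKey())
        .withDestinationCoder(StringUtf8Coder.of())
        // Write the value of each element as one text line.
        .via(Contextful.fn((KV<String, String> kv) -> kv.getValue()), TextIO.sink())
        .to("./bins-output")  // placeholder output directory
        .withNaming(key -> FileIO.Write.defaultNaming("bin-" + key, ".txt")));

    p.run().waitUntilFinish();
  }
}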
0
votes
0 answers

Using "DISTINCT" functionality in DataStoreIO.read with Apache Beam Java SDK

I am running a Dataflow job (Apache Beam SDK 2.1.0 Java, Google Dataflow runner) and I need to read from Google Datastore "distinctly" on one particular property (like the good old "DISTINCT" keyword in SQL). Here is my code snippet:…
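A hedged sketch of one way to approach the question above: read the entities with DatastoreIO, project the property, and de-duplicate inside the pipeline with Beam's Distinct transform. The kind, project id, and property name are placeholders:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DatastoreDistinctProperty {
  public static void main(String[] args) {
    Query.Builder query = Query.newBuilder();
    query.addKindBuilder().setName("MyKind");  // placeholder kind

    Pipeline p = Pipeline.create();
    p.apply(DatastoreIO.v1().read()
            .withProjectId("my-project")       // placeholder project
            .withQuery(query.build()))
     // Project the single property of interest...
     .apply(MapElements.into(TypeDescriptors.strings())
            .via((Entity entity) ->
                entity.getPropertiesOrThrow("myProperty").getStringValue()))
     // ...and de-duplicate it inside the pipeline, since the query API used
     // here does not offer a SQL-style DISTINCT.
     .apply(Distinct.create());

    p.run().waitUntilFinish();
  }
}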
0
votes
1 answer

Apache Beam Template: Runtime Context Error

I'm currently trying to create a Dataflow template based on the Apache Beam SDK v2.1.0, following the Google tutorial. This is my main class: public static void main(String[] args) { // Initialize options DispatcherOptions options =…
0
votes
1 answer

Apache Beam 2.1.0 with Google DatastoreIO calls Guava Preconditions checkArgument on non-existing function in GAE

When building a Dataflow template which should read from Datastore, I get the following error in Stackdriver logs (from Google App Engine): java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;I)V …
0
votes
1 answer

Apache Beam Program execution without using Maven

I want to run a simple example Beam program using the Apache Spark runner. 1) I was able to compile the program locally without issues. 2) I want to push the JAR file to a QA box where Maven is not installed. 3) I see the example with the Maven command…
VIjay
  • 97
  • 7
0
votes
1 answer

BigtableIO Read keys with a given prefix

I'm looking for the best way of reading all the rows with a given prefix. I see that there is a withKeyRange method in BigtableIO.Read, but it requires you to specify a start key and an end key. Is there a way to specify reading from a prefix?
Narek
  • 363
  • 5
  • 23
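For the BigtableIO prefix question above, one common workaround is to turn the prefix into a key range whose end key is the prefix's "successor". A sketch assuming a recent Beam Java SDK; the project, instance, table, and prefix are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.range.ByteKey;
import org.apache.beam.sdk.io.range.ByteKeyRange;

public class BigtablePrefixRead {

  // Smallest key strictly greater than every key starting with this prefix:
  // drop trailing 0xff bytes, then increment the last remaining byte.
  static ByteKey prefixSuccessor(byte[] prefix) {
    byte[] copy = prefix.clone();
    for (int i = copy.length - 1; i >= 0; i--) {
      if (copy[i] != (byte) 0xff) {
        copy[i]++;
        return ByteKey.copyFrom(Arrays.copyOf(copy, i + 1));
      }
    }
    return ByteKey.EMPTY;  // all-0xff prefix: leave the range open-ended
  }

  public static void main(String[] args) {
    byte[] prefix = "user#42#".getBytes(StandardCharsets.UTF_8);  // placeholder prefix
    ByteKeyRange range =
        ByteKeyRange.of(ByteKey.copyFrom(prefix), prefixSuccessor(prefix));

    Pipeline p = Pipeline.create();
    p.apply(BigtableIO.read()
        .withProjectId("my-project")    // placeholder project
        .withInstanceId("my-instance")  // placeholder instance
        .withTableId("my-table")        // placeholder table
        .withKeyRange(range));
    p.run().waitUntilFinish();
  }
}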
0
votes
1 answer

Apache Beam MongoDB source

I have a Beam pipeline which has MongoDB as a source, but when I try to run it, it throws an exception. An exception occurred while executing the Java class. null: InvocationTargetException:…
guru107
  • 913
  • 1
  • 9
  • 23
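For reference on the MongoDB question above, a minimal MongoDbIO read sketch; the URI, database, and collection are placeholders, and it assumes the beam-sdks-java-io-mongodb module is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

public class MongoDbReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Each element of the resulting PCollection is one BSON Document.
    PCollection<Document> docs = p.apply(MongoDbIO.read()
        .withUri("mongodb://localhost:27017")  // placeholder connection URI
        .withDatabase("my_db")                 // placeholder database
        .withCollection("my_collection"));     // placeholder collection

    p.run().waitUntilFinish();
  }
}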
0
votes
1 answer

Google Dataflow only partly uncompressing files compressed with pbzip2

seq 1 1000000 > testfile bzip2 -kz9 testfile mv testfile.bz2 testfile-bzip2.bz2 pbzip2 -kzb9 testfile mv testfile.bz2 testfile-pbzip2.bz2 gsutil cp testfile gs://[bucket] gsutil cp testfile-bzip2.bz2 gs://[bucket] gsutil cp testfile-pbzip2.bz2…
0
votes
1 answer

Error streaming from Pub/Sub into BigQuery (Python)

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two: apache_beam.io.gcp.pubsub.PubSubSource apache_beam.io.gcp.bigquery.BigQuerySink into lines 59 and 74 respectively in the…
0
votes
2 answers

How do I run a Beam class in Dataflow which accesses a Google Cloud SQL instance?

When I run my pipeline from my local machine, I can update the table which resides in the Cloud SQL instance. But when I moved this to run using DataflowRunner, the same is failing with the exception below. To connect from my Eclipse, I created the…
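A hedged sketch of one way to read Cloud SQL from a Beam pipeline: JdbcIO with the Cloud SQL MySQL socket factory in the JDBC URL, which avoids whitelisting the Dataflow workers' ephemeral IPs. It assumes the beam-sdks-java-io-jdbc and mysql-socket-factory dependencies are on the classpath; instance name, credentials, and query are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

public class CloudSqlReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",
                // The socket factory lets workers connect without IP whitelisting.
                "jdbc:mysql://google/my_db?cloudSqlInstance=my-project:us-central1:my-instance"
                    + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
            .withUsername("beam_user")            // placeholder credentials
            .withPassword("secret"))
        .withQuery("SELECT name FROM my_table")   // placeholder query
        .withRowMapper(resultSet -> resultSet.getString(1))
        .withCoder(StringUtf8Coder.of()));

    p.run().waitUntilFinish();
  }
}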