Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
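For orientation, a minimal sketch of what Beam I/O looks like with the Java SDK's TextIO; the bucket paths are placeholders, not a prescribed layout:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MinimalBeamIO {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Read lines from a source location and write them to a destination.
    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"))
     .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"));

    p.run().waitUntilFinish();
  }
}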

373 questions
0
votes
1 answer

Issues while using Snappy for tensorflow preprocessing using BeamIO

While using Apache Beam IO for preprocessing data, the snappy library was a good-to-have module for compression, but the file transformation doesn't seem to work, as it cannot find the crc32 compress function in the library! I'm using…
0
votes
1 answer

Apache Beam Dataflow Reading big CSV with splittable=True causing duplicate entries

I used the code snippet below to read CSV files into the pipeline as dicts. class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource): def read_records(self, file_name, range_tracker): self._file = self.open_file(file_name) …
0
votes
2 answers

Read a pickle from another pipeline in Beam?

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to save objects is pickle / dill. The writing works well, writing a number of files, each with a…
Maximilian
  • 4,783
  • 1
  • 31
  • 38
0
votes
0 answers

Apache Hive integration with Apache Beam

I am doing a POC to connect to Apache Hive from an Apache Beam pipeline, and I am getting an exception similar to the one in the SO link below. I changed the version of the JDBC driver to the latest, but I am still facing the issue. As mentioned in the link below, it…
0
votes
1 answer

Apache Beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is the Kafka partition itself (no longer using ZooKeeper from Kafka > 0.9). With this setup, when I restart the job/pipeline,…
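For reference, a hedged sketch of the kind of setup this question is about, assuming a recent Beam Java SDK where KafkaIO.Read exposes commitOffsetsInFinalize(); brokers, topic, and group id below are placeholders:

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaOffsetCommitExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092,broker-2:9092")  // placeholder brokers
        .withTopic("my-topic")                                // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // A consumer group is required so there is somewhere to commit offsets to.
        .withConsumerConfigUpdates(
            ImmutableMap.<String, Object>of("group.id", "my-consumer-group"))
        // Commit consumed offsets back to Kafka when a bundle is finalized, so a
        // restarted pipeline can resume near where the previous run stopped.
        .commitOffsetsInFinalize());

    p.run().waitUntilFinish();
  }
}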
0
votes
1 answer

Dataflow GroupBy -> multiple outputs based on keys

Is there any simple way that I can redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<…>create()) .apply(ParDo.named("Print Bins").of( ... )…
AmirCS
  • 181
  • 1
  • 12
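One possible approach to the "GroupBy -> multiple outputs" question above is to skip the explicit GroupByKey and let FileIO.writeDynamic route elements to per-key files. A sketch assuming Beam Java 2.3+; the input data and output directory are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class WritePerKeyFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(Create.of(
            KV.of("binA", "record-1"),
            KV.of("binA", "record-2"),
            KV.of("binB", "record-3")))
     .apply(FileIO.<String, KV<String, String>>writeDynamic()
        // Route each element to a destination named after its key.
        .by((KV<String, String> kv) -> kv.getKey())
        .withDestinationCoder(StringUtf8Coder.of())
        // Write the value of each element as one text line.
        .via(Contextful.fn((KV<String, String> kv) -> kv.getValue()), TextIO.sink())
        .to("./bins-output")  // placeholder output directory
        .withNaming(key -> FileIO.Write.defaultNaming("bin-" + key, ".txt")));

    p.run().waitUntilFinish();
  }
}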
0
votes
0 answers

Using "DISTINCT" functionality in DataStoreIO.read with Apache Beam Java SDK

I am running a Dataflow job (Apache Beam SDK 2.1.0 Java, Google Dataflow runner) and I need to read from Google Datastore "distinctly" on one particular property (like the good old "DISTINCT" keyword in SQL). Here is my code snippet:…
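A hedged sketch of one way to approach the question above: read the entities with DatastoreIO, project the property, and de-duplicate inside the pipeline with Beam's Distinct transform. The kind, project id, and property name are placeholders:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DatastoreDistinctProperty {
  public static void main(String[] args) {
    Query.Builder query = Query.newBuilder();
    query.addKindBuilder().setName("MyKind");  // placeholder kind

    Pipeline p = Pipeline.create();
    p.apply(DatastoreIO.v1().read()
            .withProjectId("my-project")       // placeholder project
            .withQuery(query.build()))
     // Project the single property of interest...
     .apply(MapElements.into(TypeDescriptors.strings())
            .via((Entity entity) ->
                entity.getPropertiesOrThrow("myProperty").getStringValue()))
     // ...and de-duplicate it inside the pipeline, since the query API used
     // here does not offer a SQL-style DISTINCT.
     .apply(Distinct.create());

    p.run().waitUntilFinish();
  }
}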
0
votes
1 answer

Apache Beam Template: Runtime Context Error

I'm currently trying to create a Dataflow template based on the Apache Beam SDK v2.1.0, following the Google tutorial. This is my main class: public static void main(String[] args) { // Initialize options DispatcherOptions options =…
0
votes
1 answer

Apache Beam 2.1.0 with Google DatastoreIO calls Guava Preconditions checkArgument on non-existing function in GAE

When building a Dataflow template which should read from Datastore, I get the following error in Stackdriver logs (from Google App Engine): java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;I)V …
0
votes
1 answer

Apache Beam Program execution without using Maven

I want to run a simple example Beam program using the Apache Spark runner. 1) I was able to compile the program locally without issues. 2) I want to push the JAR file to a QA box where Maven is not installed. 3) I see the example with the Maven command…
VIjay
  • 97
  • 7
0
votes
1 answer

BigtableIO Read keys with a given prefix

I'm looking for the best way of reading all the rows with a given prefix. I see that there is a withKeyRange method in BigtableIO.Read, but it requires you to specify a start key and an end key. Is there a way to specify reading from a prefix?
Narek
  • 363
  • 5
  • 23
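For the BigtableIO prefix question above, one common workaround is to turn the prefix into a key range whose end key is the prefix's "successor". A sketch assuming a recent Beam Java SDK; the project, instance, table, and prefix are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.range.ByteKey;
import org.apache.beam.sdk.io.range.ByteKeyRange;

public class BigtablePrefixRead {

  // Smallest key strictly greater than every key starting with this prefix:
  // drop trailing 0xff bytes, then increment the last remaining byte.
  static ByteKey prefixSuccessor(byte[] prefix) {
    byte[] copy = prefix.clone();
    for (int i = copy.length - 1; i >= 0; i--) {
      if (copy[i] != (byte) 0xff) {
        copy[i]++;
        return ByteKey.copyFrom(Arrays.copyOf(copy, i + 1));
      }
    }
    return ByteKey.EMPTY;  // all-0xff prefix: leave the range open-ended
  }

  public static void main(String[] args) {
    byte[] prefix = "user#42#".getBytes(StandardCharsets.UTF_8);  // placeholder prefix
    ByteKeyRange range =
        ByteKeyRange.of(ByteKey.copyFrom(prefix), prefixSuccessor(prefix));

    Pipeline p = Pipeline.create();
    p.apply(BigtableIO.read()
        .withProjectId("my-project")    // placeholder project
        .withInstanceId("my-instance")  // placeholder instance
        .withTableId("my-table")        // placeholder table
        .withKeyRange(range));
    p.run().waitUntilFinish();
  }
}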
0
votes
1 answer

Apache Beam MongoDB source

I have a Beam pipeline which has MongoDB as a source, but when I try to run it, it throws an exception. An exception occurred while executing the Java class. null: InvocationTargetException:…
guru107
  • 913
  • 1
  • 9
  • 23
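For reference on the MongoDB question above, a minimal MongoDbIO read sketch; the URI, database, and collection are placeholders, and it assumes the beam-sdks-java-io-mongodb module is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

public class MongoDbReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Each element of the resulting PCollection is one BSON Document.
    PCollection<Document> docs = p.apply(MongoDbIO.read()
        .withUri("mongodb://localhost:27017")  // placeholder connection URI
        .withDatabase("my_db")                 // placeholder database
        .withCollection("my_collection"));     // placeholder collection

    p.run().waitUntilFinish();
  }
}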
0
votes
1 answer

Google Dataflow only partly uncompressing files compressed with pbzip2

seq 1 1000000 > testfile bzip2 -kz9 testfile mv testfile.bz2 testfile-bzip2.bz2 pbzip2 -kzb9 testfile mv testfile.bz2 testfile-pbzip2.bz2 gsutil cp testfile gs://[bucket] gsutil cp testfile-bzip2.bz2 gs://[bucket] gsutil cp testfile-pbzip2.bz2…
0
votes
1 answer

Error streaming from Pub/Sub into BigQuery (Python)

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two: apache_beam.io.gcp.pubsub.PubSubSource apache_beam.io.gcp.bigquery.BigQuerySink into lines 59 and 74 respectively in the…
0
votes
2 answers

How do I run a Beam class in Dataflow which accesses a Google Cloud SQL instance?

When I run my pipeline from my local machine, I can update the table which resides in the Cloud SQL instance. But when I moved this to run using DataflowRunner, the same is failing with the exception below. To connect from my Eclipse, I created the…
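A hedged sketch of one way to read Cloud SQL from a Beam pipeline: JdbcIO with the Cloud SQL MySQL socket factory in the JDBC URL, which avoids whitelisting the Dataflow workers' ephemeral IPs. It assumes the beam-sdks-java-io-jdbc and mysql-socket-factory dependencies are on the classpath; instance name, credentials, and query are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

public class CloudSqlReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",
                // The socket factory lets workers connect without IP whitelisting.
                "jdbc:mysql://google/my_db?cloudSqlInstance=my-project:us-central1:my-instance"
                    + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
            .withUsername("beam_user")            // placeholder credentials
            .withPassword("secret"))
        .withQuery("SELECT name FROM my_table")   // placeholder query
        .withRowMapper(resultSet -> resultSet.getString(1))
        .withCoder(StringUtf8Coder.of()));

    p.run().waitUntilFinish();
  }
}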