Questions tagged [google-dataflow]

31 questions
3 votes · 2 answers

How to count the number of rows in the input file of a Google Dataflow file-processing job?

I am trying to count the number of rows in an input file, and I am using the Cloud Dataflow runner to create the template. In the code below, I read the file from a GCS bucket, process it, and then store the output in a Redis instance. But I…
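A minimal sketch of the counting side, assuming a hypothetical `gs://my-bucket/input.csv` and one row per line; the Redis write from the question is omitted:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Count every line of the input file and emit a single global total.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.csv')
     | 'CountRows' >> beam.combiners.Count.Globally()
     | 'Print' >> beam.Map(print))
```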
2 votes · 1 answer

Dataflow SQL how to enrich Pub/Sub message

With Dataflow SQL I would like to read a Pub/Sub topic, enrich the message and write the message to a Pub/Sub topic. Which Dataflow SQL query will create my desired output message? Pub/Sub input message: {"event_timestamp":1619784049000,…
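The question asks for a Dataflow SQL statement, which is not reproduced here; as a rough stand-in, a hedged Beam Python sketch of the same enrichment using an in-memory lookup (the topic paths and the lookup table are hypothetical):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical enrichment source, e.g. a sensor-to-location mapping.
LOOKUP = {'sensor-1': 'warehouse-A'}

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/input')
     | 'Parse' >> beam.Map(json.loads)
     | 'Enrich' >> beam.Map(lambda m: {**m, 'location': LOOKUP.get(m.get('sensor_id'))})
     | 'Encode' >> beam.Map(lambda m: json.dumps(m).encode('utf-8'))
     | 'Write' >> beam.io.WriteToPubSub(topic='projects/my-project/topics/output'))
```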
2 votes · 0 answers

Google Dataflow issue

We are newly implementing a data warehouse on Google BigQuery, and all our sources are on-prem databases. So we are using Dataflow for ETL, with Maven and the Apache Beam SDK, to run 30 pipelines on the Google Cloud Dataflow service. package…
1 vote · 1 answer

Delete a file from Google Storage from a Dataflow job

I have a Dataflow pipeline written with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work because…
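One hedged approach, assuming the path is known when the job is launched: delete through Beam's filesystem layer after the pipeline finishes, on the launcher side (the path is hypothetical):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

path = 'gs://my-bucket/input.csv'  # hypothetical input path

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText(path)
     | 'Process' >> beam.Map(lambda line: line.strip()))

# The `with` block waits for the run to finish, so deleting here is safe.
FileSystems.delete([path])
```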
1 vote · 1 answer

How to deploy Google Cloud Dataflow with connection to PostgreSQL (beam-nuggets) from Google Cloud Functions

I'm trying to create an ETL pipeline in GCP that reads part of the data from PostgreSQL and loads it into BigQuery in a suitable form. I was able to perform this task by deploying Dataflow from my computer, but I failed to make it dynamic, so that it will read the last…
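A hedged sketch of the dynamic half: a Cloud Function that launches the stored template through the Dataflow REST API, passing runtime parameters; the project, bucket, and `start_date` parameter are hypothetical:

```python
from googleapiclient.discovery import build

def launch_pipeline(request):
    # Uses the function's default service-account credentials.
    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().templates().launch(
        projectId='my-project',
        gcsPath='gs://my-bucket/templates/pg-to-bq',
        body={
            'jobName': 'pg-to-bq-run',
            'parameters': {'start_date': '2021-01-01'},  # hypothetical runtime parameter
        },
    ).execute()
    return 'ok'
```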
1 vote · 0 answers

How to auto-scale a Google Dataflow (streaming) pipeline?

We have a streaming pipeline running in Google Dataflow. It pulls Pub/Sub messages and saves them into BigQuery. For some reason, over the last few days we have had a backlog; the system lag shows 9-15 hours. I followed the document here and added the following…
Krishna Sunuwar
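A hedged sketch of the autoscaling-related options for a streaming Python job (the values are illustrative; project, region, and temp_location are omitted):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner='DataflowRunner',
    autoscaling_algorithm='THROUGHPUT_BASED',  # let Dataflow scale on backlog
    max_num_workers=20,                        # ceiling the autoscaler may reach
    enable_streaming_engine=True,              # offload state to speed up scaling
)
```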
0 votes · 1 answer

Backfill Beam pipeline with historical data

I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub. In order to bring the pipeline state up to date, and to create the correct outputs,…
Raman
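One common pattern, sketched under assumptions (history archived in GCS as JSON lines, transforms factored out so batch and streaming runs share them): replay the archive in a one-off batch run before resuming the Pub/Sub pipeline:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def apply_shared_logic(events):
    # Stand-in for the pipeline's real transforms.
    return events | 'Process' >> beam.Map(lambda e: e)

# One-off batch run over the archived history.
with beam.Pipeline(options=PipelineOptions()) as p:
    archived = (p
                | 'ReadArchive' >> beam.io.ReadFromText('gs://my-bucket/archive/*.json')
                | 'Parse' >> beam.Map(json.loads))
    apply_shared_logic(archived)
```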
0 votes · 0 answers

Google Dataflow capability from Oracle DB to BigQuery in real time

Is Google Dataflow capable of real-time (or near real-time) streaming from an Oracle DB to Google BigQuery for big tables and highly transactional tables? (Real-time data replication.) (Dataflow looks more suitable for low-transaction apps…
0 votes · 1 answer

AttributeError: module 'apache_beam' has no attribute 'options'

I am getting the following error when running an Apache Beam pipeline. The full error code is: --------------------------------------------------------------------------- AttributeError Traceback (most recent call…
Ekaba Bisong
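This error usually means the installed SDK predates Beam 2.0, where the options module lived under `apache_beam.utils`; a hedged check and the modern import:

```python
import apache_beam as beam
print(beam.__version__)  # versions before 2.0.0 have no apache_beam.options

# Beam 2.x import path:
from apache_beam.options.pipeline_options import PipelineOptions
# Pre-2.0 SDKs used apache_beam.utils.pipeline_options instead.
```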
0 votes · 2 answers

Streaming csv data to BigQuery from Pub/Sub subscription using Dataflow

I am exploring an ETL process with GCP, using the Pub/Sub Subscription to BigQuery template in Dataflow. The message data in the Pub/Sub subscription is in CSV format, as below: 53466,06/30/2020,,Trinidad and Tobago,2020-07-01 04:33:52,130.0,8.0,113.0 This leaves…
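A hedged sketch of parsing the CSV payload in a custom pipeline instead of the stock template; the column names are guesses from the sample row, and the destination table is assumed to exist already:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(payload):
    fields = payload.decode('utf-8').split(',')
    return {'id': int(fields[0]),      # hypothetical column names
            'report_date': fields[1],
            'country': fields[3]}

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/my-sub')
     | 'Parse' >> beam.Map(parse_csv)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.cases',
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```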
0 votes · 0 answers

Google Cloud Dataflow to BigQuery - JavaScript UDF - toLocaleString is not working properly

I am using toLocaleString to get a dateTime in a timezone other than UTC, but for some reason it does not work in the Dataflow process. Here are the details: I am using the Pub/Sub Subscription to BigQuery template. Dataflow fetches data in JSON…
0 votes · 1 answer

How to process two batch files simultaneously with Dataflow on GCP

I want to process two files from GCS in Dataflow simultaneously. I think it would be possible if the second file came in as a side input. However, in that case, I think it would be processed every time, not just once. e.g.) How to read and…
꽥꽥꽥
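A hedged sketch of reading both files exactly once in the same run by merging two reads with Flatten, rather than passing the second file as a side input; the file names are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    first = p | 'ReadA' >> beam.io.ReadFromText('gs://my-bucket/a.csv')
    second = p | 'ReadB' >> beam.io.ReadFromText('gs://my-bucket/b.csv')
    ((first, second)
     | 'Merge' >> beam.Flatten()
     | 'Process' >> beam.Map(lambda line: line))  # shared processing step
```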
0 votes · 1 answer

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's…
speedplane
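A minimal sketch of the Beam-side replacement for a Datastore-backed Mapreduce job, using the Python SDK's v1new Datastore connector; the kind and project are hypothetical:

```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadEntities' >> ReadFromDatastore(Query(kind='MyKind', project='my-project'))
     | 'Map' >> beam.Map(lambda entity: entity))  # stand-in for the old map phase
```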
0 votes · 1 answer

beam.Create() with list of dicts is extremely slow compared to a list of strings

I am using Dataflow to process a Shapefile with about 4 million features (about 2GB total) and load the geometries into BigQuery, so before my pipeline starts, I extract the shapefile features into a list, and initialize the pipeline using…
Travis Webb
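The usual explanation is that `beam.Create` serializes every element into the pipeline graph, which is far more expensive for dicts than for short strings; a hedged sketch of the common workaround, staging the features as newline-delimited JSON and reading them with a real source (the staging path is hypothetical):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

STAGED = 'gs://my-bucket/features.jsonl'  # written out before the pipeline launches

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText(STAGED)
     | 'Parse' >> beam.Map(json.loads)          # back to one dict per feature
     | 'Load' >> beam.Map(lambda feature: feature))
```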
0 votes · 1 answer

Error 401 with Cloud Scheduler while passing Dataflow template as URL via POST request

I have created a custom template for Dataflow batch jobs. Now I need to run it every 5 minutes using Cloud Scheduler. The template is stored in Cloud Storage, but I'm getting a 401 error whenever I pass the URI of the template in my POST request from…
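The 401 typically means the request reached the Dataflow API without credentials; the Scheduler job needs an OAuth token (for example, a service account configured on its HTTP target). A hedged sketch of the equivalent authenticated call, with hypothetical bucket and template names:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
session = AuthorizedSession(credentials)  # attaches a bearer token to each request

url = (f'https://dataflow.googleapis.com/v1b3/projects/{project}'
       '/templates:launch?gcsPath=gs://my-bucket/templates/my-template')
resp = session.post(url, json={'jobName': 'scheduled-run', 'parameters': {}})
print(resp.status_code, resp.json())
```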