Questions tagged [data-pipeline]

95 questions
0
votes
1 answer

Estimate duration of DynamoDB data export via Data Pipeline

My DynamoDB table has around 100 million items (30 GB) and I provisioned it with 10k RCUs. I'm using a Data Pipeline job to export the data, with the Data Pipeline Read Throughput Ratio set to 0.9. How do I calculate the time for the export to be…
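A rough back-of-the-envelope estimate is possible from the numbers in the question alone (assumptions: the export uses eventually consistent scans, where one RCU covers up to 8 KB per second, and EMR cluster startup time is ignored — actual throughput also depends on the cluster size):

```python
# Back-of-the-envelope export duration estimate.
# Assumption: eventually consistent scan, 1 RCU = 8 KB/s (2 x 4 KB).
table_size_gb = 30
provisioned_rcus = 10_000
throughput_ratio = 0.9

effective_rcus = provisioned_rcus * throughput_ratio     # 9,000 RCU/s usable
bytes_per_second = effective_rcus * 8 * 1024             # ~73.7 MB/s scan rate
seconds = table_size_gb * 1024**3 / bytes_per_second
print(f"~{seconds / 60:.0f} minutes")                    # ~7 minutes
```

With strongly consistent reads (4 KB per RCU) the estimate doubles to roughly 15 minutes; in practice cluster provisioning often dominates for a table this small.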
0
votes
0 answers

Session window not working properly in Apache Beam on Cloud Dataflow

The requirement is as follows: we want to track user events and create sessions based on this logic: 30 minutes of inactivity OR the UTC end of the day. For this, we publish all user events to Pub/Sub. In the Apache Beam pipeline, we read the…
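The grouping rule being described can be sketched outside Beam in plain Python (the function below is a hypothetical illustration on sorted timestamps; inside Beam, the calendar-day cutoff needs a custom WindowFn, since the built-in `Sessions` window only handles the inactivity gap — which is the usual reason this "doesn't work"):

```python
from datetime import datetime, timedelta, timezone

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Split sorted UTC timestamps into sessions that close after
    `gap` of inactivity OR at the UTC end of the day."""
    sessions, current = [], []
    for ts in timestamps:
        if current and (ts - current[-1] > gap or ts.date() != current[-1].date()):
            sessions.append(current)   # close the session on gap or day change
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [
    datetime(2021, 1, 1, 23, 50, tzinfo=timezone.utc),
    datetime(2021, 1, 1, 23, 58, tzinfo=timezone.utc),
    datetime(2021, 1, 2, 0, 5, tzinfo=timezone.utc),  # new UTC day -> new session
]
print(len(sessionize(events)))  # 2
```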
0
votes
1 answer

How would a data pipeline using S3 as raw data work?

I am currently using AWS S3 as a data lake to store raw data, which adds about 100 items every minute to the designated bucket. I know the very basics of the data pipeline and ETL concepts, but I am still unfamiliar with the fundamentals, such…
0
votes
1 answer

AWS Datapipeline RDS to S3 Activity Error: Unable to establish connection to jdbc://mysql:

I am currently setting up an AWS Data Pipeline using the RDStoRedshift template. During the first RDStoS3Copy activity I am receiving the following error: "[ERROR]…
kasey
  • 1
  • 1
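The error string itself hints at the likely cause: JDBC connection URLs put the driver name before the scheme separator, so `jdbc://mysql:` is malformed. The expected MySQL form is (host, port, and database are placeholders):

```
jdbc:mysql://<host>:3306/<database>
```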
0
votes
1 answer

How to load a .npy file in a TensorFlow pipeline with tf.data

I'm trying to read my X and y data from .npy files with np.load() in a tf.data pipeline, but I get the following error when I call model.fit(). Does someone have a solution for this problem? I thought I had to give the shape of X_data and y_data to the…
0
votes
0 answers

AWS: Export RDS data to S3 using group by clause

I have the following requirement to copy RDS MySQL data to S3. There are two tables, account and activity, and each activity is associated with an account. I need to export activity to S3 grouped by account, with each account having its separate S3…
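Data Pipeline's SqlActivity has no built-in fan-out by group, so one common approach is a small script that runs one query per account and writes each result set to its own S3 key. A minimal local sketch using only the standard library (`sqlite3` stands in for MySQL; table and column names follow the question, and the actual S3 upload, e.g. via boto3, is left as a comment):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activity (id INTEGER PRIMARY KEY, account_id INTEGER, action TEXT);
    INSERT INTO account VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO activity VALUES (1, 1, 'login'), (2, 1, 'export'), (3, 2, 'login');
""")

exports = {}
for (account_id,) in conn.execute("SELECT id FROM account"):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "account_id", "action"])
    writer.writerows(conn.execute(
        "SELECT id, account_id, action FROM activity WHERE account_id = ?",
        (account_id,)))
    # In the real pipeline, upload each buffer instead of keeping it, e.g.:
    # s3.put_object(Bucket=bucket, Key=f"activity/{account_id}.csv", Body=buf.getvalue())
    exports[f"activity/{account_id}.csv"] = buf.getvalue()

print(sorted(exports))  # ['activity/1.csv', 'activity/2.csv']
```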
0
votes
1 answer

Generate a progressive number when new records are inserted (some records need to have the same number)

The title can be a little confusing. Let me explain the problem. I have a pipeline that loads new records daily. These records contain sales. The key is . This data is loaded into a Redshift table and then exposed…
Simone Giusso
  • 173
  • 2
  • 2
  • 19
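For the general pattern in the title (records sharing a key share a number, each new key gets the next number), Redshift's `DENSE_RANK()` window function is the usual answer; the assignment logic itself fits in a few lines of Python (key values are hypothetical):

```python
def assign_progressive(keys):
    """Give every distinct key the next progressive number,
    so records sharing a key share a number (first-seen order)."""
    numbers = {}
    for key in keys:
        numbers.setdefault(key, len(numbers) + 1)
    return [numbers[key] for key in keys]

print(assign_progressive(["A", "A", "B", "C", "B"]))  # [1, 1, 2, 3, 2]
```

In SQL, `DENSE_RANK() OVER (ORDER BY key)` gives the same number to equal keys, but numbers by key sort order rather than arrival order; preserving arrival order needs ranking by something like the minimum load timestamp per key.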
0
votes
1 answer

Airflow 1.10.13, 2020-11-24 issues after update with pip install

Below is the configuration which worked until December 1st: composer-1.11.2-airflow-1.10.6, Python 3.6, 'dbt==0.17.0', 'google-cloud-storage', 'google-cloud-secret-manager==1.0.0', 'protobuf==3.12.2'. With the above configuration we are observing…
0
votes
2 answers

DynamoDB data load after transforming files. Any AWS service like GCP Dataflow/Apache Beam?

New to AWS. I have a requirement to create a daily batch pipeline: read 6-10 1 GB+ CSV files (each file is an extract of a table from a SQL db), transform each file with some logic and join all files to create one item per id, then load this joined data…
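On the AWS side, Glue is the closest managed analogue to Dataflow for this kind of job; the join step itself is the same wherever it runs. A minimal sketch of "transform each file, then join on id to get one item per id" using only the standard library (two tiny in-memory CSVs stand in for the 1 GB extracts; column names are hypothetical):

```python
import csv
import io

# Stand-ins for two per-table CSV extracts sharing an `id` column.
users_csv = "id,name\n1,ana\n2,bo\n"
orders_csv = "id,total\n1,9.5\n2,3.0\n"

def read_rows(text):
    """Parse a CSV extract into {id: row-dict}."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(text))}

joined = {}
for table in (read_rows(users_csv), read_rows(orders_csv)):
    for key, row in table.items():
        joined.setdefault(key, {}).update(row)  # merge columns per id

print(joined["1"])  # {'id': '1', 'name': 'ana', 'total': '9.5'}
```

The same merge-by-key shape maps directly onto a Glue/Spark `join` or a Beam `CoGroupByKey` once the files no longer fit in memory.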
0
votes
2 answers

Using a correct data pipeline for CloudSQL to BigQuery

I'm really new to this whole data engineering field; I'm taking this on as my thesis project, so bear with me. I'm currently developing a big data platform for a battery storage system that already has CloudSQL services that collect data every…
0
votes
1 answer

"AssertionError: Unrecognized instruction format" while splitting a dataset using Splits API - Tensorflow2.x

Please read the given problem. You need to use subsets of the original cats_vs_dogs data, which is entirely in the 'train' split, i.e. 'train' contains 25,000 records with 1,738 corrupted images, so in total you have 23,262 images. You will split it up…
0
votes
0 answers

How to design a data pipeline for a batch video processing architecture?

I am building a data pipeline for a project, where video is received and processed on some server, where it is stored and processed in batch processing mode with some machine learning tools to recognize objects. Processed video outputs should also…
0
votes
0 answers

AWS Data Pipeline using the same EC2 instance for multiple activities

I've created a Data Pipeline with around 10 CopyActivities, all running on an Ec2Resource. Every time a copy activity runs, it spins up a new EC2 instance in the AWS account. After completion of the activity, the EC2 instance gets…
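One way to avoid the per-activity spin-up is to point every CopyActivity's `runsOn` at the same Ec2Resource object in the pipeline definition, so all activities share one instance (a sketch in the Data Pipeline JSON format; ids and the `terminateAfter` value are placeholders):

```json
{
  "objects": [
    { "id": "SharedEc2", "type": "Ec2Resource", "terminateAfter": "2 Hours" },
    { "id": "CopyActivity1", "type": "CopyActivity", "runsOn": { "ref": "SharedEc2" } },
    { "id": "CopyActivity2", "type": "CopyActivity", "runsOn": { "ref": "SharedEc2" } }
  ]
}
```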
0
votes
2 answers

Jupyter notebooks as Kedro node

How can I use a Jupyter notebook as a node in a Kedro pipeline? This is different from converting functions from Jupyter notebooks into Kedro nodes. What I want to do is use the full notebook as the node.
MCK
  • 11
0
votes
0 answers

How to make data transforms in Singer ETL tool?

I am using the Singer ETL tool for a data pipeline from Postgres to BigQuery. Using tap-postgres I am fetching data, and using target-bigquery I am sinking it to BigQuery. My question is: if I want to make some transformations in the data (like counting…
Joseph N
  • 330
  • 1
  • 11
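Singer itself has no built-in transform step; the common pattern is to pipe the tap's output through a small filter script before the target (`tap-postgres | transform.py | target-bigquery`). A minimal sketch of such a filter, which modifies RECORD messages and passes SCHEMA/STATE messages through untouched (the `name_length` field is a hypothetical example transformation):

```python
import json

def transform(line):
    """Transform one Singer message (a JSON line). RECORD messages get a
    computed field added; everything else passes through unchanged."""
    msg = json.loads(line)
    if msg.get("type") == "RECORD":
        msg["record"]["name_length"] = len(msg["record"].get("name", ""))
    return json.dumps(msg)

# In the pipeline this would loop over sys.stdin and print each result:
#   tap-postgres -c config.json | python transform.py | target-bigquery -c bq.json
record = json.dumps({"type": "RECORD", "stream": "users", "record": {"name": "ana"}})
print(transform(record))
```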