Questions tagged [data-pipeline]

95 questions
2
votes
1 answer

Python psycopg2: Copy result of query to another table

I am having a problem with psycopg2 in Python. I have two disparate connections with corresponding cursors: 1. Source connection - source_cursor 2. Destination connection - dest_cursor. Let's say there is a select query that I want to execute on…
skybunk
2
votes
1 answer

Is it possible to create EMR cluster with Auto scaling using Data pipeline

I am new to AWS. I have created an EMR cluster using an Auto Scaling policy through the AWS console. I have also created a data pipeline which can use this cluster to perform the activities. I am also able to create an EMR cluster dynamically through data…
2
votes
1 answer

How to configure AWS data pipeline using serverless.yml?

I am new to both Data Pipeline and Serverless. I want to know how I can automate AWS Data Pipeline using Serverless. Below is my diagram of the AWS data pipeline, which exports a DynamoDB table to S3
2
votes
1 answer

luigi upstream task should run once to create input for set of downstream tasks

I have a nice straight working pipe, where the task I run via luigi on the command line triggers all the required upstream data fetch and processing in its proper sequence till it trickles out into my database. class IMAP_Fetch(luigi.Task): …
ib4u
2
votes
0 answers

airflow big dag_pickle table

I set up a test installation of Airflow a while ago with one test DAG, which is in a paused state. Now, after this system ran for some weeks without actually doing much (besides some test runs), I wanted to dump the database and realized it is…
2
votes
1 answer

How can we provision number of core instances in AWS Data Pipeline job

Requirement: Restore a DynamoDB table from an S3 backup location. We created a Data Pipeline job and then edited the Resources section in the Architect wizard. We set the Core Instance count to 20, but after the Data Pipeline job activation, EMR Cluster…
1
vote
0 answers

Best data pipeline framework

What is the best data pipeline framework that fits the following requirements? Open source / free to use; data pipelines need to be created using Python (should support Geopandas, Pandas, Numpy, ...); support for manual and time-triggered pipelines; Web…
MartinV
1
vote
1 answer

insert into SQL Server table using python from CSV and Text file

I am trying to insert data from a CSV file and also from a text file into SQL Server (SSMS version 18.7). Below is my code. import pyodbc import csv conn = pyodbc.connect('Driver={SQL Server};' 'Server=????;' …
1
vote
1 answer

Why did the amount of data from BigQuery decrease noticeably without any change in GA/Firebase options?

I use BigQuery to get raw data from GA and Firebase. I used to get about 100,000 to 200,000 rows of log data from BigQuery, but since last week I have been getting only about 1,000 rows. I didn't change any options for GA,…
1
vote
1 answer

Copy and Extracting Zipped XML files from HTTP Link Source to Azure Blob Storage using Azure Data Factory

I am trying to set up an Azure Data Factory copy data pipeline. The source is an open HTTP Linked Source (URL reference: https://clinicaltrials.gov/AllPublicXML.zip). So basically the source contains a zipped folder with many XML files. I want…
1
vote
1 answer

Airflow on Google Cloud Composer vs Docker

I can't find much information on what the differences are in running Airflow on Google Cloud Composer vs Docker. I am trying to switch our data pipelines that are currently on Google Cloud Composer onto Docker to just run locally but am trying to…
1
vote
1 answer

How should I keep track of total loss while training a network with a batched dataset?

I am attempting to train a discriminator network by applying gradients to its optimizer. However, when I use a tf.GradientTape to find the gradients of loss w.r.t training variables, None is returned. Here is the training loop: def train_step(): …
1
vote
1 answer

Replication pipeline to replicate data from MySql RDS to Redshift

My problem is to create a replication pipeline that replicates tables and data from MySQL RDS to Redshift, and I cannot use any managed service. Also, any new updates in RDS should be replicated to the Redshift tables as well. After looking at…
1
vote
0 answers

Google Data Fusion: "Looping" over input data to then execute multiple Restful API calls per input row

I have the following challenge I would like to solve preferably in Google Data Fusion: I have one web service that returns about 30-50 elements describing an invoice in a JSON payload like this: { "invoice-services": [ { "serviceId":…
JensU
1
vote
1 answer

How to import Pascal VOC 2012 segmentation dataset to Google Colab?

I am new to building data pipelines. I want to import the Pascal VOC dataset into Google Colab. Can someone please point me to a good Google Colab/Jupyter notebook file?