Questions tagged [data-pipeline]
95 questions
2
votes
1 answer
Python psycopg2: Copy result of query to another table
I am having a problem with psycopg2 in Python.
I have two disparate connections with corresponding cursors:
1. Source connection - source_cursor
2. Destination connection - dest_cursor
Let's say there is a SELECT query that I want to execute on…
skybunk
- 603
- 7
- 16
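A common pattern for the question above is to stream rows from the source cursor in batches and bulk-insert them through the destination cursor. Below is a minimal sketch using sqlite3 as a stand-in (psycopg2 cursors expose the same DB-API methods; the table and column names here are hypothetical):

```python
import sqlite3

# Two independent connections, standing in for the source and
# destination Postgres databases in the question.
source_conn = sqlite3.connect(":memory:")
dest_conn = sqlite3.connect(":memory:")

source_cursor = source_conn.cursor()
dest_cursor = dest_conn.cursor()

# Hypothetical schema and sample data on the source side.
source_cursor.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
source_cursor.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

dest_cursor.execute("CREATE TABLE events_copy (id INTEGER, payload TEXT)")

# Stream the SELECT result in batches and bulk-insert each batch,
# so a large result set never sits fully in memory.
source_cursor.execute("SELECT id, payload FROM events")
while True:
    batch = source_cursor.fetchmany(1000)
    if not batch:
        break
    dest_cursor.executemany("INSERT INTO events_copy VALUES (?, ?)", batch)
dest_conn.commit()
```

With psycopg2 the placeholder style is `%s` rather than `?`, and for large tables `copy_expert` with `COPY ... TO STDOUT` on the source and `COPY ... FROM STDIN` on the destination is usually much faster.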
2
votes
1 answer
Is it possible to create an EMR cluster with auto scaling using Data Pipeline?
I am new to AWS. I have created an EMR cluster with an auto scaling policy through the AWS console. I have also created a data pipeline which can use this cluster to perform the activities.
I am also able to create an EMR cluster dynamically through data…
Bharani
- 329
- 1
- 6
- 17
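For reference, when the cluster is created programmatically (for example from a script or Lambda instead of the pipeline's built-in EmrCluster resource), an auto-scaling policy can be attached per instance group in boto3's `run_job_flow` request. This sketch only builds the request payload and makes no AWS call; the role names, instance types, and scaling limits are placeholders:

```python
# Request payload for boto3's emr_client.run_job_flow(**request).
# Only the dict is built here; names and roles are placeholders.
request = {
    "Name": "pipeline-emr-cluster",
    "ReleaseLabel": "emr-5.30.0",
    "ServiceRole": "EMR_DefaultRole",          # placeholder role
    "JobFlowRole": "EMR_EC2_DefaultRole",      # placeholder instance profile
    "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
                # Scale the core group between 2 and 10 nodes based on
                # available YARN memory.
                "AutoScalingPolicy": {
                    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
                    "Rules": [{
                        "Name": "scale-out-on-low-memory",
                        "Action": {"SimpleScalingPolicyConfiguration": {
                            "ScalingAdjustment": 1,
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            "CoolDown": 300,
                        }},
                        "Trigger": {"CloudWatchAlarmDefinition": {
                            "ComparisonOperator": "LESS_THAN",
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Period": 300,
                            "Threshold": 15.0,
                        }},
                    }],
                },
            },
        ],
    },
}
```

Whether Data Pipeline's own EmrCluster resource honours such a policy depends on the pipeline object version, which is part of what the question is asking.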
2
votes
1 answer
How do I configure AWS Data Pipeline using serverless.yml?
I am new to both Data Pipeline and Serverless. I want to know how I can automate an AWS data pipeline using Serverless. Below is my diagram of the AWS data pipeline, which exports a DynamoDB table to S3.
deosha
- 832
- 5
- 19
2
votes
1 answer
luigi upstream task should run once to create input for a set of downstream tasks
I have a nice, straight, working pipeline, where the task I run via luigi on the command line triggers all the required upstream data fetching and processing in its proper sequence until it trickles out into my database.
class IMAP_Fetch(luigi.Task):
…
ib4u
- 43
- 5
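In luigi the usual way to get this behaviour is to have every downstream task `requires()` the same upstream task: luigi checks `complete()` (i.e. whether the output target already exists) before scheduling, so the shared upstream runs only once. The scheduling idea can be illustrated in plain Python (a sketch of the semantics, not luigi itself):

```python
# Plain-Python illustration of luigi's run-once semantics: a task is
# skipped when its output already "exists", so a shared upstream runs
# a single time no matter how many downstream tasks require it.
run_counts = {"fetch": 0}
outputs = {}  # stands in for luigi output targets on disk

def fetch():
    """Shared upstream task (e.g. the IMAP fetch in the question)."""
    if "fetch" in outputs:          # complete() -> skip re-running
        return outputs["fetch"]
    run_counts["fetch"] += 1
    outputs["fetch"] = ["msg1", "msg2"]
    return outputs["fetch"]

def downstream(name):
    data = fetch()                  # each task requires() the same upstream
    return f"{name} processed {len(data)} messages"

# Three downstream tasks, but the upstream fetch runs exactly once.
results = [downstream(n) for n in ("parse", "index", "load")]
```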
2
votes
0 answers
airflow big dag_pickle table
I set up a test installation of Airflow a while ago with one test DAG, which is in a paused state.
Now, after this system ran for some weeks without actually doing much (besides some test runs), I wanted to dump the database and realized it is…
Alexander Köb
- 904
- 1
- 8
- 19
2
votes
1 answer
How can we provision the number of core instances in an AWS Data Pipeline job?
Requirement: Restore a DynamoDB table from an S3 backup location.
We created a Data Pipeline job and then edited the Resources section in the Architect wizard.
We placed 20 instances under Core Instance Count, but after the Data Pipeline job's activation, the EMR cluster…
u1234
- 81
- 1
- 11
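For context, the core node count lives on the EmrCluster resource object in the pipeline definition itself. A sketch of how that object looks in the put-pipeline-definition JSON (the field keys are the documented EmrCluster fields; the ids and instance types here are placeholders):

```python
# EmrCluster resource object as it appears in a Data Pipeline
# definition (the put_pipeline_definition payload). Ids and values
# are placeholders; the "coreInstanceCount" field is the knob the
# question is about.
emr_cluster = {
    "id": "EmrClusterForRestore",
    "name": "EmrClusterForRestore",
    "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "releaseLabel", "stringValue": "emr-5.23.0"},
        {"key": "masterInstanceType", "stringValue": "m3.xlarge"},
        {"key": "coreInstanceType", "stringValue": "m3.xlarge"},
        {"key": "coreInstanceCount", "stringValue": "20"},
        {"key": "terminateAfter", "stringValue": "2 Hours"},
    ],
}

# Pull the configured core count back out of the field list.
core_count = next(
    f["stringValue"] for f in emr_cluster["fields"]
    if f["key"] == "coreInstanceCount"
)
```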
1
vote
0 answers
Best data pipeline framework
What is the best data pipeline framework that meets the following requirements?
Open source / free to use
Pipelines need to be created in Python (should support GeoPandas, Pandas, NumPy, ...)
Supports manual and time-triggered pipelines
Web…
MartinV
- 13
- 2
1
vote
1 answer
Insert into a SQL Server table using Python from CSV and text files
I am trying to insert data from a CSV file and also from a text file into SQL Server (SSMS version 18.7). Below is my code.
import pyodbc
import csv
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=????;'
                      …
nikhil davis
- 55
- 3
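The usual shape of that script is: read the file with the `csv` module, then bulk-insert with a parameterized `executemany`. A minimal sketch using sqlite3 in place of pyodbc (both are DB-API drivers with the same `?` placeholder style; the table name and columns here are hypothetical):

```python
import csv
import io
import sqlite3

# Stand-in for the pyodbc connection in the question; pyodbc cursors
# expose the same execute/executemany methods and '?' placeholders.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# In the real script this would be open('data.csv', newline='').
csv_data = io.StringIO("name,age\nalice,30\nbob,25\n")
reader = csv.reader(csv_data)
next(reader)  # skip the header row

# Convert types explicitly, then bulk-insert with parameters;
# never build the SQL string by hand from file contents.
rows = [(name, int(age)) for name, age in reader]
cur.executemany("INSERT INTO people (name, age) VALUES (?, ?)", rows)
conn.commit()
```

A plain text file with a delimiter works the same way: pass `delimiter='\t'` (or whatever the file uses) to `csv.reader`.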
1
vote
1 answer
Why did the amount of data from BigQuery decrease noticeably without any change in GA/Firebase options?
I use BigQuery to get raw data from GA and Firebase.
I used to get about 100,000 ~ 200,000 rows of log data from BigQuery.
But since last week, I have been getting only about 1,000 rows from BigQuery.
I didn't change any options for GA,…
Seohyeon Youn
- 11
- 2
1
vote
1 answer
Copy and Extracting Zipped XML files from HTTP Link Source to Azure Blob Storage using Azure Data Factory
I am trying to establish an Azure Data Factory copy data pipeline. The source is an open HTTP linked source (URL reference: https://clinicaltrials.gov/AllPublicXML.zip), so basically the source is a zipped folder containing many XML files. I want…
Aditya Bhattacharya
- 534
- 1
- 5
- 15
1
vote
1 answer
Airflow on Google Cloud Composer vs Docker
I can't find much information on the differences between running Airflow on Google Cloud Composer vs Docker. I am trying to switch our data pipelines, which currently run on Google Cloud Composer, over to Docker so they just run locally, but am trying to…
Erika_Marsha
- 13
- 4
1
vote
1 answer
How should I keep track of total loss while training a network with a batched dataset?
I am attempting to train a discriminator network by applying gradients via its optimizer. However, when I use a tf.GradientTape to find the gradients of the loss w.r.t. the training variables, None is returned. Here is the training loop:
def train_step():
…
Andrew Wiedenmann
- 167
- 1
- 12
1
vote
1 answer
Replication pipeline to replicate data from MySql RDS to Redshift
My problem here is to create a replication pipeline that replicates tables and data from MySQL RDS to Redshift, and I cannot use any managed service. Also, any new updates in RDS should be replicated to the Redshift tables as well.
After looking at…
Anonymous
- 11
- 3
1
vote
0 answers
Google Data Fusion: "Looping" over input data to then execute multiple Restful API calls per input row
I have the following challenge, which I would like to solve, preferably in Google Data Fusion:
I have one web service that returns about 30-50 elements describing an invoice in a JSON payload like this:
{
  "invoice-services": [
    {
      "serviceId":…
JensU
- 11
- 1
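Whatever executes it, the fan-out step itself is simple: flatten the `invoice-services` array and issue one call per element. A sketch of just the per-row request construction (the endpoint URL and any field beyond `serviceId` are assumptions for illustration; no HTTP call is made):

```python
import json

# Example payload shaped like the one in the question; only
# serviceId comes from the question, the values are made up.
payload = json.loads("""
{
  "invoice-services": [
    {"serviceId": "A-100"},
    {"serviceId": "A-101"},
    {"serviceId": "A-102"}
  ]
}
""")

BASE_URL = "https://api.example.com/services"  # hypothetical endpoint

# One REST call per input row: in Data Fusion this is the loop an
# HTTP plugin would perform after a Wrangler step flattens the
# array; here we only build the request URLs.
urls = [
    f"{BASE_URL}/{svc['serviceId']}/details"
    for svc in payload["invoice-services"]
]
```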
1
vote
1 answer
How to import Pascal VOC 2012 segmentation dataset to Google Colab?
I am new to building data pipelines. I want to import the Pascal VOC dataset into Google Colab.
Can someone please point me to a good Google Colab/Jupyter notebook file?