Questions tagged [data-pipeline]
95 questions
18
votes
4 answers
Feeding .npy (numpy files) into tensorflow data pipeline
Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipeline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as…
Sluggish Crow
- 183
- 1
- 1
- 5
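A common approach to the question above is to keep only file paths in the pipeline and load each ".npy" file lazily. The sketch below is runnable with NumPy alone; the TensorFlow wiring is shown in a comment, and the file names and array shape are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Write two tiny .npy files to simulate the on-disk dataset
# (stand-ins for the real per-object files, which do not fit in memory).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    path = os.path.join(tmpdir, f"sample_{i}.npy")
    np.save(path, np.full((3, 4), i, dtype=np.float32))
    paths.append(path)

def npy_generator(file_paths):
    """Lazily load one .npy file at a time, so the full dataset
    never has to be resident in memory."""
    for p in file_paths:
        yield np.load(p)  # reads a single file on demand

# The generator can then be wrapped into a tf.data pipeline, e.g.:
#   ds = tf.data.Dataset.from_generator(
#       lambda: npy_generator(paths),
#       output_signature=tf.TensorSpec(shape=(3, 4), dtype=tf.float32))
batch = list(npy_generator(paths))
print(len(batch), batch[0].shape)
```

Because the generator yields one array per file, memory usage stays bounded by a single object regardless of dataset size.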
13
votes
2 answers
How to access the response from Airflow SimpleHttpOperator GET request
I'm learning Airflow and have a simple question. Below is my DAG called dog_retriever:
import airflow
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.operators.sensors import HttpSensor
from…
Rachel Lanman
- 349
- 1
- 4
- 15
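For the SimpleHttpOperator question above, one common pattern is to let the operator push the response body to XCom (`xcom_push=True` in older Airflow versions) and parse it downstream. Airflow itself is not imported in this sketch; the runnable part is the parsing function, and the DAG wiring, task ids, and connection id in the comments are illustrative assumptions.

```python
import json

def extract_dog_url(response_text):
    """Parse the text the HTTP task would push to XCom.
    The {"message": ...} shape matches the Dog API the
    dog_retriever DAG in the question calls."""
    return json.loads(response_text)["message"]

# In the DAG (sketch only; requires Airflow, not imported here):
#   get_dog = SimpleHttpOperator(
#       task_id="get_dog", http_conn_id="dog_api",
#       endpoint="api/breeds/image/random",
#       xcom_push=True,  # push response.text to XCom
#       dag=dag)
# A downstream PythonOperator callable can then read it with:
#   ti.xcom_pull(task_ids="get_dog")

sample = '{"message": "https://images.dog.ceo/breeds/akita/1.jpg", "status": "success"}'
print(extract_dog_url(sample))
```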
12
votes
0 answers
Is it possible to write a luigi wrapper task that tolerates failed sub-tasks?
I have a luigi task that performs some non-stable computations. Think of an optimization process that sometimes does not converge.
import luigi
class MyOptimizer(luigi.Task):
    input_param = luigi.Parameter()
    output_filename =…
DalyaG
- 2,067
- 2
- 12
- 14
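The tolerance logic the wrapper-task question above asks for can be sketched without Luigi: attempt each sub-task, swallow individual failures, and only fail the wrapper if too few sub-tasks succeeded. In Luigi itself this would live in a wrapper task's `run()` (or a relaxed `complete()`); the helper name and threshold below are assumptions for illustration.

```python
def run_tolerant(tasks, min_success=1):
    """Run each callable sub-task, tolerating individual failures.
    Mirrors what a luigi wrapper task could do: invoke the optimizer
    sub-tasks, record their errors, and only raise if fewer than
    min_success of them produced a result."""
    results, failures = [], []
    for task in tasks:
        try:
            results.append(task())
        except Exception as exc:  # a non-converging optimizer raises here
            failures.append(exc)
    if len(results) < min_success:
        raise RuntimeError(f"only {len(results)} sub-tasks succeeded")
    return results, failures

def converges():
    return 42

def diverges():
    raise ValueError("did not converge")

results, failures = run_tolerant([converges, diverges, converges])
print(len(results), len(failures))
```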
11
votes
1 answer
Implementing luigi dynamic graph configuration
I am new to luigi; I came across it while designing a pipeline for our ML efforts. Though it wasn't fitted to my particular use case, it had so many extra features that I decided to make it fit.
Basically what I was looking for was a way to be able to…
Veltzer Doron
- 843
- 1
- 10
- 28
7
votes
1 answer
Truncate DynamoDb or rewrite data via Data Pipeline
It is possible to dump DynamoDB via Data Pipeline and also to import data into DynamoDB. The import goes well, but the data is always appended to the data already in DynamoDB.
So far I have found working examples that scan DynamoDB and delete items one…
Vladimir Gilevich
- 793
- 8
- 15
5
votes
1 answer
Pipeline from AWS RDS to S3 using Glue
I am trying AWS Glue to migrate our current data pipeline from Python scripts to AWS Glue. I was able to set up a crawler to pull the schema for the different Postgres databases. However, I am facing issues in pulling data from Postgres RDS to S3…
Eshank Jain
- 139
- 2
- 11
3
votes
1 answer
Dataflow with python flex template - launcher timeout
I'm trying to run my Python Dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without a flex template); however, when I try to run it with a flex template, the job is stuck in the "Queued" status for a while and then fails with…
Kazuki
- 1,276
- 11
- 25
3
votes
0 answers
Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0
I am using a Colab Pro TPU instance for the purpose of patch image classification.
I'm using TensorFlow version 2.3.0.
When calling model.fit I get the following error: InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID:…
Pooya448
- 43
- 3
3
votes
2 answers
Undo/rollback the effects of a data processing pipeline
I have a workflow that I'll describe as follows:
[ Dump(query) ] ---+
                   |
                   +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                   |
[ Schema(query) ] ---+
Where:
query is a query to an…
stefanobaghino
- 8,477
- 3
- 28
- 52
3
votes
2 answers
Bulk add ttl column to dynamodb table
I have a use case where I need to add a TTL column to an existing table. Currently, this table has more than 2 billion records.
Is there any existing solution built around this, or is EMR the path forward?
Vivek Goel
- 19,274
- 22
- 97
- 172
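One detail worth pinning down for the TTL question above: DynamoDB TTL expects a Number attribute holding an epoch-seconds timestamp. Computing that value is trivial; the hard part at 2 billion records is the write-back, which is typically done with a parallel scan (each worker handling one `Segment` of `TotalSegments`) from a script or an EMR job. The helper name below is an assumption for illustration.

```python
import time

def ttl_epoch(days_from_now):
    """DynamoDB TTL expects a Number attribute containing epoch seconds;
    items whose TTL attribute is in the past become eligible for expiry."""
    return int(time.time()) + days_from_now * 86400

# Each of N parallel workers w would then scan its own slice with
#   table.scan(Segment=w, TotalSegments=N)
# and write the attribute back via
#   table.update_item(..., UpdateExpression="SET #ttl = :t", ...)
expiry = ttl_epoch(30)
print(expiry > time.time() - 1)
```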
2
votes
0 answers
Feasible Streaming Suggestions | Is it possible to use Apache Nifi + Apache Beam (on Flink Cluster) with Real Time Streaming Data
I am very new to all the Apache frameworks I am trying to use. I want your suggestions on a couple of workflow designs for an IoT streaming application:
As we have NiFi connectors available for Flink, and we can easily use Beam abstraction…
Subham Agrawal
- 35
- 7
2
votes
1 answer
Is there a way in airflow where a Daily DAG is dependent on weekly (on weekends) DAG?
I have these DAGs: DAG_A (runs daily), DAG_B (runs Mon-Fri) and DAG_C (runs on Sat and Sun), where DAG_A is dependent on both DAG_B and DAG_C.
I tried setting the dependencies using ExternalTaskSensor, but every time my scheduler stops running and…
Lalitha
- 21
- 2
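For the cross-schedule dependency question above, the usual sticking point is that ExternalTaskSensor looks for an upstream run at the *same* execution date, which a daily DAG and a weekend-only DAG rarely share. Its `execution_date_fn` parameter lets you remap the date; the pure function below sketches one such mapping (the function name is an assumption for illustration).

```python
from datetime import datetime, timedelta

def most_recent_weekend(dt):
    """Map a daily run's execution date to the latest Saturday or
    Sunday on or before it, usable as an ExternalTaskSensor
    execution_date_fn so DAG_A waits on the right DAG_C run."""
    while dt.weekday() < 5:  # Mon=0 .. Fri=4; stop on Sat(5)/Sun(6)
        dt -= timedelta(days=1)
    return dt

# e.g. a Wednesday run would wait on the preceding Sunday's DAG_C run
print(most_recent_weekend(datetime(2024, 5, 8)))
```

A sibling function mapping to the latest weekday run would serve the DAG_B dependency the same way.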
2
votes
2 answers
Google data fusion Execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."
I am trying to load a simple CSV file from GCS to BQ using the Google Data Fusion free version. The pipeline is failing with an error. It reads:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…
user11953315
- 23
- 3
2
votes
1 answer
Firehose datapipeline limitations
My use-case is as follows:
I have JSON data coming in which needs to be stored in S3 in parquet format. So far so good, I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my firehose stream. BUT the data is coming from…
Dexter
- 1,510
- 2
- 15
- 32
2
votes
1 answer
Workflow orchestration tool compatible with Windows Server 2013?
My current project requires automation and scheduled execution of a number of tasks (copy a file, send an email when a new file arrives in a directory, execute an analytics job, etc). My plan is to write a number of individual shell scripts for each…
Praveen Thirukonda
- 305
- 1
- 3
- 15