Questions tagged [data-pipeline]

95 questions
18 votes, 4 answers

Feeding .npy (numpy files) into tensorflow data pipeline

Tensorflow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory. Each object is saved in a separate ".npy" file, and each file contains 2 different ndarrays as…
Sluggish Crow • 183
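There is indeed no built-in .npy reader in tf.data; a common workaround is to list the filenames and load each file lazily from a Python generator, which tf.data.Dataset.from_generator can wrap. A minimal sketch of the loading side in plain NumPy (the file layout and shapes here are invented for illustration; the tf.data wiring is shown in a comment):

```python
import os
import tempfile

import numpy as np

def npy_example_generator(filenames):
    """Yield one array per file so the full dataset never sits in memory."""
    for path in filenames:
        # mmap_mode="r" keeps the file on disk; only accessed slices are read.
        arr = np.load(path, mmap_mode="r")
        yield np.asarray(arr)

# Demo with two small temporary .npy files (stand-ins for the real dataset).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"sample_{i}.npy")
    np.save(p, np.full((3, 3), i, dtype=np.float32))
    paths.append(p)

batches = list(npy_example_generator(paths))
print(len(batches), batches[0].shape)  # 2 (3, 3)

# With TensorFlow installed, the same generator plugs into tf.data:
#   ds = tf.data.Dataset.from_generator(
#       lambda: npy_example_generator(paths),
#       output_signature=tf.TensorSpec(shape=(3, 3), dtype=tf.float32))
```

Because the generator only opens one file at a time, memory use stays bounded regardless of dataset size.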
13 votes, 2 answers

How to access the response from Airflow SimpleHttpOperator GET request

I'm learning Airflow and have a simple question. Below is my DAG called dog_retriever: import airflow from airflow import DAG from airflow.operators.http_operator import SimpleHttpOperator from airflow.operators.sensors import HttpSensor from…
Rachel Lanman • 349
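The usual answer here is that SimpleHttpOperator pushes the response body to XCom when asked to, and a downstream task pulls it by task id. A sketch in the Airflow 1.x style the question's imports suggest; the task ids and endpoint are illustrative, and this is a DAG fragment that only runs inside an Airflow deployment, not standalone:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.operators.python_operator import PythonOperator

with DAG("dog_retriever",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:

    get_dog = SimpleHttpOperator(
        task_id="get_dog",
        http_conn_id="http_default",      # connection pointing at the API host
        endpoint="api/breeds/image/random",
        method="GET",
        xcom_push=True,                   # store the response text in XCom
    )

    def print_response(**context):
        # Pull what get_dog pushed; the value is the raw response body (text).
        body = context["ti"].xcom_pull(task_ids="get_dog")
        print(body)

    show = PythonOperator(
        task_id="print_response",
        python_callable=print_response,
        provide_context=True,             # Airflow 1.x style
    )

    get_dog >> show
```

In newer Airflow versions the flag is spelled do_xcom_push and provide_context is no longer needed, but the xcom_pull pattern is the same.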
12 votes, 0 answers

Is it possible to write a luigi wrapper task that tolerates failed sub-tasks?

I have a luigi task that performs some unstable computations. Think of an optimization process that sometimes does not converge. import luigi class MyOptimizer(luigi.Task): input_param = luigi.Parameter() output_filename =…
DalyaG • 2,067
11 votes, 1 answer

Implementing luigi dynamic graph configuration

I am new to luigi; I came across it while designing a pipeline for our ML efforts. Though it wasn't suited to my particular use case, it had so many extra features that I decided to make it fit. Basically, what I was looking for was a way to be able to…
Veltzer Doron • 843
7 votes, 1 answer

Truncate DynamoDb or rewrite data via Data Pipeline

It is possible to dump DynamoDB via Data Pipeline and also to import data into DynamoDB. The import works well, but the data is always appended to the data that already exists in DynamoDB. So far I have only found working examples that scan DynamoDB and delete items one…
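A common workaround (not from the question itself) is a Scan followed by BatchWriteItem deletes, which accepts at most 25 requests per call; the chunking part is plain Python, with the boto3 call (assumed) shown in a comment:

```python
def chunks(items, size=25):
    """DynamoDB BatchWriteItem accepts at most 25 write requests per call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Illustrative keys as returned by a Scan projecting only the key attributes.
keys = [{"id": {"S": str(n)}} for n in range(60)]
batches = list(chunks(keys))
print(len(batches), len(batches[-1]))  # 3 10

# With boto3 (assumed available), each batch becomes one call:
#   client.batch_write_item(RequestItems={
#       "my_table": [{"DeleteRequest": {"Key": k}} for k in batch]})
```

For large tables it is usually cheaper to delete and recreate the table than to pay for a full Scan plus per-item deletes.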
5 votes, 1 answer

Pipeline from AWS RDS to S3 using Glue

I was trying AWS Glue to migrate our current data pipeline from Python scripts to AWS Glue. I was able to set up a crawler to pull the schema for the different Postgres databases. However, I am facing issues pulling data from Postgres RDS to S3…
3 votes, 1 answer

Dataflow with python flex template - launcher timeout

I'm trying to run my Python Dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without the flex template); however, when I try to run it with the flex template, the job gets stuck in "Queued" status for a while and then fails with…
3 votes, 0 answers

Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0

I am using a Colab Pro TPU instance for patch image classification. I'm using TensorFlow version 2.3.0. When calling model.fit I get the following error: InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID:…
3 votes, 2 answers

Undo/rollback the effects of a data processing pipeline

I have a workflow that I'll describe as follows:

[ Dump(query) ] ----+
                    |
                    +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                    |
[ Schema(query) ] --+

Where: query is a query to an…
stefanobaghino • 8,477
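Whatever the specific stack, the common way to make a Dump → Parquet → Hive chain undoable is to pair each step with a compensating action and unwind the completed steps in reverse order when a later one fails (a saga, in effect). A framework-agnostic sketch with invented step names:

```python
class Pipeline:
    """Run steps in order; on failure, run the undo hooks of completed steps in reverse."""

    def __init__(self):
        self.steps = []  # (name, run, undo) triples

    def add(self, name, run, undo):
        self.steps.append((name, run, undo))
        return self

    def execute(self):
        done = []
        try:
            for name, run, undo in self.steps:
                run()
                done.append((name, undo))
        except Exception:
            for name, undo in reversed(done):
                undo()  # best-effort compensation, reverse order
            raise

log = []

def fail():
    raise RuntimeError("hive failed")

p = (Pipeline()
     .add("dump", lambda: log.append("dump"), lambda: log.append("undo dump"))
     .add("parquet", lambda: log.append("parquet"), lambda: log.append("undo parquet"))
     .add("hive", fail, lambda: log.append("undo hive")))

try:
    p.execute()
except RuntimeError:
    pass

print(log)  # ['dump', 'parquet', 'undo parquet', 'undo dump']
```

The failed step's own undo is deliberately skipped; only steps that completed get compensated, and the exception is re-raised so the failure stays visible.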
3 votes, 2 answers

Bulk add ttl column to dynamodb table

I have a use case where I need to add a TTL column to an existing table. Currently, this table has more than 2 billion records. Is there any existing solution built around this? Or is EMR the path forward?
Vivek Goel • 19,274
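Whatever runs the backfill, the TTL attribute itself is just epoch seconds (UTC) in a Number column. A small sketch of computing the value; the 90-day window and the attribute wiring in the comment are illustrative, not from the question:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def ttl_epoch(days_from_now: int, now: Optional[datetime] = None) -> int:
    """DynamoDB TTL reads a Number attribute holding epoch seconds (UTC)."""
    now = now or datetime.now(timezone.utc)
    return int((now + timedelta(days=days_from_now)).timestamp())

# Items stamped at a known creation time, expiring 90 days later:
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(ttl_epoch(90, now=created))  # 1711843200

# Writing the attribute to billions of items still needs a parallel Scan
# (Segment/TotalSegments) with a per-item update, or an EMR/Glue job; with
# boto3 the per-item call would be update_item(..., UpdateExpression="SET #t = :v").
```

At 2 billion records a single-threaded scan-and-update loop is impractical, which is why a segmented parallel Scan or an EMR/Glue job is the usual route.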
2 votes, 0 answers

Feasible Streaming Suggestions | Is it possible to use Apache Nifi + Apache Beam (on Flink Cluster) with Real Time Streaming Data

So, I am very, very new to all the Apache frameworks I am trying to use. I want your suggestions on a couple of workflow designs for an IoT streaming application: as we have NiFi connectors available for Flink, and we can easily use the Beam abstraction…
2 votes, 1 answer

Is there a way in airflow where a Daily DAG is dependent on weekly (on weekends) DAG?

I have these DAGs: DAG_A (runs daily), DAG_B (runs Mon-Fri) and DAG_C (runs on Sat and Sun), where DAG_A is dependent on both DAG_B and DAG_C. I tried setting the dependencies using ExternalTaskSensor, but every time my scheduler stops running and…
Lalitha • 21
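One pattern that keeps the sensor from waiting forever (not taken from the question) is an ExternalTaskSensor whose execution_date_fn maps each daily run back to the weekend DAG's most recent run date; the date arithmetic is plain Python and testable without Airflow:

```python
from datetime import datetime, timedelta

def most_recent_weekend_run(execution_date: datetime) -> datetime:
    """Map a daily execution date to the latest Sat/Sun on or before it.

    Meant to be passed as ExternalTaskSensor(execution_date_fn=...) so the
    daily DAG waits on the weekend DAG's most recent scheduled run instead
    of looking for a run on its own date (which usually doesn't exist).
    """
    d = execution_date
    while d.weekday() not in (5, 6):  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

# A Wednesday maps back to the previous Sunday:
print(most_recent_weekend_run(datetime(2024, 1, 10)))  # 2024-01-07 00:00:00
```

The mapped date must match the weekend DAG's actual schedule (including its time-of-day) exactly, or the sensor will still block; that mismatch is the usual cause of the stuck-scheduler symptom described.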
2 votes, 2 answers

Google data fusion Execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."

I am trying to load a simple CSV file from GCS to BQ using the Google Data Fusion free version. The pipeline is failing with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…
2 votes, 1 answer

Firehose datapipeline limitations

My use case is as follows: I have JSON data coming in which needs to be stored in S3 in Parquet format. So far so good: I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my Firehose stream. BUT the data is coming from…
2 votes, 1 answer

Workflow orchestration tool compatible with Windows Server 2013?

My current project requires automation and scheduled execution of a number of tasks (copy a file, send an email when a new file arrives in a directory, execute an analytics job, etc.). My plan is to write a number of individual shell scripts for each…