Questions tagged [aws-data-pipeline]

Use amazon-data-pipeline tag instead

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

67 questions
19 votes • 1 answer

AWS Data Pipeline vs Step Functions

I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL). After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each…
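For the comparison above, a hedged illustration of the Step Functions side: a two-state machine that chains Spark steps through the managed EMR integration. The Amazon States Language definition is embedded as a Python dict, and every cluster ID, ARN, and script path is a placeholder, not something from the question.

    import json
    import boto3

    # Two chained SparkSQL transformations as Step Functions tasks, each using
    # the synchronous EMR "addStep" service integration.
    definition = {
        "StartAt": "Transform1",
        "States": {
            "Transform1": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": "j-XXXXXXXXXXXXX",  # placeholder cluster
                    "Step": {"Name": "transform-1",
                             "ActionOnFailure": "CONTINUE",
                             "HadoopJarStep": {"Jar": "command-runner.jar",
                                               "Args": ["spark-submit",
                                                        "s3://my-bucket/jobs/t1.py"]}},
                },
                "Next": "Transform2",
            },
            "Transform2": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": "j-XXXXXXXXXXXXX",
                    "Step": {"Name": "transform-2",
                             "ActionOnFailure": "CONTINUE",
                             "HadoopJarStep": {"Jar": "command-runner.jar",
                                               "Args": ["spark-submit",
                                                        "s3://my-bucket/jobs/t2.py"]}},
                },
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="sparksql-transforms",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrRole",  # placeholder
    )

Roughly: Data Pipeline bundles scheduling plus managed EMR/EC2 resources, while Step Functions is a generic state machine that delegates resource management to service integrations like the one sketched here.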
4 votes • 1 answer

How to export an AWS DynamoDB table to an S3 Bucket?

I have a DynamoDB table that has 1.5 million records / 2 GB. How do I export this to S3? The AWS Data Pipeline method worked with a small table, but I am facing issues exporting the 1.5-million-record table to my S3 bucket. At my initial…
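A minimal sketch of a hand-rolled alternative, assuming a table named "my-table" and a bucket "my-export-bucket" (both hypothetical): paginate a scan and write each page to S3 as JSON lines, so no single object has to hold all 1.5 million items.

    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    s3 = boto3.client("s3")

    # The scan paginator transparently follows LastEvaluatedKey across pages.
    paginator = dynamodb.get_paginator("scan")
    part = 0
    for page in paginator.paginate(TableName="my-table"):
        body = "\n".join(json.dumps(item, default=str) for item in page["Items"])
        s3.put_object(Bucket="my-export-bucket",
                      Key=f"exports/my-table/part-{part:05d}.json",
                      Body=body.encode("utf-8"))
        part += 1

A full scan at this size consumes read capacity; running it against a table with on-demand capacity, or off-peak, is the usual mitigation.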
3 votes • 0 answers

Import file data from S3 into RDS with transformation steps

I'm a novice AWS user and I'm trying to solve a use case where I need to import data into RDS from a CSV that is dropped into an S3 bucket. I have a CSV file that will be uploaded to an S3 bucket; from there I want to run a custom Python script to…
Jackson • 4,801
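For the S3-to-RDS question above, a hypothetical sketch of the transform-and-load step: read the CSV from S3, apply a row-level transformation, and insert into MySQL. The pymysql client, the bucket, host, credentials, and table are all assumptions, not the asker's setup.

    import csv
    import io

    import boto3
    import pymysql

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="incoming-bucket", Key="uploads/data.csv")
    body = obj["Body"].read().decode("utf-8")

    conn = pymysql.connect(host="mydb.example.rds.amazonaws.com",
                           user="admin", password="REDACTED", database="mydb")
    with conn.cursor() as cur:
        for row in csv.DictReader(io.StringIO(body)):
            # Example transformation: trim the name and normalise the email.
            cur.execute("INSERT INTO users (name, email) VALUES (%s, %s)",
                        (row["name"].strip(), row["email"].lower()))
    conn.commit()
    conn.close()

Wrapped in a Lambda handler triggered by the S3 upload event, this covers the "custom Python script" part of the use case.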
3 votes • 2 answers

Scheduling data extraction from AWS Redshift to S3

I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and the output can be stored…
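A minimal sketch of the UNLOAD route, assuming psycopg2 connectivity to the cluster and an IAM role that Redshift can assume for S3 writes; the host, role ARN, bucket, and query are all placeholders.

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="etl_user", password="REDACTED")
    conn.autocommit = True

    # UNLOAD runs the custom SQL inside Redshift and writes the result
    # directly to S3, compressed and in parallel.
    unload_sql = """
        UNLOAD ('SELECT * FROM sales WHERE sold_at >= DATEADD(day, -1, GETDATE())')
        TO 's3://my-extract-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        CSV GZIP ALLOWOVERWRITE PARALLEL ON;
    """
    with conn.cursor() as cur:
        cur.execute(unload_sql)
    conn.close()

Scheduling is then a separate concern: the same script can run from a cron-style trigger (EventBridge, Data Pipeline, or an orchestrator of choice).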
2 votes • 1 answer

AWS data pipeline name tag option for EC2 resource

I'm running a shell activity on an EC2 resource. Sample JSON for creating the EC2 resource:

    {
      "id" : "MyEC2Resource",
      "type" : "Ec2Resource",
      "actionOnTaskFailure" : "terminate",
      "actionOnResourceFailure" : "retryAll",
      "maximumRetries" : "1",
      …
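One commonly cited approach, sketched here with a placeholder pipeline ID: AWS Data Pipeline passes pipeline-level tags on to the EC2 instances it launches, so tagging the pipeline itself with a Name key can name the resource.

    import boto3

    dp = boto3.client("datapipeline")
    # Pipeline tags propagate to the EC2 instances the pipeline creates.
    dp.add_tags(
        pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder
        tags=[{"key": "Name", "value": "my-shell-activity-worker"}],
    )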
2 votes • 0 answers

Data Pipeline: Stop creating empty file in S3

I am using AWS Data Pipeline to take a backup of RDS table data on a certain condition and store that backup as a CSV file in an S3 bucket. It works fine when there is data to back up, but when there is no data, the pipeline still creates an empty file…
Sachin • 2,517
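For the empty-file question above, a minimal post-processing sketch (an assumption, not part of the asker's pipeline): check the exported object after the run and delete it if it is zero bytes. Bucket and key are placeholders.

    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-backup-bucket", "rds-backups/latest.csv"  # placeholders

    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ContentLength"] == 0:
        s3.delete_object(Bucket=BUCKET, Key=KEY)
        print("removed empty backup file")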
2 votes • 2 answers

AWS Data Pipeline: Issue with permissions S3 Access for IAM role

I'm using the Load S3 data into RDS MySQL table template in AWS Data Pipeline to import CSVs from an S3 bucket into our RDS MySQL. However, I (as an IAM user with full admin rights) run into a warning I can't solve: Object:Ec2Instance - WARNING: Could…
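A hedged sketch of one fix direction: the warning usually concerns the resource role the pipeline's EC2 instance runs as, not the admin user. The inline policy below grants that role read access to the bucket; the role name, policy name, and bucket are placeholders, and the exact actions the template needs may differ.

    import json
    import boto3

    iam = boto3.client("iam")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-csv-bucket",
                         "arn:aws:s3:::my-csv-bucket/*"],
        }],
    }
    iam.put_role_policy(
        RoleName="DataPipelineDefaultResourceRole",
        PolicyName="AllowS3CsvRead",
        PolicyDocument=json.dumps(policy),
    )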
2 votes • 1 answer

Permissions for creating and attaching an EBS volume to an EC2Resource in AWS Data Pipeline

I need more local disk than is available to EC2Resources in an AWS Data Pipeline. The simplest solution seems to be to create and attach an EBS volume. I have added EC2:CreateVolume and EC2:AttachVolume policies to both DataPipelineDefaultRole and…
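A minimal sketch of an inline policy covering the volume operations the question mentions, attached to both default roles as in the asker's setup. The resource scope is deliberately broad for illustration only, and the extra Describe/Detach actions are assumptions about what a full create-attach-detach cycle needs.

    import json
    import boto3

    iam = boto3.client("iam")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ec2:CreateVolume", "ec2:AttachVolume",
                       "ec2:DescribeVolumes", "ec2:DetachVolume"],
            "Resource": "*",  # narrow this in real use
        }],
    }
    for role in ("DataPipelineDefaultRole", "DataPipelineDefaultResourceRole"):
        iam.put_role_policy(RoleName=role, PolicyName="AllowEbsVolumeOps",
                            PolicyDocument=json.dumps(policy))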
1 vote • 1 answer

Data migration from S3 to RDS

I am working on a requirement where I am doing a multipart upload of a CSV file from an on-prem server to an S3 bucket. To achieve this, I create a presigned URL using AWS Lambda and upload the CSV file using this URL. Now, once I have the file in…
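A minimal sketch of the presigned-URL step described above; the bucket, key, and expiry are placeholders.

    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "landing-bucket", "Key": "incoming/data.csv"},
        ExpiresIn=3600,  # one hour
    )
    print(url)  # the on-prem server PUTs the CSV to this URL

    # Note: a true multipart upload needs create_multipart_upload plus one
    # presigned "upload_part" URL per part; the single PUT above is the
    # simpler case.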
1 vote • 0 answers

Which file format is suitable for unstructured data?

I am creating a data repository, more like a data lake, for a NoSQL DB. I have some fields which don't have a proper schema. They have mixed-type objects, e.g. a field value of {a:2} or {b:2, c:4, a: {1,2}}, etc. I can use CSV format so I can save…
Manish Trivedi • 3,165
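On the format question above, a small sketch of why JSON Lines handles mixed-type fields more gracefully than CSV: each record is self-describing, so the schema can vary row by row.

    import json

    records = [
        {"a": 2},
        {"b": 2, "c": 4, "a": [1, 2]},  # nested/mixed value, awkward in CSV
    ]
    # One JSON object per line; readers can consume records independently.
    with open("data.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")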
1 vote • 1 answer

Airflow - Tasks that write files locally (GCS)

I'm in the process of building a few pipelines in Airflow after having spent the last few years using AWS Data Pipeline. I have a couple of questions I'm foggy on and hope for some clarification. For context, I'm using Google Cloud Composer. In…
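A hedged sketch of one common Composer pattern: a task writes to the worker's local /tmp and then uploads with the Google provider's GCSHook. The DAG wiring is omitted and every name is a placeholder.

    import tempfile

    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    def extract_and_upload():
        # Local files on Composer workers are ephemeral, so persist to GCS.
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
            tmp.write("id,value\n1,42\n")  # stand-in for the real extract
            local_path = tmp.name
        GCSHook().upload(bucket_name="my-staging-bucket",
                         object_name="extracts/data.csv",
                         filename=local_path)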
1 vote • 0 answers

Is there a way PigActivity in AWS Data Pipeline can read the schema from Athena tables created on S3 buckets?

I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity) and want these Pig scripts to be able to read data from the S3 buckets where my source data would reside. The on-prem Pig scripts use…
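Athena tables are backed by the Glue Data Catalog, so one hedged workaround is to fetch the column list with boto3 and generate the Pig LOAD statement from it. Database and table names are placeholders, and flattening every column to chararray is a simplification for illustration.

    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]

    columns = table["StorageDescriptor"]["Columns"]
    location = table["StorageDescriptor"]["Location"]  # the backing S3 path
    # Real code would map Glue types to Pig types instead of chararray.
    pig_schema = ", ".join(f"{c['Name']}:chararray" for c in columns)
    print(f"raw = LOAD '{location}' USING PigStorage(',') AS ({pig_schema});")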
1 vote • 0 answers

ShellCommandActivity timing out despite setting 3 hours as the timeout value

I'm using a CloudFormation template to spin up an EC2 instance to execute a shell script. For the EC2 resource, I've specified the terminateAfter value as 3 Hours. Similarly, for the ShellCommandActivity, I've specified the attemptTimeout value as 3…
user795028 • 93
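For reference on the timeout question above, a hedged sketch of how the two fields sit in a put_pipeline_definition call. One point worth checking: attemptTimeout applies per attempt, so retries can end a run earlier or later than expected; pinning maximumRetries makes the behavior easier to reason about. All IDs are placeholders.

    import boto3

    dp = boto3.client("datapipeline")
    dp.put_pipeline_definition(
        pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder
        pipelineObjects=[
            {"id": "MyEc2Resource", "name": "MyEc2Resource", "fields": [
                {"key": "type", "stringValue": "Ec2Resource"},
                {"key": "terminateAfter", "stringValue": "3 Hours"},
            ]},
            {"id": "MyShellCommand", "name": "MyShellCommand", "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "attemptTimeout", "stringValue": "3 Hours"},
                {"key": "maximumRetries", "stringValue": "0"},  # no hidden retries
                {"key": "runsOn", "refValue": "MyEc2Resource"},
                {"key": "command", "stringValue": "./my_script.sh"},
            ]},
        ],
    )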
1 vote • 1 answer

Export existing DynamoDB items to Lambda Function

Is there any AWS managed solution which would allow me to perform what is essentially a data migration, using DynamoDB as the source and a Lambda function as the sink? I'm setting up a Lambda to process DynamoDB streams, and I'd like to be able to…
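If no managed backfill fits, a hedged do-it-yourself sketch: scan the table and invoke the stream-processing Lambda with batches shaped like stream events, so the same handler code serves both paths. The function name, table, and the exact event shape the handler expects are assumptions.

    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    lam = boto3.client("lambda")

    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName="my-table"):
        # Mimic the DynamoDB stream record envelope for existing items.
        records = [{"eventName": "INSERT",
                    "dynamodb": {"NewImage": item}} for item in page["Items"]]
        lam.invoke(FunctionName="my-stream-processor",
                   InvocationType="Event",  # async, like a stream trigger
                   Payload=json.dumps({"Records": records}).encode("utf-8"))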
1 vote • 2 answers

Spark Streaming scheduling best practices

We have a Spark streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking I can schedule an AWS Data Pipeline to run every 30 minutes so that EMR terminates after 15…
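A hedged sketch of the transient-cluster idea from the question: each scheduled run launches EMR, submits the short job, and lets the cluster terminate when the step ends. All names are placeholders, and the scheduler itself (Data Pipeline, EventBridge, etc.) is out of scope here; note that cluster startup time will dwarf a 15-second job, which is the usual argument for a long-running cluster instead.

    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="half-hourly-spark-job",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
            ],
            # Terminate the cluster as soon as the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "micro-batch",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": ["spark-submit",
                                       "s3://my-bucket/jobs/batch.py"]},
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )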