Questions tagged [aws-data-pipeline]

Use amazon-data-pipeline tag instead

Simple service to transfer data between Amazon data storage services, kick off Elastic MapReduce jobs, and connect with outside data services.

67 questions
19 votes • 1 answer

AWS Data Pipeline vs Step Functions

I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL). After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each…
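For the comparison above, a hedged illustration of the Step Functions side: a two-state machine that chains Spark steps through the managed EMR integration. The Amazon States Language definition is embedded as a Python dict, and every cluster ID, ARN, and script path is a placeholder, not something from the question.

    import json
    import boto3

    # Two chained SparkSQL transformations as Step Functions tasks, each using
    # the synchronous EMR "addStep" service integration.
    definition = {
        "StartAt": "Transform1",
        "States": {
            "Transform1": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": "j-XXXXXXXXXXXXX",  # placeholder cluster
                    "Step": {"Name": "transform-1",
                             "ActionOnFailure": "CONTINUE",
                             "HadoopJarStep": {"Jar": "command-runner.jar",
                                               "Args": ["spark-submit",
                                                        "s3://my-bucket/jobs/t1.py"]}},
                },
                "Next": "Transform2",
            },
            "Transform2": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": "j-XXXXXXXXXXXXX",
                    "Step": {"Name": "transform-2",
                             "ActionOnFailure": "CONTINUE",
                             "HadoopJarStep": {"Jar": "command-runner.jar",
                                               "Args": ["spark-submit",
                                                        "s3://my-bucket/jobs/t2.py"]}},
                },
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="sparksql-transforms",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrRole",  # placeholder
    )

Roughly: Data Pipeline bundles scheduling plus managed EMR/EC2 resources, while Step Functions is a generic state machine that delegates resource management to service integrations like the one sketched here.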
4 votes • 1 answer

How to export an AWS DynamoDB table to an S3 Bucket?

I have a DynamoDB table that has 1.5 million records / 2 GB. How do I export this to S3? The AWS Data Pipeline method worked with a small table, but I am facing issues exporting the 1.5-million-record table to my S3 bucket. At my initial…
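A minimal sketch of a hand-rolled alternative, assuming a table named "my-table" and a bucket "my-export-bucket" (both hypothetical): paginate a scan and write each page to S3 as JSON lines, so no single object has to hold all 1.5 million items.

    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    s3 = boto3.client("s3")

    # The scan paginator transparently follows LastEvaluatedKey across pages.
    paginator = dynamodb.get_paginator("scan")
    part = 0
    for page in paginator.paginate(TableName="my-table"):
        body = "\n".join(json.dumps(item, default=str) for item in page["Items"])
        s3.put_object(Bucket="my-export-bucket",
                      Key=f"exports/my-table/part-{part:05d}.json",
                      Body=body.encode("utf-8"))
        part += 1

A full scan at this size consumes read capacity; running it against a table with on-demand capacity, or off-peak, is the usual mitigation.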
3 votes • 0 answers

Import file data from S3 into RDS with transformation steps

I'm a novice AWS user and I'm trying to solve a use case where I need to import data into RDS from a CSV that is dropped into an S3 bucket. I have a CSV file that will be uploaded to an S3 bucket; from there I want to run a custom Python script to…
Jackson • 4,801
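For the S3-to-RDS question above, a hypothetical sketch of the transform-and-load step: read the CSV from S3, apply a row-level transformation, and insert into MySQL. The pymysql client, the bucket, host, credentials, and table are all assumptions, not the asker's setup.

    import csv
    import io

    import boto3
    import pymysql

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="incoming-bucket", Key="uploads/data.csv")
    body = obj["Body"].read().decode("utf-8")

    conn = pymysql.connect(host="mydb.example.rds.amazonaws.com",
                           user="admin", password="REDACTED", database="mydb")
    with conn.cursor() as cur:
        for row in csv.DictReader(io.StringIO(body)):
            # Example transformation: trim the name and normalise the email.
            cur.execute("INSERT INTO users (name, email) VALUES (%s, %s)",
                        (row["name"].strip(), row["email"].lower()))
    conn.commit()
    conn.close()

Wrapped in a Lambda handler triggered by the S3 upload event, this covers the "custom Python script" part of the use case.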
3 votes • 2 answers

Scheduling data extraction from AWS Redshift to S3

I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and the output can be stored…
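A minimal sketch of the UNLOAD route, assuming psycopg2 connectivity to the cluster and an IAM role that Redshift can assume for S3 writes; the host, role ARN, bucket, and query are all placeholders.

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="etl_user", password="REDACTED")
    conn.autocommit = True

    # UNLOAD runs the custom SQL inside Redshift and writes the result
    # directly to S3, compressed and in parallel.
    unload_sql = """
        UNLOAD ('SELECT * FROM sales WHERE sold_at >= DATEADD(day, -1, GETDATE())')
        TO 's3://my-extract-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        CSV GZIP ALLOWOVERWRITE PARALLEL ON;
    """
    with conn.cursor() as cur:
        cur.execute(unload_sql)
    conn.close()

Scheduling is then a separate concern: the same script can run from a cron-style trigger (EventBridge, Data Pipeline, or an orchestrator of choice).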
2 votes • 1 answer

AWS data pipeline name tag option for EC2 resource

I'm running a shell activity on an EC2 resource. Sample JSON for creating the EC2 resource:

    {
      "id" : "MyEC2Resource",
      "type" : "Ec2Resource",
      "actionOnTaskFailure" : "terminate",
      "actionOnResourceFailure" : "retryAll",
      "maximumRetries" : "1",
      …
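One commonly cited approach, sketched here with a placeholder pipeline ID: AWS Data Pipeline passes pipeline-level tags on to the EC2 instances it launches, so tagging the pipeline itself with a Name key can name the resource.

    import boto3

    dp = boto3.client("datapipeline")
    # Pipeline tags propagate to the EC2 instances the pipeline creates.
    dp.add_tags(
        pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder
        tags=[{"key": "Name", "value": "my-shell-activity-worker"}],
    )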
2 votes • 0 answers

Data Pipeline: Stop creating empty file in S3

I am using AWS Data Pipeline to take a backup of RDS table data on a certain condition and store that backup as a CSV file in an S3 bucket. It works fine when there is data to back up, but when there is no data, the pipeline still creates an empty file…
Sachin • 2,517
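For the empty-file question above, a minimal post-processing sketch (an assumption, not part of the asker's pipeline): check the exported object after the run and delete it if it is zero bytes. Bucket and key are placeholders.

    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-backup-bucket", "rds-backups/latest.csv"  # placeholders

    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ContentLength"] == 0:
        s3.delete_object(Bucket=BUCKET, Key=KEY)
        print("removed empty backup file")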
2 votes • 2 answers

AWS Data Pipeline: Issue with permissions S3 Access for IAM role

I'm using the Load S3 data into RDS MySQL table template in AWS Data Pipeline to import CSVs from an S3 bucket into our RDS MySQL. However, I (as an IAM user with full admin rights) run into a warning I can't solve: Object:Ec2Instance - WARNING: Could…
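A hedged sketch of one fix direction: the warning usually concerns the resource role the pipeline's EC2 instance runs as, not the admin user. The inline policy below grants that role read access to the bucket; the role name, policy name, and bucket are placeholders, and the exact actions the template needs may differ.

    import json
    import boto3

    iam = boto3.client("iam")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-csv-bucket",
                         "arn:aws:s3:::my-csv-bucket/*"],
        }],
    }
    iam.put_role_policy(
        RoleName="DataPipelineDefaultResourceRole",
        PolicyName="AllowS3CsvRead",
        PolicyDocument=json.dumps(policy),
    )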
2 votes • 1 answer

Permissions for creating and attaching an EBS volume to an EC2Resource in AWS Data Pipeline

I need more local disk than is available to EC2Resources in an AWS Data Pipeline. The simplest solution seems to be to create and attach an EBS volume. I have added EC2:CreateVolume and EC2:AttachVolume policies to both DataPipelineDefaultRole and…
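A minimal sketch of an inline policy covering the volume operations the question mentions, attached to both default roles as in the asker's setup. The resource scope is deliberately broad for illustration only, and the extra Describe/Detach actions are assumptions about what a full create-attach-detach cycle needs.

    import json
    import boto3

    iam = boto3.client("iam")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ec2:CreateVolume", "ec2:AttachVolume",
                       "ec2:DescribeVolumes", "ec2:DetachVolume"],
            "Resource": "*",  # narrow this in real use
        }],
    }
    for role in ("DataPipelineDefaultRole", "DataPipelineDefaultResourceRole"):
        iam.put_role_policy(RoleName=role, PolicyName="AllowEbsVolumeOps",
                            PolicyDocument=json.dumps(policy))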
1 vote • 1 answer

Data migration from S3 to RDS

I am working on a requirement where I am doing a multipart upload of a CSV file from an on-prem server to an S3 bucket. To achieve this, I create a presigned URL using AWS Lambda and upload the CSV file using this URL. Now, once I have the file in…
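A minimal sketch of the presigned-URL step described above; the bucket, key, and expiry are placeholders.

    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "landing-bucket", "Key": "incoming/data.csv"},
        ExpiresIn=3600,  # one hour
    )
    print(url)  # the on-prem server PUTs the CSV to this URL

    # Note: a true multipart upload needs create_multipart_upload plus one
    # presigned "upload_part" URL per part; the single PUT above is the
    # simpler case.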
1 vote • 0 answers

Which file format is suitable for unstructured data?

I am creating a data repository, more like a data lake, for a NoSQL DB. I have some fields which don't have a proper schema. They have mixed-type objects, e.g. a field value of {a:2} or {b:2, c:4, a: {1,2}}, etc. I can use CSV format so I can save…
Manish Trivedi • 3,165
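On the format question above, a small sketch of why JSON Lines handles mixed-type fields more gracefully than CSV: each record is self-describing, so the schema can vary row by row.

    import json

    records = [
        {"a": 2},
        {"b": 2, "c": 4, "a": [1, 2]},  # nested/mixed value, awkward in CSV
    ]
    # One JSON object per line; readers can consume records independently.
    with open("data.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")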
1 vote • 1 answer

Airflow - Tasks that write files locally (GCS)

I'm in the process of building a few pipelines in Airflow after having spent the last few years using AWS Data Pipeline. I have a couple of questions I'm foggy on and hope for some clarification. For context, I'm using Google Cloud Composer. In…
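A hedged sketch of one common Composer pattern: a task writes to the worker's local /tmp and then uploads with the Google provider's GCSHook. The DAG wiring is omitted and every name is a placeholder.

    import tempfile

    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    def extract_and_upload():
        # Local files on Composer workers are ephemeral, so persist to GCS.
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
            tmp.write("id,value\n1,42\n")  # stand-in for the real extract
            local_path = tmp.name
        GCSHook().upload(bucket_name="my-staging-bucket",
                         object_name="extracts/data.csv",
                         filename=local_path)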
1 vote • 0 answers

Is there a way PigActivity in AWS Data Pipeline can read the schema from Athena tables created on S3 buckets?

I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity) and want these Pig scripts to be able to read data from the S3 buckets where my source data would reside. The on-prem Pig scripts use…
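Athena tables are backed by the Glue Data Catalog, so one hedged workaround is to fetch the column list with boto3 and generate the Pig LOAD statement from it. Database and table names are placeholders, and flattening every column to chararray is a simplification for illustration.

    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]

    columns = table["StorageDescriptor"]["Columns"]
    location = table["StorageDescriptor"]["Location"]  # the backing S3 path
    # Real code would map Glue types to Pig types instead of chararray.
    pig_schema = ", ".join(f"{c['Name']}:chararray" for c in columns)
    print(f"raw = LOAD '{location}' USING PigStorage(',') AS ({pig_schema});")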
1 vote • 0 answers

ShellCommandActivity timing out despite setting 3 hours as the timeout value

I'm using a CloudFormation template to spin up an EC2 instance to execute a shell script. For the EC2 resource, I've specified the terminateAfter value as 3 Hours. Similarly, for the ShellCommandActivity, I've specified the attemptTimeout value as 3…
user795028 • 93
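For reference on the timeout question above, a hedged sketch of how the two fields sit in a put_pipeline_definition call. One point worth checking: attemptTimeout applies per attempt, so retries can end a run earlier or later than expected; pinning maximumRetries makes the behavior easier to reason about. All IDs are placeholders.

    import boto3

    dp = boto3.client("datapipeline")
    dp.put_pipeline_definition(
        pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder
        pipelineObjects=[
            {"id": "MyEc2Resource", "name": "MyEc2Resource", "fields": [
                {"key": "type", "stringValue": "Ec2Resource"},
                {"key": "terminateAfter", "stringValue": "3 Hours"},
            ]},
            {"id": "MyShellCommand", "name": "MyShellCommand", "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "attemptTimeout", "stringValue": "3 Hours"},
                {"key": "maximumRetries", "stringValue": "0"},  # no hidden retries
                {"key": "runsOn", "refValue": "MyEc2Resource"},
                {"key": "command", "stringValue": "./my_script.sh"},
            ]},
        ],
    )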
1 vote • 1 answer

Export existing DynamoDB items to Lambda Function

Is there any AWS managed solution which would allow me to perform what is essentially a data migration, using DynamoDB as the source and a Lambda function as the sink? I'm setting up a Lambda to process DynamoDB streams, and I'd like to be able to…
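If no managed backfill fits, a hedged do-it-yourself sketch: scan the table and invoke the stream-processing Lambda with batches shaped like stream events, so the same handler code serves both paths. The function name, table, and the exact event shape the handler expects are assumptions.

    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    lam = boto3.client("lambda")

    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName="my-table"):
        # Mimic the DynamoDB stream record envelope for existing items.
        records = [{"eventName": "INSERT",
                    "dynamodb": {"NewImage": item}} for item in page["Items"]]
        lam.invoke(FunctionName="my-stream-processor",
                   InvocationType="Event",  # async, like a stream trigger
                   Payload=json.dumps({"Records": records}).encode("utf-8"))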
1 vote • 2 answers

Spark Streaming scheduling best practices

We have a Spark streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking I can schedule an AWS Data Pipeline to run every 30 minutes so that EMR terminates after 15…
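A hedged sketch of the transient-cluster idea from the question: each scheduled run launches EMR, submits the short job, and lets the cluster terminate when the step ends. All names are placeholders, and the scheduler itself (Data Pipeline, EventBridge, etc.) is out of scope here; note that cluster startup time will dwarf a 15-second job, which is the usual argument for a long-running cluster instead.

    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="half-hourly-spark-job",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
            ],
            # Terminate the cluster as soon as the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "micro-batch",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": ["spark-submit",
                                       "s3://my-bucket/jobs/batch.py"]},
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )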