
I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue cannot run custom SQL against Redshift. I know we can run UNLOAD commands and store the results in S3 directly. I am looking for a solution that can be parameterised and scheduled in AWS.

  • Use a shell script with the UNLOAD command and a config file to parameterize it. Set up a cron job to run it on a schedule, and use an EC2 instance to host all of this. – hadooper Nov 15 '17 at 15:44

2 Answers


Consider using AWS Data Pipeline for this.

AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example extracting data from Redshift to S3. You can schedule a pipeline to run as often as you require, e.g. daily.

A pipeline is defined by you, and you can even version-control it. You can prepare a pipeline definition in a browser using the Data Pipeline Architect, or compose it as a JSON file locally on your computer. A pipeline definition is composed of components, such as a Redshift database, an S3 node and an SQL activity, as well as parameters, for example to specify the S3 path to use for the extracted data.

The AWS Data Pipeline service handles scheduling, dependencies between the components in your pipeline, monitoring and error handling.

For your specific use case, I would consider the following options:

Option 1

Define a pipeline with the following components: SqlDataNode and S3DataNode. The SqlDataNode references your Redshift database and the SELECT query used to extract your data. The S3DataNode points to the S3 path used to store your data. You add a CopyActivity to copy data from the SqlDataNode to the S3DataNode. When such a pipeline runs, it retrieves data from Redshift using the SqlDataNode and copies that data to the S3DataNode using the CopyActivity. The S3 path in the S3DataNode can be parameterised so it is different every time the pipeline runs.
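
A minimal sketch of what such a pipeline definition might look like as a JSON file is shown below. The cluster ID, database credentials, table name, bucket paths and object names are all placeholders to adapt to your environment, and the fields follow the Data Pipeline object reference linked at the end of this answer; treat it as a starting point rather than a drop-in definition.

```json
{
  "objects": [
    {
      "id": "Default",
      "scheduleType": "cron",
      "schedule": { "ref": "DailySchedule" },
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://my-bucket/datapipeline-logs/"
    },
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id": "RedshiftCluster",
      "type": "RedshiftDatabase",
      "clusterId": "my-redshift-cluster",
      "databaseName": "mydb",
      "username": "my_user",
      "*password": "my_password"
    },
    {
      "id": "SourceData",
      "type": "SqlDataNode",
      "database": { "ref": "RedshiftCluster" },
      "table": "my_table",
      "selectQuery": "SELECT col1, col2 FROM my_table"
    },
    {
      "id": "DestinationS3",
      "type": "S3DataNode",
      "directoryPath": "s3://my-bucket/exports/#{format(@scheduledStartTime, 'YYYY-MM-dd')}/"
    },
    {
      "id": "RedshiftToS3Copy",
      "type": "CopyActivity",
      "input": { "ref": "SourceData" },
      "output": { "ref": "DestinationS3" },
      "runsOn": { "ref": "Ec2Instance" }
    },
    {
      "id": "Ec2Instance",
      "type": "Ec2Resource",
      "instanceType": "t2.micro",
      "terminateAfter": "1 Hour"
    }
  ]
}
```

The #{format(@scheduledStartTime, 'YYYY-MM-dd')} expression in directoryPath is what makes the S3 output path different on every scheduled run.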

Option 2

Firstly, define an SQL query with an UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload that file to S3. Use a SqlActivity component to specify the SQL query to execute against the Redshift database. The SQL query in the SqlActivity can be a reference to the S3 path where you stored your query (optionally), or the query text itself. Whenever the pipeline runs, it connects to Redshift and executes the SQL query, which stores the data in S3.

Constraint of option 2: in the UNLOAD statement, the S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify the UNLOAD statement to use another S3 path every time you run it, which is not an out-of-the-box feature.
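
As a rough sketch, the relevant activity for option 2 could look like the object below. It assumes the same RedshiftCluster and Ec2Instance objects as in the option 1 sketch, and the IAM role ARN, bucket and table names are placeholders.

```json
{
  "id": "UnloadToS3",
  "type": "SqlActivity",
  "database": { "ref": "RedshiftCluster" },
  "runsOn": { "ref": "Ec2Instance" },
  "script": "UNLOAD ('SELECT col1, col2 FROM my_table') TO 's3://my-bucket/exports/my_table_' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole' DELIMITER ',' GZIP ALLOWOVERWRITE;"
}
```

If you prefer to keep the query in S3, the scriptUri field can point to an S3 object containing the same UNLOAD statement instead of embedding it in script.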

Where do these pipelines run?

On an EC2 instance with Task Runner, a tool provided by AWS to run data pipelines. You can have that instance started automatically when the pipeline runs, or you can reference an already running instance with Task Runner installed on it. You have to make sure that the EC2 instance is allowed to connect to your Redshift database.
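
These two styles are expressed differently in the pipeline definition. As a hedged illustration (all names are placeholders), an activity either spins up a managed resource via runsOn or targets an existing instance through a worker group:

```json
[
  {
    "id": "UnloadOnManagedInstance",
    "type": "SqlActivity",
    "database": { "ref": "RedshiftCluster" },
    "script": "UNLOAD ('SELECT col1 FROM my_table') TO 's3://my-bucket/exports/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRole';",
    "runsOn": { "ref": "Ec2Instance" }
  },
  {
    "id": "UnloadOnExistingInstance",
    "type": "SqlActivity",
    "database": { "ref": "RedshiftCluster" },
    "script": "UNLOAD ('SELECT col1 FROM my_table') TO 's3://my-bucket/exports/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRole';",
    "workerGroup": "my-worker-group"
  }
]
```

In the second case, Task Runner on your long-running instance has to be started with a matching worker group (see the Task Runner documentation linked below), and that instance must be able to reach both Redshift and S3.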

Relevant documentation:

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html

Pawel

I think Pawel has answered this correctly; I'm just adding details on option two for anyone who wants to implement it:

  1. Go to "Data Pipeline" from the AWS console
  2. Click "New Pipeline" in the top right corner of the page
  3. Copy this JSON file into your favorite editor, update every field marked "$NEED_TO_UPDATE_THIS_WITH_YOURS" with the correct value for your AWS environment, and save it as data_pipeline_template.json somewhere on your computer (a rough skeleton of such a template is sketched after this list)
  4. Go back to the AWS console, click "Load Local File" for the source field, and upload the JSON file
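
The template file itself is not reproduced here; as a hypothetical sketch of the kind of objects it contains, using the same "$NEED_TO_UPDATE_THIS_WITH_YOURS" placeholder convention, it might look roughly like this:

```json
{
  "objects": [
    {
      "id": "Default",
      "scheduleType": "ondemand",
      "role": "$NEED_TO_UPDATE_THIS_WITH_YOURS",
      "resourceRole": "$NEED_TO_UPDATE_THIS_WITH_YOURS",
      "pipelineLogUri": "$NEED_TO_UPDATE_THIS_WITH_YOURS"
    },
    {
      "id": "RedshiftDatabase",
      "type": "RedshiftDatabase",
      "connectionString": "$NEED_TO_UPDATE_THIS_WITH_YOURS",
      "databaseName": "$NEED_TO_UPDATE_THIS_WITH_YOURS",
      "username": "$NEED_TO_UPDATE_THIS_WITH_YOURS",
      "*password": "$NEED_TO_UPDATE_THIS_WITH_YOURS"
    },
    {
      "id": "Ec2Resource",
      "type": "Ec2Resource",
      "instanceType": "t2.micro",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "UnloadActivity",
      "type": "SqlActivity",
      "database": { "ref": "RedshiftDatabase" },
      "runsOn": { "ref": "Ec2Resource" },
      "script": "UNLOAD ('$NEED_TO_UPDATE_THIS_WITH_YOURS') TO '$NEED_TO_UPDATE_THIS_WITH_YOURS' IAM_ROLE '$NEED_TO_UPDATE_THIS_WITH_YOURS';"
    }
  ]
}
```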

If you are not able to upload it because you get some error related to your database instances etc., then follow these steps:

  1. Go to "Data Pipeline" from the AWS console
  2. Click "New Pipeline" in the top right corner of the page
  3. Populate all the fields manually
  4. Click "Edit in Architect" at the bottom of the page
  5. Implement the same activities and resources as in the JSON template; again, make sure you are adding the correct values, such as your database JDBC connection string


grepit