
We currently generate a daily CSV export that we upload to an S3 bucket with the following structure:

<report-name>
|--reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz

We want to be able to run reports partitioned by daily export.

According to this page, you can partition data in Redshift Spectrum by a key derived from the source S3 folder that your Spectrum table reads its data from. However, from the example, it looks like you need a separate ALTER statement for each partition:

alter table spectrum.sales_part
add partition(saledate='2008-01-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';

alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-02/';

Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?

GoatInTheMachine

2 Answers


Solution 1:

At most 20,000 partitions can be created per table. You can run a one-time script that adds partitions (up to that 20k limit) for all future S3 partition folders in advance.

For example, even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:

alter table spectrum.sales_part
add partition(saledate='2017-12-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
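
The one-time script this answer describes can be very small. Here is a minimal Python sketch that just prints one ALTER statement per future monthly folder, reusing the table, column, and bucket names from the example above (it only generates SQL text, so it is safe to run and inspect first):

# Minimal sketch: print "add partition" statements for future monthly
# folders. Table, column and bucket names come from the example above.
from datetime import date

def monthly_partition_sql(start_year, years_ahead):
    statements = []
    for year in range(start_year, start_year + years_ahead):
        for month in range(1, 13):
            first_day = date(year, month, 1).isoformat()  # e.g. '2017-12-01'
            folder = f"{year}-{month:02d}"                # e.g. '2017-12'
            statements.append(
                "alter table spectrum.sales_part\n"
                f"add partition(saledate='{first_day}')\n"
                f"location 's3://bucket/tickit/spectrum/sales_partition/saledate={folder}/';"
            )
    return statements

# Three years ahead = 36 partitions, far below the stated 20k limit.
for sql in monthly_partition_sql(2017, 3):
    print(sql)

Run the printed statements against Redshift once; as the answer notes, partitions whose folders don't exist yet are harmless and should simply return no rows until data arrives.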

Solution 2:

Automate partition creation during data lake ingestion, as described in this AWS Big Data blog post on automatically partitioning Hive external tables:

https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/

Sumit Saurabh
  • That's perfect, I have no idea why I didn't think of that, thank you – GoatInTheMachine Nov 10 '17 at 14:46
  • The docs for this also say "You can add multiple partitions ... if you use the AWS Glue catalog, you can add up to 100 partitions in a single statement", simply by repeating the 2nd and 3rd lines of your example for each partition. https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html#c-spectrum-external-tables-partitioning – Davos Aug 05 '19 at 13:25
  • I can't find any info on a maximum number of partitions in the AWS docs. Could you link? Update: I created 20,001 partitions for a table without any problems. Maybe the 20k limit is deprecated. – Karl Anka Nov 04 '20 at 10:12

Another, more precise way to go about it: create a Lambda function that is triggered by the ObjectCreated notification from the S3 bucket and runs the SQL to add the partition:

alter table tblname
add if not exists partition (partition clause)
location 's3://mybucket/location';
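
A minimal sketch of what that Lambda handler could look like, assuming a Hive-style key layout like the sales_part example from the other answer and using boto3's Redshift Data API client to run the statement (the environment variable names are placeholders, not part of the original answer):

import os
import boto3

# Assumption: the ALTER statement is run via the Redshift Data API.
redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # Fired by the S3 ObjectCreated notification. Assumes keys such as
    # tickit/spectrum/sales_partition/saledate=2017-12/part0.csv.gz.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    folder = key.rsplit("/", 1)[0]             # drop the file name
    month = folder.rsplit("saledate=", 1)[1]   # e.g. '2017-12'
    sql = (
        "alter table spectrum.sales_part "
        f"add if not exists partition(saledate='{month}-01') "
        f"location 's3://{bucket}/{folder}/';"
    )
    # CLUSTER_ID, DATABASE and DB_USER are placeholder settings.
    redshift_data.execute_statement(
        ClusterIdentifier=os.environ["CLUSTER_ID"],
        Database=os.environ["DATABASE"],
        DbUser=os.environ["DB_USER"],
        Sql=sql,
    )

Thanks to the IF NOT EXISTS clause, the handler is safe to fire once per uploaded part file: only the first object in a new folder actually creates the partition.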

Vinayak Thatte