10

I would like to unload data files from Amazon Redshift to Amazon S3 in Apache Parquet format in order to query the files on S3 using Redshift Spectrum. I have searched everywhere, but I couldn't find anything about how to unload files from Amazon Redshift to S3 in Parquet format. Is this feature not supported yet, or was I just unable to find any documentation about it? Could somebody who has worked on it shed some light on this? Thank you.

Teja

5 Answers

10

Redshift UNLOAD to the Parquet file format has been supported since December 2019:

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET

It is mentioned in the Redshift features announcements, the UNLOAD documentation has been updated accordingly, and an example is provided in the UNLOAD examples documentation.

Excerpt from the official documentation:

The following example unloads the LINEITEM table in Parquet format, partitioned by the l_shipdate column.

unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);

Assuming four slices, the resulting Parquet files are dynamically partitioned into various folders.

s3://mybucket/lineitem/l_shipdate=1992-01-02/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-03/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-04/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
secdatabase
  • Hi secdatabase, welcome to StackOverflow. Linking to external sites can be useful, but please include a quick summary of the important points from them in case the link breaks in the future. – Brydenr Dec 04 '19 at 16:22
7

A bit late, but Spectrify does exactly this.

Colin Nichols
  • do you have any documentation on how to configure s3 paths and aws credentials for spectrify from python script? – Viv May 08 '19 at 13:39
3

You can't do this. Redshift doesn't know about Parquet (although you can read Parquet files through the Spectrum abstraction).

You can UNLOAD to text files. They can be encrypted or zipped, but they are only ever flat text files.
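
As a minimal sketch of such a text-format UNLOAD (the table name, bucket path, and IAM role ARN below are placeholders, not values from the original answer), it might look like this:

UNLOAD ('select * from my_table')
TO 's3://mybucket/my_table_'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|'   -- pipe-delimited flat text output
GZIP;           -- compress each output part file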


Looks like this is now supported:

https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/
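
Note that the announcement is about loading into Redshift (COPY) rather than unloading. For completeness, a hedged sketch of such a Parquet load might look like the following; the table name, bucket path, and role ARN are assumed placeholders:

COPY my_table
FROM 's3://mybucket/parquet-files/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;  -- COPY accepts Parquet (and ORC) as a source format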

Kirk Broadhurst
1

Unfortunately, so far, Amazon Redshift has not extended its ability to read the Parquet format natively.

However, you can do one of the following:

  1. Use Redshift Spectrum to read them (see the sketch after this list).
  2. Use an AWS Glue crawler and ETL job to convert them for you.
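
As a rough sketch of the first option (the schema, table, columns, bucket path, and role ARN below are all hypothetical placeholders), you can expose the Parquet files as an external table and query it from Redshift:

-- One-time setup: an external schema backed by the Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table pointing at the Parquet files in S3
CREATE EXTERNAL TABLE spectrum_schema.my_parquet_table (
    id         BIGINT,
    created_at DATE
)
STORED AS PARQUET
LOCATION 's3://mybucket/parquet-files/';

-- Query it like any other table
SELECT count(*) FROM spectrum_schema.my_parquet_table;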

As of today, there is no out-of-the-box support in Redshift itself for writing Apache Parquet.

I hope this helps.

0

Spectrify is a great solution that does this, but if you don't want to do it using AWS services, you could use Spark on EMR with the Databricks spark-redshift library to read data from Redshift and write it to S3 in Parquet format.

The following link will give you an idea of how to do the same:

https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

Mukund