10

I would like to unload data files from Amazon Redshift to Amazon S3 in Apache Parquet format in order to query the files on S3 using Redshift Spectrum. I have searched everywhere, but I couldn't find anything about how to unload files from Amazon Redshift to S3 in Parquet format. Is this feature not supported yet, or was I just unable to find any documentation about it? Could somebody who has worked on it shed some light on this? Thank you.

Teja

5 Answers

10

Redshift UNLOAD to the Parquet file format has been supported since December 2019:

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET

It is mentioned in the Redshift features announcements, the UNLOAD documentation has been updated accordingly, and an example is provided in the UNLOAD examples documentation.

Excerpt from the official documentation:

The following example unloads the LINEITEM table in Parquet format, partitioned by the l_shipdate column.

unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);

Assuming four slices, the resulting Parquet files are dynamically partitioned into various folders.

s3://mybucket/lineitem/l_shipdate=1992-01-02/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-03/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-04/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
secdatabase
  • Hi secdatabase, welcome to StackOverflow. Linking to external sites can be useful, but please include a quick summary of the important points from them in case the link breaks in the future. – Brydenr Dec 04 '19 at 16:22
7

A bit late, but Spectrify does exactly this.

Colin Nichols
  • do you have any documentation on how to configure s3 paths and aws credentials for spectrify from python script? – Viv May 08 '19 at 13:39
3

You can't do this. Redshift doesn't know about Parquet (although you can read Parquet files through the Spectrum abstraction).

You can UNLOAD to text files. They can be encrypted or zipped, but they are only ever flat text files.
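
As a minimal sketch of such a text-format UNLOAD (the table name, bucket path, and IAM role ARN below are placeholders, not values from the original answer), it might look like this:

UNLOAD ('select * from my_table')
TO 's3://mybucket/my_table_'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|'   -- pipe-delimited flat text output
GZIP;           -- compress each output part file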


Looks like this is now supported:

https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/
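
Note that the announcement is about loading into Redshift (COPY) rather than unloading. For completeness, a hedged sketch of such a Parquet load might look like the following; the table name, bucket path, and role ARN are assumed placeholders:

COPY my_table
FROM 's3://mybucket/parquet-files/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;  -- COPY accepts Parquet (and ORC) as a source format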

Kirk Broadhurst
1

Unfortunately, so far, Amazon Redshift has not extended its ability to read the Parquet format natively.

However, you can do one of the following:

  1. Use Redshift Spectrum to read them (see the sketch after this list).
  2. Use an AWS Glue crawler and ETL job to convert them for you.
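
As a rough sketch of the first option (the schema, table, columns, bucket path, and role ARN below are all hypothetical placeholders), you can expose the Parquet files as an external table and query it from Redshift:

-- One-time setup: an external schema backed by the Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table pointing at the Parquet files in S3
CREATE EXTERNAL TABLE spectrum_schema.my_parquet_table (
    id         BIGINT,
    created_at DATE
)
STORED AS PARQUET
LOCATION 's3://mybucket/parquet-files/';

-- Query it like any other table
SELECT count(*) FROM spectrum_schema.my_parquet_table;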

As of today, there is no out-of-the-box support in Redshift itself for writing Apache Parquet.

I hope this helps.

0

Spectrify is a great solution that does this, but if you don't want to do it using AWS services, you could use Spark on EMR with the Databricks spark-redshift library to read data from Redshift and write it to S3 in Parquet format.

The following link will give you an idea of how to do the same:

https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

Mukund