5

I have a bunch of Parquet files on S3, i want to load them into redshift in most optimal way.

Each file is split into multiple chunks......what is the most optimal way to load data from S3 into Redshift?

Also, how do you create the target table definition in Redshift? Is there a way to infer schema from Parquet and create table programatically? I believe there is a way to do this using Redshift spectrum, but i want to know if this can be done in scripting.

Appreciate your help!

I am considering all AWS tools such as Glue, Lambda etc to do this the most optimal way(in terms of performance, security and cost).

Richard
  • 291
  • 1
  • 4
  • 19
  • if youre planning to use glue,then create a and use the glue metastore directly in spectrum.The glue crawler is helpful if you want to update the metastore peridically – theDbGuy Sep 14 '18 at 12:15

2 Answers2

6

The Amazon Redshift COPY command can natively load Parquet files by using the parameter:

FORMAT AS PARQUET

See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats

The table must be pre-created; it cannot be created automatically.

Also note from COPY from Columnar Data Formats - Amazon Redshift:

COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.

John Rotenstein
  • 165,783
  • 13
  • 223
  • 298
  • thank you John for your reply, in my case i only have the Parquet file, how do i identify the schema? I did the "Glue crawler", but not sure if i can automate this and export the crawler output to postgres compatible CREATE TABLE statement..... – Richard Sep 06 '18 at 02:09
  • I guess you'll need to [find a tool](https://www.google.com/search?q=extract+schema+from+parquet+file) that can extract the schema. – John Rotenstein Sep 06 '18 at 02:19
1

use parquet-tools from GitHub to dissect the file :

parquet-tool schema <filename> #will dump the schema w/datatypes
parquet-tool head <filename> #will dump the first 5 data structures

Use the jsonpaths file to specify mappings

toshiro92
  • 1,149
  • 4
  • 24
  • 38
grimgaunt
  • 11
  • 2