Given a data-source of 1.4 TB of Parquet data on S3 partitioned by a timestamp field (so partitions are year
- month
- day
) I am querying a specific day of data (2.6 GB of data) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue Crawler that points at the top level "folder" in S3; this creates a Database and then via this command I link the Database to the new external schema:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query from Redshift I see it was scanned ~86 GB of data instead.
How's that possible? It is a concern because Redshift bills based on the amount of data scanned and looks like the service is scanning around 40 times the actual amount of data is in that partition.
I also tried to execute the same query in Athena and there I get only 2.55 GB of data scanned (definitely more reasonable).
I can't give too many details on the cluster size but assume that those 86GB of scanned data would fit in the cluster's Memory.