We use UNLOAD commands to run transformations on S3-based external tables and publish the results to a different S3 bucket in PARQUET format.
I use the ALLOWOVERWRITE option in the UNLOAD so that existing files are replaced. This works fine in most cases, but at times the unload writes duplicate files for the same data, which causes the external table to report doubled numbers.
For example, if a partition contains 0000_part_00.parquet with the complete data, the next run is expected to overwrite that file; instead it writes a new file, 0000_part_01.parquet, which doubles the total output.
The duplication does not recur if I clean up the entire partition and rerun the UNLOAD, but this inconsistency is making our system unreliable.
UNLOAD ('<simple select statement>')
TO 's3://<s3 bucket>/<prefix>/'
IAM_ROLE '<iam-role>'
ALLOWOVERWRITE
PARQUET
PARTITION BY (partition_col1, partition_col2);
Thank you.