
We use UNLOAD commands to run transformations on S3-based external tables and publish the data into a different S3 bucket in PARQUET format.

I use the ALLOWOVERWRITE option in the UNLOAD operation to replace files if they already exist. This works in most cases, but at times it writes duplicate files for the same data, which causes the external table to show duplicate counts.

For example, suppose the partition contains 0000_part_00.parquet, which holds the complete data. On the next run, UNLOAD is expected to overwrite this file, but instead it writes a new file, 0000_part_01.parquet, which doubles the total output.

The problem does not recur if I clean up the entire partition and rerun. This inconsistency is making our system unreliable.

UNLOAD ('<simple select statement>')
TO 's3://<s3 bucket>/<prefix>/'
IAM_ROLE '<iam-role>'
ALLOWOVERWRITE
PARQUET
PARTITION BY (partition_col1, partition_col2);

Thank you.

Abhi

2 Answers


In my experience the ALLOWOVERWRITE parameter only acts on the generated file names: a result is overwritten only if two files have the same name.

The parameter works in most cases, but in this domain "most cases" is not good enough. I stopped using it (and was quite disappointed). What I do instead is manually delete the files from the S3 console (or rather move them to a staging folder) and then unload the data without relying on the ALLOWOVERWRITE parameter.

This is also mentioned in the comments of this answer: https://stackoverflow.com/a/61594603/4725074
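
If it helps, here is a minimal sketch of scripting that cleanup with boto3 before running the UNLOAD. The bucket and prefix values are placeholders, not from the original post:

import boto3

# Hypothetical names -- replace with your own bucket and (partition) prefix.
bucket = "<s3 bucket>"
prefix = "<prefix>/"

# Remove every object under the prefix so the next UNLOAD starts from an
# empty path and cannot leave stale part files behind.
s3 = boto3.resource("s3")
s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()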

Vzzarr

To prevent redundant data, use Redshift's CLEANPATH option in your UNLOAD statement. Note the difference, per the documentation (perhaps AWS could make this clearer):

ALLOWOVERWRITE
By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.

CLEANPATH
The CLEANPATH option removes existing files located in the Amazon S3 path specified in the TO clause before unloading files to the specified location.
If you include the PARTITION BY clause, existing files are removed only from the partition folders to receive new files generated by the UNLOAD operation.
You must have the s3:DeleteObject permission on the Amazon S3 bucket. For information, see Policies and Permissions in Amazon S3 in the Amazon Simple Storage Service Console User Guide. Files that you remove by using the `CLEANPATH` option are permanently deleted and can't be recovered.
You can't specify the `CLEANPATH` option if you specify the `ALLOWOVERWRITE` option.

Therefore, as @Vzzarr says, ALLOWOVERWRITE only overwrites files that share the same name as an incoming file. For recurring unload operations that do not need past data to remain intact, use CLEANPATH.

And note that you cannot use both ALLOWOVERWRITE and CLEANPATH in the same UNLOAD statement.

Here's an example:

f"""
UNLOAD ('{your_query}')
TO 's3://{destination_prefix}/'
iam_role '{IAM_ROLE_ARN}'
PARQUET
MAXFILESIZE 4 GB
MANIFEST verbose
CLEANPATH
"""
cdabel