So, I have this folder, let's call it /data, and it has partitions in it, e.g. /data/partition1 and /data/partition2. I read new data from Kafka, and imagine I only need to update /data/partition2. I do:
dataFrame
.write
.mode(SaveMode.Overwrite)
.partitionBy("date", "key")
.option("header", "true")
.format(format)
.save("/data")
and it successfully updates /data/partition2, but /data/partition1 is gone. How can I force Spark's SaveMode.Overwrite not to touch HDFS partitions that don't need to be updated?
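For reference, I came across the `spark.sql.sources.partitionOverwriteMode` setting (added in Spark 2.3, as far as I can tell), which sounds like it might be related, e.g. set cluster-wide in spark-defaults.conf:

```properties
# spark-defaults.conf — assumes Spark 2.3+; the default value is "static",
# which overwrites the whole /data directory on a partitioned write
spark.sql.sources.partitionOverwriteMode dynamic
```

but I'm not sure whether that is the right approach here, whether it should instead be set per-session (or per-write), or whether it applies to my Spark version and output format.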