
So, I have this folder, let's call it /data.

And it has partitions in it, e.g.: /data/partition1, /data/partition2.

I read new data from Kafka, and imagine I only need to update /data/partition2. I do:

dataFrame
    .write
    .mode(SaveMode.Overwrite)
    .partitionBy("date", "key")
    .option("header", "true")
    .format(format)
    .save("/data")

and it successfully updates /data/partition2, but /data/partition1 is gone... How can I force Spark's SaveMode.Overwrite to not touch the HDFS partitions that don't need to be updated?

hey_you
  • Possible duplicate of [Overwrite specific partitions in spark dataframe write method](https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-spark-dataframe-write-method) – 10465355 Jan 17 '19 at 17:31
  • @user10465355 did you try what is suggested in that link (sketched below for reference)? It simply doesn't work. The partitions that are not yet in HDFS don't get written to it at all. So it updates already existing ones and doesn't create any new folders. – hey_you Jan 17 '19 at 21:42
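For reference, what the linked question suggests is dynamic partition overwrite. A minimal sketch, assuming Spark 2.3+ (where the `spark.sql.sources.partitionOverwriteMode` setting exists):

// With "dynamic" partition overwrite, SaveMode.Overwrite replaces only
// the partitions present in dataFrame instead of wiping all of /data.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dataFrame
    .write
    .mode(SaveMode.Overwrite)
    .partitionBy("date", "key")
    .option("header", "true")
    .format(format)
    .save("/data")

The default value of that setting, "static", is what makes the write delete /data/partition1 in the first place.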

1 Answer


You are using `SaveMode.Overwrite`, which deletes the previously existing directories. You should use `SaveMode.Append` instead.

NOTE: The append operation is not without cost. When you call save using append mode, Spark needs to ensure uniqueness of the file names so that it won't overwrite an existing file by accident. The more files you already have in the directory, the longer the save operation takes. If you are talking about a handful of files, it's a very cost-effective operation. But if you have many terabytes of data in thousands of files in the original directory (which was my case), you should use a different approach.
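For completeness, the append variant is the question's write with only the save mode changed:

dataFrame
    .write
    .mode(SaveMode.Append) // adds new files; never deletes existing partition directories
    .partitionBy("date", "key")
    .option("header", "true")
    .format(format)
    .save("/data")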

nads
  • That will append, I don't want that. – hey_you Jan 17 '19 at 20:22
  • Sorry, I think I got your question all wrong. Let me have another attempt :) It seems like you want to write to a given directory if and only if the corresponding HDFS directory exists. If I get it right, then direct saving from Spark is not the answer you're looking for. You should first save into a different location, move the directories you want, and toss the rest. – nads Jan 17 '19 at 20:33
  • It doesn't necessarily have to exist. – hey_you Jan 17 '19 at 23:44
  • Imagine I have new partitioned data. Some of the partitions already exist on hdfs, others don't. So, the ones that exist I want to overwrite, the ones that don't I want to just write. And the ones that are not part of the new partitions I want to leave untouched. – hey_you Jan 17 '19 at 23:45
  • There is something that I still don't understand. Could you please provide a concrete example? The way you are writing creates directories such as `/data/date=20190117/key=abcd` and I fail to see the relation between the output of the _write()_ function and directories such as `/data/partition1` that you mentioned in your question. – nads Jan 18 '19 at 01:26
  • Yes, the output is like this `/data/date=20190117/key=abcd`, think of it as `partition1=/date=20190117/key`. It doesn't really matter – hey_you Jan 18 '19 at 08:10
  • https://stackoverflow.com/questions/54246038/how-do-i-upsert-into-hdfs-with-spark another question I posted – hey_you Jan 18 '19 at 08:11
  • Got it! I think you need to filter your data frame into two smaller subsets. There are three groups of partitions (exists, doesn't exist, ignore). Use `SaveMode.Overwrite` for the first two and drop the last one. Prior to writing the results, you need to know which partitions already exist; that could be a simple ls command. Here's a pseudo code (a fleshed-out Scala sketch follows below): `overwriteDF = df.where(col(date).isIn(partition1) and col(key).isIn(partition1)); ignoreDF = df.where(col(date).isIn(partition2) and col(key).isIn(partition2))` – nads Jan 18 '19 at 22:27
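A rough Scala rendering of that pseudo code, purely as a sketch: the `existingPartitions` values are hypothetical (in practice they would come from the "simple ls" the comment mentions), and the per-partition write loop at the end is one way the two groups could be written out, not something the thread itself spells out.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Hypothetical: (date, key) pairs already present under /data,
// e.g. collected by listing the directory beforehand.
val existingPartitions = Seq(("20190117", "abcd"))

// Rows whose partition already exists on HDFS ...
val existsFilter = existingPartitions
  .map { case (d, k) => col("date") === d && col("key") === k }
  .reduce(_ || _)

val overwriteDF = df.where(existsFilter)  // partitions to replace
val newDF       = df.where(!existsFilter) // brand-new partitions

// New partitions: append creates them without deleting siblings.
newDF.write
  .mode(SaveMode.Append)
  .partitionBy("date", "key")
  .format(format)
  .save("/data")

// Existing partitions: overwrite each one at its own path, leaving
// the rest of /data untouched.
for ((d, k) <- existingPartitions) {
  overwriteDF
    .where(col("date") === d && col("key") === k)
    .drop("date", "key") // the values are encoded in the directory name
    .write
    .mode(SaveMode.Overwrite)
    .format(format)
    .save(s"/data/date=$d/key=$k")
}

Partitions on HDFS that appear in neither group (the "ignore" set) are simply never written to, so they stay untouched.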