
I have partitioned data in HDFS. At some point I decide to update it. The algorithm is (a rough code sketch follows the list):

  • Read the new data from a Kafka topic.
  • Find out the new data's partition names.
  • Load the data from HDFS for the partitions with those names.
  • Merge the HDFS data with the new data.
  • Overwrite the partitions that are already on disk.
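
To make the steps concrete, here is a rough sketch of that flow (not my exact code; the broker address, topic name, base path, and the parseEvents helper are placeholders):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("PartitionUpdater").getOrCreate()
val basePath = "/data/events" // placeholder HDFS path
val format = "parquet"        // placeholder storage format

// 1. Read the new data from a Kafka topic (batch read).
val raw = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
val newData: DataFrame = parseEvents(raw) // hypothetical parser yielding the "date" and "key" columns

// 2. Find out the new data's partition values.
val touchedKeys = newData.select("date", "key").distinct()

// 3. Load from HDFS only the partitions touched by the new data; the
//    left-semi join keeps exactly the rows whose (date, key) appears in the new batch.
val existing = spark.read.format(format).load(basePath)
    .join(touchedKeys, Seq("date", "key"), "left_semi")

// 4. Merge the HDFS data with the new data (plus whatever dedup logic applies).
val merged = existing.unionByName(newData)

// 5. Overwrite the partitions that are already on disk -- this is the step
//    that misbehaves, as described below.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
merged.write
    .mode(SaveMode.Overwrite)
    .partitionBy("date", "key")
    .format(format)
    .save(basePath)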

The problem is: what if the new data contains partitions that don't exist on disk yet? In that case they don't get written. For example, the solution in https://stackoverflow.com/a/49691528/10681828 doesn't write new partitions.

[Image: Venn diagram of the existing HDFS partitions (left disk) and the partitions just received from Kafka (right disk)]

The picture above describes the situation. Think of the left disk as the partitions that are already in HDFS and the right disk as the partitions we just received from Kafka.

Some of the partitions on the right disk will intersect with the already existing ones; the others won't. And this code:

import org.apache.spark.sql.SaveMode // `format` and `path` are defined elsewhere

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataFrame
    .write
    .mode(SaveMode.Overwrite)   // with "dynamic", only the partitions present in dataFrame should be replaced
    .partitionBy("date", "key")
    .option("header", "true")
    .format(format)
    .save(path)

is not able to write the blue part of the picture to disk.

So, how do I resolve this issue? Please provide code. I am looking for something performant.

A concrete example for those who are unsure what I mean:

Suppose we have this data in the HDFS:

  • PartitionA has data "1"
  • PartitionB has data "1"

Now we receive this new data:

  • PartitionB has data "2"
  • PartitionC has data "1"

So, partitions A and B are in HDFS, partitions B and C are the new ones, and since B is already in HDFS we update it. And I want C to be written too. So the end result should look like this:

  • PartitionA has data "1"
  • PartitionB has data "2"
  • PartitionC has data "1"

But if I use the code above, I get this:

  • PartitionA has data "1"
  • PartitionB has data "2"

Because the dynamic overwrite feature introduced in Spark 2.3 is not able to create PartitionC.

Update: It turns out that this works if you use Hive tables instead. But with pure Spark it doesn't... So, I guess Hive's overwrite and Spark's overwrite work differently.
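
For reference, here is roughly what the working Hive-table variant looks like (a sketch only; "db.events" is a placeholder table assumed to be partitioned by date and key):

// Note: insertInto must NOT be combined with partitionBy, and the
// DataFrame's column order must match the table definition, since
// insertInto resolves columns by position.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

merged.write
    .mode(SaveMode.Overwrite)
    .insertInto("db.events")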

hey_you
  • I'm not sure whether setting the property just before use affects its behavior, but I just came across this problem and setting `"spark.sql.sources.partitionOverwriteMode": "dynamic"` worked for me. **I set it while creating the Spark session, though.** You may want to try that. – mrbolichi Jan 18 '19 at 14:53
  • That's flipping weird. I will test this later and report back. – hey_you Jan 18 '19 at 15:12
  • @mrbolichi like so: `SparkSession.builder() .config("spark.sql.sources.partitionOverwriteMode", "dynamic") .appName("Ingester") .getOrCreate()`? – hey_you Jan 18 '19 at 17:10
  • I just tried it, it did not work. – hey_you Jan 18 '19 at 17:16
  • Are you trying to `save` into a Hive table? I have reproduced this behavior only by inserting into a Hive table – mrbolichi Jan 18 '19 at 19:07
  • @mrbolichi no. This is the thing, I know that it works with hive tables. I want to get it to work without them. – hey_you Jan 18 '19 at 19:12
  • Are you trying to save parquet/csv/orc files and expecting this behavior to hold the same as Hive tables? – mrbolichi Jan 18 '19 at 19:27
  • @mrbolichi I don't understand what you are trying to say. I have a dataframe, and I use `DataFrameWriter`'s `save` method with `SaveMode.Overwrite` to write it to HDFS, yet not all of its partitions are written to disk. What does that have to do with expecting this to work like a Hive table? The two approaches are supposed to offer the same functionality, but they behave differently, and I am simply wondering how to make my approach overwrite in the same fashion as the Hive approach would. – hey_you Jan 18 '19 at 19:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186938/discussion-between-mrbolichi-and-hey-you). – mrbolichi Jan 18 '19 at 19:47
  • Works with Spark, directly via a path to a folder. Answered here: https://stackoverflow.com/a/56570869/2193715 – Skandy Jun 12 '19 at 22:20
  • @Skandy no it doesn't, it will remove the old contents from the root folder – hey_you Mar 19 '20 at 15:35

1 Answer


In the end I just decided to delete that "green" subset of partitions from HDFS and use SaveMode.Append instead. I think this is a bug in Spark.
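
Roughly, the workaround looks like this (a sketch only; it assumes the date=<d>/key=<k> directory layout that partitionBy("date", "key") produces, and reuses the newData/merged/basePath/format names from the sketch in the question):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Delete the partitions that intersect with the new data ...
newData.select("date", "key").distinct().collect().foreach { row =>
    val dir = new Path(s"$basePath/date=${row.get(0)}/key=${row.get(1)}")
    if (fs.exists(dir)) fs.delete(dir, true) // recursive delete
}

// ... then append the merged data: untouched partitions stay as they are,
// the deleted ones are rewritten, and brand-new partitions (the PartitionC
// case) get created, since Append never removes existing directories.
merged.write
    .mode(SaveMode.Append)
    .partitionBy("date", "key")
    .format(format)
    .save(basePath)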

hey_you