83

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is a DataFrame holding the incremental data that should overwrite the matching partitions.

hdfs-base-path contains the master data.

When I run the above command, it deletes all existing partitions and inserts only those present in df at the HDFS path.

My requirement is to overwrite only those partitions present in df at the specified HDFS path. Can someone please help me with this?

Prasad Khode
yatin

13 Answers

152

Finally! This is now a feature in Spark 2.3.0: SPARK-20236

To use it, set spark.sql.sources.partitionOverwriteMode to dynamic, make sure the dataset is partitioned, and use the overwrite write mode. Example:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
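For instance, a minimal sketch of that repartition step (the partition column col4 comes from the question; partitioned_table is just the placeholder table name used above):

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Repartition on the partition column so that all rows for a given col4 value
# land in the same task, leaving one file (or a handful) per partition folder.
df.repartition("col4") \
  .write \
  .mode("overwrite") \
  .insertInto("partitioned_table")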

Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
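A hedged sketch of that pre-2.3.0 workaround (table and column names are placeholders, not from the original answer):

# Drop each partition that is present in df, then append df.
partitions = [row["col4"] for row in df.select("col4").distinct().collect()]
for value in partitions:
    spark.sql("ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (col4='{}')".format(value))

df.write.mode("append").insertInto("partitioned_table")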

Naman
Madhava Carrillo
  • 2
    It was hard for me to find the setting to use this, so leaving here the reference: https://stackoverflow.com/questions/50006526/overwrite-only-some-partitions-in-a-partitioned-spark-dataset – Madhava Carrillo May 02 '18 at 08:37
  • 1
    Can you please edit the answer to show example code from the JIRA? – OneCricketeer Jul 18 '18 at 12:51
  • Doesn't work. The new data that is not yet in HDFS is not written to it. – hey_you Jan 17 '19 at 21:45
  • Please take a look here https://stackoverflow.com/questions/54246038/what-is-the-fastest-way-to-property-update-hdfs-data-with-spark maybe you can help me – hey_you Jan 18 '19 at 00:12
  • 1
    If I'm **overwriting a single partition** and *I know name of that partition apriori*, is there a way to specify that to `spark` like we can do [in `Hive`](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML)? I'm asking this because that would give me a lot of assurance and kind of work as sanity check, plus I believe there would be some performance benefit too (since runtime resolution of partition for every record wouldn't be required) – y2k-shubham Feb 13 '19 at 07:05
  • 2
    @y2k-shubham yes, use `spark.sql('insert overwrite table TABLE_NAME partition(PARTITION_NAME=PARTITION_VALUE) YOUR SELECT STATEMENT')`. This works for 2.2 at least, not sure if earlier versions support this. – Tetlanesh Feb 28 '19 at 14:31
  • where do you specify the partitions? – enneppi Mar 20 '19 at 13:45
  • Works good for me. I do not use hive table: someDf.write.format("orc").mode(SaveMode.Overwrite).partitionBy("partitioned_column").save(path_to_write_orc) – Miroslaw Dec 05 '19 at 11:08
58

This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,

df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")

If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
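Put together, a PySpark sketch of this recipe might look as follows (the base path and partition value are placeholders; the answer's own snippets use the Scala API):

import subprocess

# Spark < 2.0: stop Parquet summary metadata files from being written.
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

# Write only the rows of the partition being replaced, directly into its directory.
# The partition column is dropped because the directory name already encodes its value.
df.where(df["partition_col"] == "value") \
  .drop("partition_col") \
  .write \
  .mode("overwrite") \
  .save("/root/path/to/data/partition_col=value")

# Spark < 1.6.2: also remove the _SUCCESS marker so partition discovery keeps working.
subprocess.call(["hadoop", "fs", "-rm", "-f",
                 "/root/path/to/data/partition_col=value/_SUCCESS"])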

You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.

Sim
  • 1
    Thanks a lot, Sim, for answering. Just a few more doubts: if the initial dataframe has data for around 100 partitions, do I have to split this dataframe into 100 separate dataframes with the respective partition values and insert directly into the partition directories? Can saving these 100 partitions be done in parallel? Also, I am using Spark 1.6.1. If I am using the ORC file format, how can I stop emitting metadata files for that? Is it the same as what you mentioned for Parquet? – yatin Jul 26 '16 at 16:12
  • Re: metadata, no, ORC is a different format and I don't think it produces non-data files. With 1.6.1 you need only ORC files in the subdirectories of the partition tree. You'll therefore have to delete `_SUCCESS` by hand. You can write in parallel to more than one partition but not from the same job. Start multiple jobs based on your platform capabilities, e.g., using REST API. – Sim Jul 26 '16 at 22:57
  • 7
    Any update on that? Will saveToTable() overwrite just specific partitions? Is Spark smart enough to figure out which partitions were overwritten? – David H Oct 21 '16 at 14:17
17
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")

This works for me on AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2)

Zach
  • How does this approach behave with the job bookmark? Say you have an existing partition (e.g. day) which only has the first 12 hours of data for the day, and new files have arrived in your source that are for the second 12 hours that should be added to the partition, I worry that the Glue job bookmark is pretty naive and it will end up only writing data from the new files for that second 12 hours. Or do you not use the job bookmark? – Davos Aug 26 '19 at 06:50
  • 1
    Great question! I had exactly the same concern. My use case is that I specifically ask Glue to re-process certain partitions and re-write the results (using the above two lines). With the job bookmark enabled, it refuses to re-process the "old" data. – Zach Aug 26 '19 at 14:37
  • So you don't use the bookmark? That was pretty much the only reason I could see for bothering with the glueContext over just sticking with Spark. I don't want to manage the processed state, but I'm finding the bookmark is flaky, relying on file modified timestamps and no way to sync it apart from a brutal reset. Why Python 2 and not 3? – Davos Aug 26 '19 at 14:55
  • 1
    Yeah, the job bookmark has been bothering me for a while now. It's good for some low profile day-to-day work. But once you have a little bit "off-road" actions, that thing is less than useless. Regarding the Python version, when upgrading from Glue 0.9, looking at the two options (Python 2 vs 3), I just didn't want to break anything since the code was written in Python 2 era ^_^ – Zach Aug 26 '19 at 18:48
  • "less than useless", noted. Apart from `print is a function`, `unicode done properly` and the `literal long not necessary` there's not much in going 2->3. The Pyspark DSL syntax seems identical. Python 2 is officially unsupported in 2020, time to abandon it. – Davos Aug 27 '19 at 00:45
  • Thanks, it seems to work. I am also updating the partitions in Athena and it seems to work well. Thanks! – Camilo Velasquez Nov 11 '20 at 20:27
9

Adding the overwrite=True parameter to the insertInto statement solves this:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)

By default overwrite=False. Changing it to True allows us to overwrite only the specific partitions contained in both df and the partitioned_table. This helps us avoid overwriting the entire contents of the partitioned_table with df.

Cena
8

Using Spark 1.6...

The HiveContext can simplify this process greatly. The key is that you must create the table in Hive first using a CREATE EXTERNAL TABLE statement with partitioning defined. For example:

# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'

From here, let's say you have a DataFrame with new records for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE using this DataFrame, which will overwrite the table only for the partitions contained in the DataFrame:

# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')

hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
                   SELECT name, age
                   FROM update_dataframe""")

Note: update_dataframe in this example has a schema that matches that of the target test table.
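If Hive rejects the INSERT with "Dynamic partition strict mode requires at least one static partition column" (as reported in the comments below), enabling non-strict dynamic partitioning before the statement usually helps; a hedged sketch:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
# Strict mode insists on at least one static value in PARTITION (...); relax it.
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")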

One easy mistake to make with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and just create the table using the DataFrame API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE ... PARTITION function.

Hope this helps.

vertigokidd
  • I tried the above approach, i'm getting the error like `Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict` – Shankar Dec 18 '17 at 13:09
  • I don't have any static partition columns – Shankar Dec 18 '17 at 13:10
6

Tested this on Spark 2.3.1 with Scala. Most of the answers above write to a Hive table. However, I wanted to write directly to disk, with an external Hive table on top of that folder.

First, the required configuration:

val sparkSession: SparkSession = SparkSession
      .builder
      .enableHiveSupport()
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
      .appName("spark_write_to_dynamic_partition_folders")
      .getOrCreate()

Usage here:

df // the DataFrame holding the partitions to be overwritten
  .write
  .format("<required file format>")
  .partitionBy("<partitioned column name>")
  .mode(SaveMode.Overwrite) // This is required.
  .save("<path_to_root_folder>")
Skandy
3

I tried the approach below to overwrite a particular partition in a Hive table.

### Load data and check records
    from pyspark.sql.functions import col, lit

    raw_df = spark.table("test.original")
    raw_df.count()

Let's say this table is partitioned on the column **c_birth_year** and we would like to update the partitions for years up to 1925.


### Check data in a few partitions.
    sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
    print("Number of records:", sample.count())
    sample.show()


### Back-up the partitions before deletion
    raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")


### Helper function to delete a particular partition.
    def delete_part(table, part):
        qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
        spark.sql(qry)


### Delete partitions
    part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
    part_list = part_df.rdd.map(lambda x : x[0]).collect()

    table = "test.original"
    for p in part_list:
        delete_part(table, p)


### Do the required Changes to the columns in partitions
    df = spark.table("test.original_bkp")
    newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
    newdf.select("c_customer_sk", "c_preferred_cust_flag").show()


### Write the Partitions back to Original table
    newdf.write.insertInto("test.original")


### Verify data in the original table
    original = spark.table("test.original")
    original.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()



Hope it helps.

Regards,

Neeraj
neeraj bhadani
2

As jatin wrote, you can delete the partitions from Hive and from the path, and then append the data. Since I wasted too much time on this, I added the following example for other Spark users. I used Scala with Spark 2.2.1.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}

case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)

object StackOverflowExample extends App {
  // Prepare Spark & data
  val sparkConf = new SparkConf()
  sparkConf.setMaster(s"local[2]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val tableName = "my_table"

  val partitions1 = List(1, 2)
  val partitions2 = List("e1", "e2")
  val partitionColumns = List("partition1", "partition2")
  val myTablePath = "/tmp/some_example"

  val someText = List("text1", "text2")
  val ids = (0 until 5).toList

  val listData = partitions1.flatMap(p1 => {
    partitions2.flatMap(p2 => {
      someText.flatMap(text => {
        ids.map(id => DataExample(p1, p2, text, id))
      })
    })
  })

  val asDataFrame = spark.createDataFrame(listData)

  // Delete a path (recursively) on the underlying file system
  def deletePath(path: String, recursive: Boolean): Unit = {
    val p = new Path(path)
    val fs = p.getFileSystem(new Configuration())
    fs.delete(p, recursive)
  }

  def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
    if (spark.catalog.tableExists(tableName)) {
      // Clean only the partitions present in df
      val asColumns = partitions.map(c => new Column(c))
      val relevantPartitions = df.select(asColumns: _*).distinct().collect()
      val partitionToRemove = relevantPartitions.map(row => {
        val fields = row.schema.fields
        s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
          s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
      })

      val cleanFolders = relevantPartitions.map(partition => {
        val fields = partition.schema.fields
        path + "/" + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
      })

      println(s"Going to clean ${partitionToRemove.size} partitions")
      partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
      cleanFolders.foreach(partition => deletePath(partition, true))
    }
    df.write
      .options(Map("path" -> myTablePath))
      .mode(SaveMode.Append)
      .partitionBy(partitionColumns: _*)
      .saveAsTable(tableName)
  }

  // Now test
  tableOverwrite(asDataFrame, partitionColumns, myTablePath)
  spark.sqlContext.sql(s"select * from $tableName").show(1000)
  tableOverwrite(asDataFrame, partitionColumns, myTablePath)

  import spark.implicits._

  val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
  if (asLocalSet == listData.toSet) {
    println("Overwrite is working !!!")
  }
}

Ehud Lev
1

If you use a DataFrame, you probably want to use a Hive table over the data. In that case, you just need to call this method:

df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)

It'll overwrite the partitions that the DataFrame contains.

There's no need to specify the format (orc), because Spark will use the Hive table format.

It works fine in Spark version 1.6

L. Viktor
  • 1
    This removes the previous partitions if they are not in the current dataframe. – Carlos Verdes Apr 25 '18 at 13:59
  • How to update the data if table is partitioned based on multiple columns say year, month and I only want to overwrite based on year? – neeraj bhadani Jul 18 '18 at 13:01
  • Also I am getting error : AnalysisException: u"insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;" – neeraj bhadani Jul 18 '18 at 13:03
  • without partitionBy I am getting duplicate data inserted even with mode("overwrite") – neeraj bhadani Jul 18 '18 at 13:05
  • This is partially correct. See Surya Murali comment for the additional settings I needed to add in order for it to work. At least in my case worked (spark 1.6, scala) – Costin Aldea Sep 17 '20 at 15:17
1

Instead of writing to the target table directly, I would suggest creating a temporary table like the target table and inserting your data there.

CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation>';

Once the table is created, you would write your data to the tmpLocation

df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)

Then you would recover the table partition paths by executing:

MSCK REPAIR TABLE tmpTbl;

Get the partition paths by querying the Hive metadata like:

SHOW PARTITIONS tmpTbl;

Delete these partitions from the trgtTbl and move the directories from tmpTbl to trgtTbl
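A PySpark sketch of that final drop-and-move step (assuming a single partition column p_col as in the snippet above; tmp_location and trgt_location are hypothetical path variables, not part of the original answer):

import subprocess

# tmp_location / trgt_location: HDFS paths of tmpTbl and trgtTbl (placeholders)
# Partitions written into the temporary table, e.g. "p_col=2018-01-01"
partition_specs = [row[0] for row in spark.sql("SHOW PARTITIONS tmpTbl").collect()]

for spec in partition_specs:
    col_name, value = spec.split("=", 1)
    # Drop the matching partition from the target table (and its data, if the table is managed).
    spark.sql("ALTER TABLE trgtTbl DROP IF EXISTS PARTITION ({}='{}')".format(col_name, value))
    # Move the freshly written directory into the target table's location.
    subprocess.call(["hadoop", "fs", "-mv",
                     "{}/{}".format(tmp_location, spec),
                     "{}/{}".format(trgt_location, spec)])

# Let the target table pick up the moved directories.
spark.sql("MSCK REPAIR TABLE trgtTbl")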

Joha
0

I would suggest doing a clean-up first and then writing the new partitions with Append mode:

import scala.sys.process._
def deletePath(path: String): Unit = {
    s"hdfs dfs -rm -r -skipTrash $path".!
}

df.select(partitionColumn).distinct.collect().foreach(p => {
    val partition = p.getAs[String](partitionColumn)
    deletePath(s"$path/$partitionColumn=$partition")
})

df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)

This deletes only the partitions that are present in df. After writing the data, run this command if you need to update the metastore:

sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")

Note: deletePath assumes that the hdfs command is available on your system.

gorros
-1

You could do something like this to make the job reentrant (idempotent); tried this on Spark 2.2:

# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)

# delete directory
dbutils.fs.rm(<partition_directory>, recurse=True)

# Load the partition
df.write\
  .partitionBy("partition_col")\
  .saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
jatin
  • Why Python 2? Also this looks like Databricks specific, good to mention that for others not using that platform. I like idempotent but is this really? What if deleting the directory is successful but the append is not? How do you guarantee the df contains the deleted partition's data? – Davos Aug 26 '19 at 07:02
-1

For Spark >= 2.3.0:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)
Allan Feliph