I am using Spark 2.3 and trying to read a Hive table in Spark as follows:
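from pyspark.sql import SparkSession
# Assumes a Hive-enabled session; in the pyspark shell `spark` already exists.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("emp.emptable")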
Here I add a new column holding the current system date to the existing DataFrame:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
I then hit an error when I try to write this DataFrame back as a Hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
So I checkpoint the DataFrame to break the lineage, since I am reading from and writing to the same table:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This works, and the new column is added to the Hive table, but I have to delete the checkpoint files every time they are created. Is there a better way to break the lineage and write the same DataFrame, with the updated columns, to an HDFS location or as a Hive table?
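One alternative I have considered is going through a staging table instead of a checkpoint, so that the overwrite never targets the table currently being read. This is only a minimal sketch under assumptions: `overwrite_via_staging`, the `transform` callback and the `_staging` suffix are names I made up for illustration, and I am not sure this is the recommended approach.

```python
def overwrite_via_staging(spark, table, transform, staging_suffix="_staging"):
    """Rewrite `table` with `transform` applied, using a temporary staging table.

    `spark` is an existing Hive-enabled SparkSession; `transform` takes a
    DataFrame and returns a DataFrame. All names here are hypothetical.
    """
    staging = table + staging_suffix

    # 1. Write the transformed data to a separate table, so the write
    #    does not target the table that is being read.
    transform(spark.table(table)).write.mode("overwrite").saveAsTable(staging)

    # 2. Overwrite the original table from the staging copy.
    spark.table(staging).write.mode("overwrite").saveAsTable(table)

    # 3. Drop the staging table so nothing has to be cleaned up by hand.
    spark.sql("DROP TABLE IF EXISTS {}".format(staging))


# Intended usage (needs a running SparkSession, so it is left as a comment):
#   import pyspark.sql.functions as F
#   overwrite_via_staging(
#       spark, "emp.emptable",
#       lambda df: df.withColumn("LOAD_DATE", F.current_date()))
```

The cost is that the data is written twice, but no checkpoint directory is involved.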
Or is there a way to specify a temporary location for the checkpoint directory that is deleted once the Spark session ends?
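To illustrate the kind of automatic cleanup I have in mind, here is a rough sketch of a context manager that creates a scratch checkpoint directory and removes it afterwards. It assumes a local filesystem path (for example Spark in local mode); on a cluster the directory would have to be on HDFS and be deleted through an HDFS client instead. The name `temporary_checkpoint_dir` is my own, not a Spark API.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_checkpoint_dir(spark_context=None, prefix="spark-checkpoint-"):
    """Create a scratch directory, register it as the checkpoint dir, delete it on exit.

    Assumption: `spark_context` is an existing SparkContext (or None to skip
    registration) and the path is on a local filesystem, not HDFS.
    """
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        if spark_context is not None:
            spark_context.setCheckpointDir(path)
        yield path
    finally:
        # Remove the checkpoint files whether or not the job succeeded.
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)


# Intended usage (needs a running SparkSession, so it is left as a comment):
#   with temporary_checkpoint_dir(spark.sparkContext):
#       df = spark.table("emp.emptable").checkpoint()
#       newdf = df.withColumn("LOAD_DATE", F.current_date())
#       newdf.write.mode("overwrite").saveAsTable("emp.emptable")
```

The write has to finish inside the `with` block, because the checkpointed data is gone once the block exits.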