I have this table:
CREATE TABLE `db`.`customer_history` (
`name` STRING,
`addrress` STRING,
`filename` STRING,
`dt` DATE)
USING delta
PARTITIONED BY (dt)
When I use the following to load one partition's data into the table:
df
.write
.partitionBy("dt")
.mode("overwrite")
.format("delta")
.saveAsTable("db.customer_history")
For some reason, it overwrites the entire table. I thought the overwrite mode only overwrites the partition data (if it exists). Is my understanding correct?
CodePudding user response:
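To overwrite only specific partitions, use the replaceWhere option. The predicate below replaces every partition with dt on or after 2021-01-01; use an equality predicate such as dt = '2021-01-01' to replace a single partition: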
df
.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "dt >= '2021-01-01'")
.save("data_path")
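For the table in the question, a minimal sketch of replacing exactly one `dt` partition might look like the following. The helper names `single_day_predicate` and `overwrite_day` are hypothetical, `df` is assumed to be a Spark DataFrame that contains rows for that day only, and `replaceWhere` is assumed to behave with `saveAsTable` the same way it does with `save`.

```python
from datetime import date


def single_day_predicate(day: date, column: str = "dt") -> str:
    """Build a replaceWhere predicate that matches exactly one date partition."""
    return f"{column} = '{day.isoformat()}'"


def overwrite_day(df, table_name: str, day: date) -> None:
    """Overwrite only the partition for `day`; other partitions are left untouched.

    `df` is assumed to be a Spark DataFrame holding rows for `day` only.
    """
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", single_day_predicate(day))
        .saveAsTable(table_name))
```

It could then be called as `overwrite_day(df, "db.customer_history", date(2021, 1, 1))`.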
CodePudding user response:
Delta makes it easy to update selected partitions with the replaceWhere option. You can selectively overwrite only the data that matches a predicate over the partition columns, like this:
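# The predicate limits the overwrite to weeks 01 and 02 of 2019, so week 03 is not overwritten.
# repartition() is a DataFrame method, so it has to be called before .write.
(dataset.repartition(1)
.write
.format("delta")
.mode("overwrite")
.partitionBy("Year", "Week")
.option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'")
.save(r"\curataed\dataset"))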
Note: replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions.
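In recent Delta Lake versions the write is checked by default against the `replaceWhere` predicate, so rows that fall outside it make the write fail. Filtering the recomputed data with the same expression before writing is one way to stay within it. A rough sketch, with hypothetical helper names `week_range_predicate` and `overwrite_weeks`, assuming `Year` and `Week` are zero-padded string columns as in the example above:

```python
def week_range_predicate(year: str, first_week: str, last_week: str) -> str:
    """Build a predicate covering an inclusive range of weeks within one year."""
    return f"Year = '{year}' AND Week >= '{first_week}' AND Week <= '{last_week}'"


def overwrite_weeks(df, path: str, year: str, first_week: str, last_week: str) -> None:
    """Replace only the selected weeks at `path`; all other partitions stay as they are."""
    predicate = week_range_predicate(year, first_week, last_week)
    (df.filter(predicate)  # keep only rows inside the range being replaced
        .write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(path))
```

Using one string for both the filter and the option keeps the two from drifting apart when the range changes.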