Home > Back-end >  databricks overwriting entire table instead of adding new partition
databricks overwriting entire table instead of adding new partition

Time:12-08

I have this table

CREATE TABLE `db`.`customer_history` (
  `name` STRING,
  `addrress` STRING, 
  `filename` STRING,
  `dt` DATE)
USING delta
PARTITIONED BY (dt)

When I use this to load a partition data to the table

      df
    .write
    .partitionBy("dt")
    .mode("overwrite")
    .format("delta")
    .saveAsTable("db.customer_history")

For some reason, it overwrites the entire table. I thought the overwrite mode only overwrites the partition data (if it exists). Is my understanding correct?

CodePudding user response:

In order to overwrite a single partition, use:

df
  .write 
  .format("delta") 
  .mode("overwrite") 
  .option("replaceWhere", "dt >= '2021-01-01'") 
  .save("data_path")

CodePudding user response:

Delta makes it easy to update certain disk partitions with the replaceWhere option. You can selectively overwrite only the data that matches predicates over partition columns as like this ,

dataset.write.repartition(1)\
       .format("delta")\
       .mode("overwrite")\
       .partitionBy('Year','Week')\
       .option("replaceWhere", "Year == '2019' AND Week >='01' AND Week <='02'")\ #to avoid overwriting Week3
       .save("\curataed\dataset")

Note : replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions'

You can ref : link

  • Related