I have a table that ingests new data every day through a merge. I'm currently migrating from ORC to the Delta file format, and I ran into a problem when processing the following simple merge operation:
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and, when possible, partition keys; not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge is done, every impacted partition contains hundreds of new files. Since there is one ingestion per day, this behaviour makes every subsequent merge operation slower and slower.
Versions:
- Spark: 2.4
- Delta Lake: 0.6.1
Is there a way to repartition before saving, or any other way to improve this?
CodePudding user response:
You should set the delta.autoOptimize.autoCompact property on the table to enable auto compaction.
The following page shows how to set it for both existing and new tables:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
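For example, a minimal sketch assuming the table is registered in the metastore under the placeholder name my_table (note the linked page documents this property for the Databricks Delta runtime, so it may not take effect on open-source Delta 0.6.1):

// Sketch: enable auto compaction on an existing, registered Delta table.
// "my_table" is a placeholder; a path-based table can be targeted with delta.`<path>`.
sparkSession.sql("""
  ALTER TABLE my_table
  SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
""")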
CodePudding user response:
After searching a bit in Delta core's code, I found an option that repartitions the data before the write:
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")