I have a table that ingests new data every day through a merge. I'm currently migrating from ORC to the Delta file format, and I ran into a problem when processing the following simple merge operation:
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and, when possible, partition keys; not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge is done, every impacted partition contains hundreds of new files. Since there is one ingestion per day, this behaviour makes every subsequent merge operation slower and slower.
Versions:
- Spark: 2.4
- Delta Lake: 0.6.1
Is there a way to repartition before saving, or any other way to improve this?
CodePudding user response:
You should set the delta.autoOptimize.autoCompact property on the table to enable auto compaction.
The following page shows how to set it for both existing and new tables:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
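For example, a minimal sketch assuming the table is registered in the metastore under the placeholder name my_table (note the linked page documents this property for the Databricks Delta runtime, so it may not take effect on open-source Delta 0.6.1):

// Sketch: enable auto compaction on an existing, registered Delta table.
// "my_table" is a placeholder; a path-based table can be targeted with delta.`<path>`.
sparkSession.sql("""
  ALTER TABLE my_table
  SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
""")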
CodePudding user response:
After searching a bit in Delta core's code, I found an option that repartitions the data before the write:
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")