Avoid shuffle while writing data into a partitioned table from another partitioned table


I am trying to read from a partitioned Delta table, perform some narrow transformations, and write the result to a new Delta table that is partitioned on the same fields.

Table A (partitioned on col1, col2) -> Table B (partitioned on col1, col2)

Since the partitioning strategy is the same and there are no wide transformations, my assumption is that no shuffle should be needed here.

Do I need to specify any special options when reading or writing to ensure that a shuffle is not triggered?

I tried reading the data normally and writing it back with `df_B.write.partitionBy("col1", "col2")...`, but the shuffle still appears to be the bottleneck.
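
For reference, a minimal sketch of the pattern described above, assuming a Spark/Databricks environment; the table paths and the `col3` transformation are hypothetical placeholders, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the partitioned source table (path is a made-up example).
df_a = spark.read.format("delta").load("/mnt/tables/table_a")

# Narrow transformations only: each output row depends on a single
# input row, so the existing partition layout can be preserved
# without a shuffle.
df_b = df_a.withColumn("col3_upper", F.upper(F.col("col3")))

# Write back, partitioned on the same columns as the source.
(df_b.write
     .format("delta")
     .mode("overwrite")
     .partitionBy("col1", "col2")
     .save("/mnt/tables/table_b"))
```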

CodePudding user response:

I found the issue. The shuffle I was seeing was caused by the Delta table property spark.databricks.delta.optimizeWrite.enabled, which inserts an adaptive shuffle before the write to produce better-sized files. Since the source and destination now share the same partitioning strategy, optimized writes may not be needed here.
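A sketch of how optimized writes could be turned off, assuming a Databricks runtime where this property applies; the table name is a hypothetical placeholder:

```python
# Disable Delta optimized writes for this session, so the adaptive
# pre-write shuffle is skipped (the partition layout already matches
# the target table).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "false")

# Alternatively, it can be disabled per table via a table property:
# spark.sql("""
#     ALTER TABLE table_b
#     SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = false)
# """)
```

Note that optimized writes exist to avoid many small files, so disabling them trades shuffle cost for potentially more files per partition.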
