I am trying to read from a partitioned Delta table, perform some narrow transformations, and write the result to a new Delta table partitioned on the same columns.
Table A (partitioned on col1, col2) -> Table B (partitioned on col1, col2)
Since the partitioning strategy is the same and there are no wide transformations, my assumption is that no shuffle should be needed here.
Do I need to specify any special options when reading or writing to ensure that a shuffle is not triggered?
I tried reading the data normally and writing it back using df_B.write.partitionBy("col1", "col2")...,
but the shuffle still seems to be the bottleneck.
CodePudding user response:
I found the issue. The shuffle was happening because of the Delta table property spark.databricks.delta.optimizeWrite.enabled. Optimized writes perform a shuffle before the write in order to produce fewer, larger files. Since the partitioning strategy of the source and destination is now the same, this may no longer be needed and can be disabled.
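A minimal sketch of how this could look in PySpark, assuming a SparkSession `spark` with Delta Lake already configured; the table paths and the `col3` transformation are placeholders for illustration:

```python
from pyspark.sql import functions as F

# Assumption: `spark` is an existing SparkSession with Delta Lake support;
# /data/table_a and /data/table_b are placeholder paths.

# Disable optimized writes for this session so the pre-write shuffle is skipped.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "false")

df_a = spark.read.format("delta").load("/data/table_a")

# Narrow transformations only (no joins or aggregations), so each input
# partition maps directly to an output partition and no shuffle is required.
df_b = df_a.withColumn("col3_doubled", F.col("col3") * 2)

(df_b.write
     .format("delta")
     .mode("overwrite")
     .partitionBy("col1", "col2")  # same partition columns as the source
     .save("/data/table_b"))
```

Note that this disables optimized writes for the whole session; if other writes in the same job benefit from it, it may be better to re-enable it afterwards or scope the setting to this job only.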