The difference between coalesce and repartition is fairly straightforward. If I were to coalesce a DataFrame to 1 partition and write it to a storage service (Azure Blob, AWS S3, etc.), would the entire DataFrame be sent to the driver and then to the storage service, or would an executor send it directly?
CodePudding user response:
The Spark official documentation describes it as follows:
If you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
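From the above it can be inferred that an executor sends it directly. With numPartitions = 1 the work is concentrated in a single task on one executor node, and that task writes its partition to the storage service. The data is not routed through the driver; the driver only receives the data when you call an action such as collect().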
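As a rough illustration, here is a minimal PySpark sketch of the pattern in the question. The helper name write_single_partition, the demo function, the Parquet output format, and the /tmp/coalesce_demo path are all assumptions made for this example, not anything from the question. Note that in local mode the driver and executor share one JVM, so the demo only shows the API usage; on a real cluster the write task would run on an executor.

```python
def write_single_partition(df, path):
    """Coalesce a DataFrame to one partition and write it out.

    Hypothetical helper for illustration. The write runs as a single task;
    no collect() is involved, so rows are not pulled to the driver.
    """
    single = df.coalesce(1)  # narrow dependency, no full shuffle
    single.write.mode("overwrite").parquet(path)
    return single


def demo(path="/tmp/coalesce_demo"):
    """Assumed local demo; requires pyspark to be installed."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("coalesce-demo")
        .getOrCreate()
    )
    df = spark.range(0, 1000).repartition(8)  # start with 8 partitions
    single = write_single_partition(df, path)
    # Report how many partitions (and therefore write tasks) remain.
    print(single.rdd.getNumPartitions())
    spark.stop()
```

Calling demo() should leave a directory at the given path containing one part file plus Spark's marker files. For S3 or Azure Blob you would pass the corresponding URI instead, assuming the connector and credentials are already configured.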