repartition()
redistributes the data across the nodes by creating new partitions through a full shuffle, while coalesce()
reduces the number of partitions by merging some of the existing ones, avoiding a full shuffle.
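The difference can be sketched in plain Python (this is an illustrative model, not Spark's actual implementation; the function names `repartition_sketch` and `coalesce_sketch` are made up, and real `coalesce` groups partitions by executor locality rather than by index):

```python
def repartition_sketch(partitions, n):
    """Full shuffle: every record is reassigned by hash,
    so data moves between all partitions."""
    new = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            new[hash(record) % n].append(record)
    return new

def coalesce_sketch(partitions, n):
    """No shuffle: existing partitions are grouped and concatenated,
    so each record stays inside its original partition's contents."""
    new = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new[i % n].extend(part)
    return new

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce_sketch(parts, 2))     # original partitions kept whole, just merged
print(repartition_sketch(parts, 2))  # records scattered by hash
```

Note that the shuffle-free merge is also why `coalesce` can produce uneven partitions: whatever skew existed before is carried over into the merged output.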
Why is it always said in Spark that equally partitioned data is processed faster? What is the reason for this, and what stops unevenly distributed datasets
from being processed just as fast?
Any ideas?
CodePudding user response:
A 'partition' of data is processed by a 'task' as part of a 'stage'. A stage has many tasks that run in parallel. A Spark 'app' consists of multiple stages. The next stage can only start when the prior stage has completed.
A large partition simply has more data to process and therefore takes longer. The stage cannot complete until its slowest task finishes, so while that straggler runs, executor resources may be held exclusively longer than necessary.
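A minimal model makes the point concrete (this is an assumption about an idealized cluster, not Spark internals: it supposes enough executors that all tasks run in parallel, so stage time is roughly the maximum partition processing time):

```python
def stage_time(partition_sizes, per_record_cost=1.0):
    """Each task processes one partition; with tasks fully parallel,
    the stage finishes only when the slowest task finishes."""
    return max(size * per_record_cost for size in partition_sizes)

even   = [250, 250, 250, 250]   # 1000 records, evenly spread
skewed = [700, 100, 100, 100]   # same 1000 records, one hot partition

print(stage_time(even))    # 250.0
print(stage_time(skewed))  # 700.0
```

Both datasets hold the same 1000 records, yet the skewed layout nearly triples the stage time: three executors sit idle while the hot partition's task grinds on, and the next stage cannot start until it is done.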