I am working on an Apache Spark standalone cluster with 2 executors, each with 1g of heap space and 8 cores.
I load an input file of size 2.7 GB into a dataframe df. This completed successfully using 21 tasks, that is, 21 partitions in total across the whole cluster.
Now I tried writing this out to CSV using only 1 partition, so that all my records end up in 1 CSV file.
df.coalesce(1).write.option("header","true").csv("output.csv")
I expected to get an OOM error, since the total usable memory of an executor is less than 2.7 GB. But this did not happen.
Why did my task not fail even though the single partition holds more data than an executor has memory? What exactly is happening under the hood?
CodePudding user response:
The original CSV file is 2.7 GB in its raw format (text-based, no compression). When you read that file into a dataframe, Spark splits the data into multiple partitions based on the configuration spark.sql.files.maxPartitionBytes, which defaults to 128 MB. Doing the math gives 2700 MB / 128 MB ≈ 21 partitions, which matches the 21 tasks you observed.
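The partition arithmetic can be sketched in a few lines of plain Python. This is only a rough estimate under stated assumptions: the helper name estimate_partitions is hypothetical, "2.7 GB" is taken to mean 2,700,000,000 bytes, and the calculation ignores the other inputs Spark considers when sizing splits (such as the file open cost and the number of available cores).

```python
import math

# Default value of spark.sql.files.maxPartitionBytes: 128 MiB.
MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough estimate of the number of input partitions for one splittable file.

    Hypothetical helper for illustration: it simply divides the file size by
    the maximum partition size and rounds up, which is a simplification of
    what Spark actually does.
    """
    return math.ceil(file_size_bytes / max_partition_bytes)


# Assumption: the 2.7 GB file is 2,700,000,000 bytes.
print(estimate_partitions(2_700_000_000))  # prints 21
```

On a running session, the actual value can be checked with df.rdd.getNumPartitions() and compared against this estimate.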