Can someone explain what happens if the data is much bigger than the driver's memory? How does Spark work in that case? If it caches the data on disk, how is it still "in-memory computing"? Any help will be appreciated.
CodePudding user response:
Only instructions come from the driver. Data is stored and computed on the executors. Even if the data does not fit on the driver, it should fit in the total available memory of the executors. As long as you do not perform a collect (which brings all the data from the executors to the driver), you should have no issue.
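To make the distinction concrete, here is a minimal PySpark sketch (the path and app name are made up) contrasting a distributed action, which only returns a small result to the driver, with collect, which pulls everything back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-executors").getOrCreate()

# Hypothetical path; any dataset much larger than driver memory behaves the same way.
df = spark.read.parquet("hdfs:///data/events")

# Safe: the counting happens on the executors; only a single number returns to the driver.
print(df.count())

# Risky: collect() ships every row to the driver and can fail with an OutOfMemoryError
# when the data does not fit in driver memory.
# rows = df.collect()
```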
CodePudding user response:
@Steven has answered it; adding my 2 cents here.
A Spark driver is a single machine that obviously cannot hold terabytes of data. Say you have a file (or multiple files) totalling 10 TB and you want Spark to run some computation on it. The driver machine only builds a logical plan for how to perform that computation, and it comes up with the best possible, optimized plan (with the help of Tungsten and the Catalyst optimizer). The file itself is partitioned, and each partition is read by executors running on machines close to where the blocks live (per data locality in HDFS).
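You can see this plan-building step yourself. A rough sketch (the path and column name are hypothetical): nothing is read until an action runs, and explain() shows the plan the driver has produced without moving any data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-only-on-driver").getOrCreate()

# Hypothetical 10 TB dataset: the driver only builds a plan here; no data is read yet.
df = spark.read.parquet("hdfs:///data/10tb_dataset")
active = df.filter(df["status"] == "active")

# explain() prints the logical and physical plans produced by Catalyst;
# the actual reading of partitions happens on the executors once an action runs.
active.explain(True)
```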
So Spark can handle data much bigger than the driver, since the read operation doesn't happen on the driver. Certain actions, like collect, can overwhelm the driver because they bring all the executor results back to the driver before the final result is written out.
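If you do need results out of a large DataFrame, there are patterns that avoid funnelling everything through the driver. A sketch under assumed paths (all names below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avoid-collect").getOrCreate()
df = spark.read.parquet("hdfs:///data/events")  # hypothetical path

# Prefer a distributed write: each executor writes its own partitions,
# so nothing large ever passes through the driver.
df.write.mode("overwrite").parquet("hdfs:///output/events_copy")

# If you only need a small preview on the driver, bound it explicitly.
preview = df.limit(20).collect()

# Or iterate partition by partition instead of materializing everything at once;
# only one partition's worth of rows is on the driver at a time.
for row in df.toLocalIterator():
    break  # process rows incrementally here
```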