I was going through the Hadoop and Spark documentation to understand how Spark works on a Hadoop cluster. According to the Hadoop documentation, a Hadoop cluster is a group of commodity machines with computation and data storage capacity, and it is built on the assumption that "moving computation is cheaper than moving data".
Now, when I process a large file stored on HDFS using Spark, will Spark redistribute the data in this file across the Hadoop cluster randomly, or is it aware of the nodes where the data partitions are stored and will it ask those nodes to process their own data? I have this question because there is no mention of how Spark handles data partitions on a Hadoop cluster.
And if Spark does redistribute the data, what is the logic behind accepting this redistribution overhead?
CodePudding user response:
TL;DR: No, Spark does not move data (in HDFS) in order to complete calculations.
Spark does try to allocate containers on the nodes where the data is located. (This is known as data locality: data is processed on the same node where it is stored.) If those nodes are busy, Spark has to allocate containers on other nodes and ship the data to them over the network (via a shuffle). The data that gets moved to other nodes consists of intermediate files on disk; it is not permanently moved and is cleaned up over time. These intermediate files turn out to be handy, as they can act as an internal RDD cache.
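You can see the locality information Spark works with yourself. Below is a minimal Scala sketch (the HDFS path is hypothetical; adjust it to your cluster). When Spark reads a file from HDFS, it asks the NameNode for the block locations and records them as "preferred locations" for each partition, which the scheduler then tries to honor when placing tasks:

```scala
import org.apache.spark.sql.SparkSession

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LocalityCheck").getOrCreate()
    val sc = spark.sparkContext

    // Reading a file from HDFS: each HDFS block becomes a partition,
    // and the block's DataNode hosts become that partition's preferred locations.
    // "hdfs:///data/large-file.txt" is a placeholder path for illustration.
    val rdd = sc.textFile("hdfs:///data/large-file.txt")

    // Print the hosts each partition would ideally be scheduled on.
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p).mkString(", ")
      println(s"Partition ${p.index}: preferred hosts = $hosts")
    }

    spark.stop()
  }
}
```

If the hosts printed here match your DataNodes, the scheduler will try to run the corresponding tasks on those nodes; only when a preferred node has no free executor slots within the locality wait window does the task run elsewhere and read the block over the network.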