I have been reading a lot about RDDs, but there is something I don't quite understand: how is an RDD distributed when there is no replication in Apache Spark?
The post [1] says that:
To achieve Spark fault tolerance for all the RDDs, the entire data is copied across multiple nodes in the cluster.
As per my understanding, if this is the case then there should be data replication, but most articles say the DAG is how Spark achieves fault tolerance.
Can someone explain this in a bit more detail?
CodePudding user response:
Your assumption about replication is either wrong or correct, depending on your perspective.
Spark itself replicates nothing, as it is designed to work in-memory. If data is persisted to HDFS or S3, then those storage systems handle the replication.
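To make that concrete, here is a minimal spark-shell sketch (the HDFS URL and output path are placeholders): the single in-memory copy of each partition is all Spark holds on to, and any replication only happens on the HDFS side once the output is written.

    // Run in spark-shell, where the SparkContext `sc` is already defined.
    // The HDFS URL and output path are placeholders.
    val result = sc.parallelize(1 to 100).map(_ * 2)

    // While the job runs, each partition of `result` exists exactly once,
    // on whichever executor computed it; Spark keeps no second copy.
    // Once the output is written to HDFS, the HDFS replication factor
    // (dfs.replication, typically 3) decides how many copies of each block
    // exist; Spark plays no part in that.
    result.saveAsTextFile("hdfs://namenode:8020/tmp/doubled-output")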
There seems to be a lot of copying and pasting going on on the Internet where Spark fault tolerance is concerned, and that 'misinformation' therefore keeps getting propagated.
RDD lineage or checkpointing helps restore data that needs to be re-computed, either from the start or from a checkpoint location on disk.
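Here is a minimal spark-shell sketch of both mechanisms; the checkpoint directory is a placeholder, and in practice it would usually live on HDFS or S3 so the checkpoint survives node loss.

    // Run in spark-shell, where `sc` is already defined.
    // The checkpoint directory is a placeholder path.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val base        = sc.parallelize(1 to 1000, 4)
    val transformed = base.map(_ * 2).filter(_ % 3 == 0)

    // The lineage records the chain of transformations Spark would replay
    // to rebuild any lost partition from the original data.
    println(transformed.toDebugString)

    // Checkpointing writes the RDD's partitions to the checkpoint directory
    // and truncates the lineage, so recovery reads from disk instead of
    // recomputing from the start.
    transformed.checkpoint()
    transformed.count()                // the action triggers the computation and the checkpoint
    println(transformed.toDebugString) // the lineage now starts at the checkpoint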
CodePudding user response:
Data replication is the process of creating multiple copies of the same data. There is no data replication as you see in other systems like Kafka, Pinot etc., since Spark is a data processing engine rather than a data store. That being said, when data is read it is split into smaller units (partitions) that are spread across the nodes, and further transformations are applied to those partitions. Hence the term distributed.
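For example, reading a file immediately produces a partitioned RDD, and you can see how many pieces it was split into (a sketch; the input path is a placeholder):

    // Run in spark-shell, where `sc` is already defined.
    // The input path is a placeholder.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt", 8)

    // The file is split into partitions; each partition is held by one
    // executor, and transformations run on the partitions in parallel.
    println(s"partitions: ${lines.getNumPartitions}")

    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)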
Spark achieves fault tolerance through lineage graphs. A lineage graph keeps track of the transformations to be executed on an RDD once an action is called, and it is what Spark uses to recompute lost or damaged RDD partitions.
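You can inspect that lineage with toDebugString. Nothing runs until the action is called, and if a partition is lost afterwards Spark replays just the recorded transformations needed to rebuild it, rather than keeping a replica up front. A minimal sketch:

    // Run in spark-shell, where `sc` is already defined.
    val numbers = sc.parallelize(1 to 10000)

    // Each transformation only adds a node to the lineage graph;
    // no data is computed yet.
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n * n)

    // Print the lineage graph Spark will use to (re)compute partitions.
    println(squared.toDebugString)

    // The action finally triggers execution.
    println(squared.sum())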