What is the difference between working with clusters in Spark and parallel operations locally?


I have been studying big data for a while and I am trying to use PySpark, but at some point I got really confused. As I understand it, Spark parallelizes work automatically through its RDDs. So why do we use clusters beyond this local parallelization? Or do we only use cluster mode for really big data (I am not talking about deploy mode, just 2, 3 or 4 slaves)? Here is how I picture parallelization: my computer has 12 cores, so I think of those 12 cores as individual computers, as if I had 12 machines. With that picture, it seems unnecessary to use a cluster, for example one master node and 2 slave nodes on EMR. And when I have 2 slaves, does the parallelization continue on them too? For example, if I have 2 slaves and each of them has 12 cores like my computer, do I then have 24 cores? If this is confusing or the title is wrong or deficient, I can edit. Thanks in advance.

CodePudding user response:

  • It's true that CPUs determine the unit of parallelization in Spark.
  • Spark can process one task per CPU core concurrently.
  • So whether you have a single machine with 12 cores or 12 machines with 1 core each, you will be able to process 12 Spark tasks at a time (see the sketch below).
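For instance, here is a minimal PySpark sketch (assuming a 12-core machine, as in the question; the master setting and numbers are only examples) showing that local mode simply turns those cores into task slots:

    # Minimal local-mode sketch: 12 cores -> 12 concurrent task slots.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[12]")          # use 12 local cores as 12 task slots
        .appName("local-parallelism")
        .getOrCreate()
    )

    # In local mode the default parallelism follows the number of cores requested.
    print(spark.sparkContext.defaultParallelism)   # -> 12

    rdd = spark.sparkContext.parallelize(range(1_000_000))
    print(rdd.getNumPartitions())                  # also 12 unless overridden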

There are a few reasons why it's better to have multiple machines rather than one huge machine, but you probably won't notice them much until you start scaling quite a bit beyond 12 cores.

1. Scalability

  • It's easier to scale horizontally (adding more machines) than it is to scale vertically (getting bigger machines).
  • Take your 12-core machine: say you happily use 12 cores for a year and then realise that your job has grown and you now want to hit it with 24 cores. It's nice to be able to just buy another 12-core machine and string the two together rather than having to buy a whole new 24-core machine.
  • This compounds as you scale up. If you have a 2000-core cluster and you want 10 more cores, you'd much rather add a single machine with 10 cores than buy a new 2010-core machine (if such a thing even exists).
  • The same is true for other resources such as RAM (see the configuration sketch below).
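As a hedged configuration sketch of the same idea on a cluster: total concurrent tasks are roughly spark.executor.instances × spark.executor.cores, so two 12-core workers give about 24 task slots. The instance counts and memory size below are placeholders, and the master URL is assumed to be supplied externally (e.g. by spark-submit):

    # Rough sketch: two 12-core executors ~ 24 concurrent tasks.
    # These settings only take effect against a cluster manager
    # (YARN, Kubernetes, standalone), not in local mode; the master
    # is assumed to be set by spark-submit or the environment.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("horizontal-scaling")
        .config("spark.executor.instances", "2")   # e.g. two 12-core workers...
        .config("spark.executor.cores", "12")      # ...gives ~24 concurrent tasks
        .config("spark.executor.memory", "8g")     # placeholder memory size
        .getOrCreate()
    )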

2. Cloud pricing

  • With cloud services such as EMR, you can run your job on some ultra-reliable on-demand instances (expensive) alongside some ultra-cheap spot instances that can be taken offline at any time.
  • A common pattern is to have a master node and 2 core nodes on demand that will run no matter what, then supplement these with, say, 20 more nodes from the spot market that have the potential to be taken offline (see the sketch below).
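If it helps, here is a hedged boto3 sketch of that pattern (instance types, counts, release label and IAM role names are placeholders; in an instance-groups configuration the extra spot capacity is usually added as task nodes):

    # Hedged sketch: on-demand master + core nodes, plus spot capacity that
    # may be reclaimed at any time. All names/types below are placeholders.
    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="spark-spot-example",
        ReleaseLabel="emr-6.15.0",                 # placeholder EMR release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge", "InstanceCount": 20},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )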

Counter argument

Incidentally, there are also some reasons why more machines can mean more problems:

  1. Single node clusters are easier to manage
  2. In theory it should be faster to shuffle data between partitions if all of your partitions are located on the same machine (a shuffle-inducing operation is sketched below).
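For context, a shuffle is what happens when rows have to move between partitions, e.g. to bring matching keys together. A minimal sketch (the column names and data are made up):

    # groupBy forces rows with the same key onto the same partition (a shuffle);
    # if all partitions live on one machine, that movement never crosses the network.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[12]").getOrCreate()

    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    df.groupBy("key").sum("value").show()   # triggers a shuffle
    df.repartition(24, "key")               # explicit shuffle by key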

If you're using 12 cores, I think you would be much better off using single-node Spark than trying to set up a cluster of machines.

CodePudding user response:

A computer is not just the number of cores it has. It also has other resources such as RAM and disk.

When working with big data, the amount of data is often so large that it can't fit in a single machine's RAM, which is why we use a cluster of machines: between them there is enough RAM to fit the dataset in memory.

Additionally, if your data is distributed across these machines' disks, you benefit from having multiple machines reading their own subset of the data in parallel, so you don't have to wait on a single machine's disk I/O before beginning any computation. This also helps when persisting the results of the computation back to disk, since each machine can write its portion in parallel.
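As a rough PySpark sketch of that parallel read and write (the paths and the column name are placeholders):

    # Each executor reads its own file splits and writes its own output files,
    # so no single machine becomes the disk bottleneck. Paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-io").getOrCreate()

    df = spark.read.parquet("s3://my-bucket/events/")        # read in parallel
    result = df.groupBy("event_type").count()                # placeholder column
    result.write.mode("overwrite").parquet("s3://my-bucket/event_counts/")  # write in parallel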

Finally, more machines indeed means more CPUs, which means more parallel computation than you would get from only one worker.
