I have a dataset on which I'd like to run multiple jobs in parallel. I do this by launching each action in its own thread to get multiple Spark jobs per Spark application, as the docs suggest.
Now, the task I'm running doesn't benefit endlessly from throwing more cores at it - beyond about 50 cores, the gain from adding more resources is quite minimal. So, for example, if I have 2 jobs and 100 cores, I'd like to run both jobs in parallel, each occupying at most 50 cores, to get results faster.
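For reference, the submission looks roughly like this (simplified sketch; the paths and the aggregation are just placeholders for my actual jobs):

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-jobs").getOrCreate()
    implicit val ec: ExecutionContext = ExecutionContext.global

    // Each Future runs its action on a separate thread, so Spark schedules
    // two jobs concurrently inside the same application.
    val jobA = Future {
      spark.read.parquet("/data/input-a").groupBy("key").count()
        .write.parquet("/data/output-a")
    }
    val jobB = Future {
      spark.read.parquet("/data/input-b").groupBy("key").count()
        .write.parquet("/data/output-b")
    }

    Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)
    spark.stop()
  }
}
```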
One thing I could probably do is set the number of partitions to 50, so each job could only spawn 50 tasks(?). But apparently there are performance benefits to having more partitions than available cores, for better overall utilization.
Other than that, I didn't spot anything useful in the docs for limiting the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications just to split up the executors.)
Is there any good way to do this?
CodePudding user response:
You can limit the amount of resources a Spark application uses with the spark.cores.max configuration parameter.
From the docs:
When running on a standalone deploy cluster or a Mesos cluster in "coarse-grained" sharing mode, the maximum amount of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will be spark.deploy.defaultCores on Spark's standalone cluster manager, or infinite (all available cores) on Mesos.
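For example, in code (the value 50 is just for illustration; it can equally be passed via --conf on spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Caps the whole application at 50 cores across the cluster.
// Note: this limits the application as a whole, not an individual job inside it.
val spark = SparkSession.builder()
  .appName("capped-app")
  .config("spark.cores.max", "50")
  .getOrCreate()
```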
CodePudding user response:
Perhaps asking the Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools; I've not used them myself, but perhaps they give you some more flexibility on top of fair scheduling.
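Roughly, enabling it looks like this (a sketch; the pool name and allocation file path are placeholders, and the allocation file is only needed if you define your own pools):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduling")
  .config("spark.scheduler.mode", "FAIR") // round-robin tasks between concurrent jobs
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional, only for pools
  .getOrCreate()

// In each job-submitting thread, choose the pool its jobs should run in.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool-a")
// ... run the actions for this job ...
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null) // back to the default pool
```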
Seems like conflicting requirements with no silver bullet:
1. parallelize as much as possible;
2. limit any one job from hogging resources IF (and only if) another job is running as well.
So:
- If you increase the number of partitions, you'll address #1 but not #2.
- If you specify spark.cores.max, you'll address #2 but not #1.
- If you do both (more partitions and a spark.cores.max limit), you'll still address #2 but not #1.
If you only increase the number of partitions, the only thing you're risking is that a long-running big job will delay the completion of some smaller jobs; overall it will take the same amount of time to run two jobs on the given hardware in any order, as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
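If you do raise the partition count, it's a one-liner per job (the numbers and path are arbitrary; a small multiple of your core count is a common rule of thumb):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioning").getOrCreate()

// Partitions produced by shuffles (joins, aggregations) in Spark SQL:
spark.conf.set("spark.sql.shuffle.partitions", "200")

// Or repartition a specific DataFrame explicitly:
val df = spark.read.parquet("/data/input").repartition(200)
// (For RDD-based jobs, spark.default.parallelism plays a similar role.)
```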
Bottom line, IMO:
- Don't touch spark.cores.max.
- Increase partitions if you're not using all your cores.
- Use fair scheduling (a sample pool configuration is sketched after this list).
- If you have strict latency/response-time requirements, use separate auto-scaling clusters for long-running and short-running jobs.
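If you go the fair-scheduling route, pools are what let you bias resources per job. A sketch of a fairscheduler.xml with assumed pool names (the weight and minShare values are purely illustrative):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Long-running batch jobs: lower weight, no guaranteed minimum -->
  <pool name="long-jobs">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
  <!-- Short, latency-sensitive jobs: higher weight and a guaranteed minimum share -->
  <pool name="short-jobs">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>10</minShare>
  </pool>
</allocations>
```

Each job-submitting thread then opts into a pool with spark.sparkContext.setLocalProperty("spark.scheduler.pool", "short-jobs") before running its actions.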