How do the YARN and Spark parameters interplay?


There are YARN parameters that determine the maximum, minimum, and total memory and CPU that YARN can allocate via containers.

For example:

yarn.nodemanager.resource.memory-mb

yarn.scheduler.maximum-allocation-mb

yarn.scheduler.minimum-allocation-mb

yarn.nodemanager.resource.cpu-vcores

yarn.scheduler.maximum-allocation-vcores

yarn.scheduler.minimum-allocation-vcores

There are also Spark-side parameters that would seemingly control similar kinds of allocation:

spark.executor.instances

spark.executor.memory

spark.executor.cores

etc.

What happens when the two sets of parameters are infeasible with respect to the bounds set by the other? For example: what if yarn.scheduler.maximum-allocation-mb is set to 1 GB and spark.executor.memory is set to 2 GB? Similar conflicts and infeasibilities can be imagined for the other parameters as well.

What happens in such cases? And what is the suggested way to set these parameters?
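For concreteness, the Spark side of such a conflicting setup might be submitted like this. This is a minimal PySpark sketch with made-up values (app name, cores, instance count); the YARN limit itself would live in yarn-site.xml on the cluster and is only assumed here:

```python
from pyspark.sql import SparkSession

# The cluster is assumed to have yarn.scheduler.maximum-allocation-mb = 1024.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("conflicting-memory-example")      # hypothetical app name
    .config("spark.executor.memory", "2g")      # larger than the assumed 1 GB container maximum
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```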

CodePudding user response:

When running Spark on YARN, each Spark executor runs as a YARN container.

So take spark.executor.memory as an example:

  • If spark.executor.memory is 2G and yarn.scheduler.maximum-allocation-mb is 1G, then YARN cannot grant a container that large: the executor request exceeds the scheduler's maximum, and the Spark application will fail to launch its executors (with an error pointing you at yarn.scheduler.maximum-allocation-mb)
  • If spark.executor.memory is 2G and yarn.scheduler.minimum-allocation-mb is 4G, then YARN rounds the request up to the minimum allocation, so the container is much bigger than the Spark application actually needs and the extra memory is wasted
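A small sketch of the arithmetic behind those two cases follows. The numbers and the rounding rule are simplified assumptions, not YARN's exact scheduler code:

```python
import math

def yarn_container_size(request_mb, min_alloc_mb, max_alloc_mb):
    """Container size YARN would grant for a request, or None if it refuses."""
    if request_mb > max_alloc_mb:
        return None  # above yarn.scheduler.maximum-allocation-mb
    # requests are normalized up to a multiple of yarn.scheduler.minimum-allocation-mb
    return max(min_alloc_mb, math.ceil(request_mb / min_alloc_mb) * min_alloc_mb)

# Spark asks YARN for executor memory plus off-heap overhead,
# roughly max(384 MB, 10% of spark.executor.memory) by default.
executor_memory_mb = 2048
request_mb = executor_memory_mb + max(384, executor_memory_mb // 10)   # 2432 MB

print(yarn_container_size(request_mb, min_alloc_mb=1024, max_alloc_mb=1024))  # None -> executors never launch
print(yarn_container_size(request_mb, min_alloc_mb=4096, max_alloc_mb=8192))  # 4096 -> ~1.6 GB of the container sits unused
```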

Suggestions for setting these parameters depend on your hardware resources and on the other services running on the machines. You can start with the default values and then make adjustments while monitoring machine resources.
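As an illustration only, here is one way the Spark side might be sized to fit inside assumed YARN limits (a node with yarn.nodemanager.resource.memory-mb = 16384, yarn.scheduler.maximum-allocation-mb = 16384, and 8 vcores; every value below is an assumption, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sized-to-fit-yarn")                    # hypothetical app name
    .config("spark.executor.instances", "6")         # total executors for the app
    .config("spark.executor.cores", "2")             # must not exceed yarn.scheduler.maximum-allocation-vcores
    .config("spark.executor.memory", "4g")           # heap per executor
    .config("spark.executor.memoryOverhead", "512m") # 4g + 512m = 4.5g requested per container
    .getOrCreate()
)
```

Each container request (here 4g + 512m) must stay below yarn.scheduler.maximum-allocation-mb, and the containers that land on one node must together fit within yarn.nodemanager.resource.memory-mb.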

CodePudding user response:

This excellent Cloudera Community thread, https://community.cloudera.com/t5/Support-Questions/Yarn-container-size-flexible-to-satisfy-what-application-ask/m-p/115458, and the question "Difference between `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`?" should give you the basics. Additionally, there is a good related Stack Overflow answer, "Spark on YARN resource manager: Relation between YARN Containers and Spark Executors".

TL;DR

  • Since you are not talking about Kubernetes, YARN acts as the resource / cluster manager and allocates executors with the needed resources, based on the Spark parameters / defaults, within the bounds set by those YARN container parameters.

  • 1 container = 1 executor. Some sources incorrectly state that 1 container holds N executors; that is not so.

  • There is a minimum and a maximum allocation of resources, based on those YARN parameters. So YARN will grant executors with some wastage of resources if it can (requests are rounded up to the minimum allocation), or it will restrict requests that exceed the maximum.

  • With dynamic resource allocation, apps can start with fewer resources and scale up; without it, there can be a wait to acquire all the requested resources up front, and once acquired they are not available to other applications (a minimal sketch of the dynamic-allocation settings follows this list).

  • There is also the Fair Scheduler, for smoother, more uniform throughput across many concurrent applications.

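As a rough illustration of the dynamic-allocation point above, with assumed, illustrative values (enabling spark.shuffle.service.enabled also requires the external shuffle service to be running on the YARN NodeManagers):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")               # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.initialExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```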