EMR EKS unable to launch driver pod

Time:12-23

How does one go about setting resource limits in EMR on EKS? My driver pod is failing to launch because its CPU request is higher than its CPU limit, which doesn't make sense to me. I am running the getting-started code from the AWS docs.

I have added --conf spark.driver.limit.cores=2 in order to try and make the limit higher than what is listed in the error message below. I got this idea from here https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties

This cluster does have Istio running in it. I am not sure whether that could cause issues.

Here is the code I am running to trigger the job

aws emr-containers start-job-run \
  --virtual-cluster-id blahblah \
  --name pi-4 \
  --execution-role-arn arn:aws:iam::0000000000:role/blahblah_emr_eks_executor_role \
  --release-label emr-6.4.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
      "entryPointArguments": ["s3://blahblah/wordcount_output"],
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.driver.limit.cores=2"
    }
  }'

This causes the job-runner container to fail with the following:


State: Terminated Reason: Error Message: Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: Pod "spark-00000002vepbpmi2hkv-driver" is invalid: spec.containers[2].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[2].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark-00000002vepbpmi2hkv-driver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark-00000002vepbpmi2hkv-driver" is invalid: spec.containers[2].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}). at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:589)


Any ideas on how to proceed?

CodePudding user response:

I was able to figure it out.

aws emr-containers start-job-run \
  --virtual-cluster-id=blahblah \
  --name=pi-4 \
  --execution-role-arn=arn:aws:iam::blahblahaccount:role/balblah_role_name \
  --release-label=emr-6.4.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.kubernetes.executor.request.cores=1 --conf spark.kubernetes.executor.limit.cores=2 --conf spark.kubernetes.driver.request.cores=1 --conf spark.kubernetes.driver.limit.cores=2"
    }
  }'
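
Once a job like the one above is accepted, it can help to confirm what CPU request and limit the driver pod actually received. The following is a minimal sketch, not an official procedure. It assumes kubectl access to the EKS cluster, the spark namespace that appears in the error message, and the spark-role=driver label that Spark on Kubernetes puts on driver pods. The function name show_driver_cpu is made up for illustration.

```shell
# Hypothetical helper: print the CPU request and limit of every container
# in each Spark driver pod of a namespace.
# Assumes kubectl is already configured for the cluster behind the virtual cluster.
show_driver_cpu() {
  local namespace="${1:-spark}"   # "spark" comes from the error message; adjust as needed
  kubectl get pods -n "$namespace" -l spark-role=driver \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]}{"  "}{.name}{": request="}{.resources.requests.cpu}{" limit="}{.resources.limits.cpu}{"\n"}{end}{end}'
}

# Usage (needs a running or completed driver pod):
#   show_driver_cpu spark
```

Containers are listed in pod-spec order, so on a cluster with sidecar injection (Istio, for example) the listing should show which container an index such as spec.containers[2] refers to.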

It seems that the AWS docs are wrong and that the configuration values are actually as follows.

  • --conf spark.kubernetes.{driver|executor}.request.cores
  • --conf spark.kubernetes.{driver|executor}.limit.cores

However, the AWS docs have you pass in --conf spark.driver.cores=1. This value didn't seem to be acknowledged, which I believe caused my error. The Spark configuration docs below mention that spark.kubernetes.driver.request.cores takes precedence over spark.driver.cores, which I think makes sense, as the value was recognized when I passed it in.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#configuration
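
Since the API server rejects a pod whose CPU request exceeds its limit, the check can be done locally before submitting. Below is a rough sketch, assuming the property names as they appear on the Spark configuration page linked above (spark.kubernetes.{driver|executor}.request.cores and spark.kubernetes.{driver|executor}.limit.cores). The helper names cpu_to_millis and cpu_conf and the variable SPARK_PARAMS are hypothetical, and the values 1 and 2 are the ones from the command above.

```shell
# Convert a Kubernetes CPU quantity ("1", "0.5", "500m") to millicores.
cpu_to_millis() {
  local q="$1"
  case "$q" in
    *m) echo "${q%m}" ;;                                              # already millicores
    *)  awk -v v="$q" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;    # cores to millicores
  esac
}

# Print the --conf flags for one role (driver or executor).
# Fails with a message on stderr if the request is larger than the limit.
cpu_conf() {
  local role="$1" request="$2" limit="$3"
  if [ "$(cpu_to_millis "$request")" -gt "$(cpu_to_millis "$limit")" ]; then
    echo "error: $role CPU request $request exceeds limit $limit" >&2
    return 1
  fi
  echo "--conf spark.kubernetes.${role}.request.cores=${request} --conf spark.kubernetes.${role}.limit.cores=${limit}"
}

# Build the CPU part of sparkSubmitParameters; stop if either pair is invalid.
DRIVER_CONF="$(cpu_conf driver 1 2)" || exit 1
EXECUTOR_CONF="$(cpu_conf executor 1 2)" || exit 1
SPARK_PARAMS="$DRIVER_CONF $EXECUTOR_CONF"
echo "$SPARK_PARAMS"
```

The printed string can be pasted into "sparkSubmitParameters" next to the other settings. This only validates the values passed on the command line. It cannot see limits applied by the cluster itself.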
