I am trying to upgrade my node group in AWS EKS using CDK, and I am getting the following error:
Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)
According to the AWS documentation, PodEvictionFailure can occur if the deployment tolerates every taint, so the node can never become empty:
Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.
I checked my nodes and all the pods running on them, and found the following pods, which tolerate every taint (a toleration with "operator": "Exists" and no key matches all taints).
Both of the following pods have these tolerations:
- Pod: kube-system/aws-node-pdmbh
- Pod: kube-system/kube-proxy-7n2kf
{
  ...
  ...
  "tolerations": [
    {
      "operator": "Exists"
    },
    {
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/disk-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/memory-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/pid-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/unschedulable",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/network-unavailable",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  ]
}
Do I need to change the tolerations of these pods so that they no longer tolerate all taints? If so, how, given that these pods are managed by AWS?
How can I avoid PodEvictionFailure?
CodePudding user response:
As suggested by @Ola Ekdahl, and also in the Amazon EKS documentation you shared, it's better to use the force flag rather than change the tolerations of the pods. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase", step 2).
You can add the force flag like the following and see if that helps:
new eks.Nodegroup(this, 'myNodeGroup', {
  cluster: this.cluster,
  forceUpdate: true,
  releaseVersion: '<AMI release version obtained from the changelog>',
  ...
});
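For context, forceUpdate tells EKS to continue the rolling update even if pods cannot be evicted (for example because of a pod disruption budget, or because they tolerate every taint), instead of failing with PodEvictionFailure, so you don't need to modify the tolerations of the AWS-managed aws-node and kube-proxy pods.

If your node group is defined through the cluster construct instead of new eks.Nodegroup, the same option applies. A minimal sketch, assuming aws-cdk-lib v2; the node group id and the release version placeholder are illustrative:

// Sketch only: 'myNodeGroup' is a placeholder id; fill in the release version for your cluster.
this.cluster.addNodegroupCapacity('myNodeGroup', {
  releaseVersion: '<AMI release version obtained from the changelog>',
  forceUpdate: true, // proceed with the upgrade even if pods cannot be drained
});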