I am trying to upgrade my node group in AWS EKS using CDK, and I am getting the following error:
Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)
According to the AWS documentation, PodEvictionFailure can occur if the deployment tolerates every taint, so the node can never become empty:
Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.
I checked my nodes and all the pods running on them, and found the following pods, which tolerate every taint (a toleration with "operator": "Exists" and no key matches all taints).
Both of the following pods have these tolerations:
- Pod: kube-system/aws-node-pdmbh
- Pod: kube-system/kube-proxy-7n2kf
{
  ...
  ...
  "tolerations": [
    {
      "operator": "Exists"
    },
    {
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/disk-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/memory-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/pid-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/unschedulable",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/network-unavailable",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  ]
}
Do I need to change the tolerations of these pods so that they no longer tolerate all taints? If so, how, given that these pods are managed by AWS?
How can I avoid PodEvictionFailure?
CodePudding user response:
As suggested by @Ola Ekdahl, and also in the Amazon EKS documentation you shared, it's better to use the force flag rather than change the tolerations of the pods. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase", step 2).
You can add the force flag like the following and see if that helps:
new eks.Nodegroup(this, 'myNodeGroup', {
  cluster: this.cluster,
  forceUpdate: true,
  releaseVersion: '<AMI release version obtained from the changelog>',
  ...
});
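For context, forceUpdate tells EKS to continue the rolling update even if pods cannot be evicted (for example because of a pod disruption budget, or because they tolerate every taint), instead of failing with PodEvictionFailure, so you don't need to modify the tolerations of the AWS-managed aws-node and kube-proxy pods.

If your node group is defined through the cluster construct instead of new eks.Nodegroup, the same option applies. A minimal sketch, assuming aws-cdk-lib v2; the node group id and the release version placeholder are illustrative:

// Sketch only: 'myNodeGroup' is a placeholder id; fill in the release version for your cluster.
this.cluster.addNodegroupCapacity('myNodeGroup', {
  releaseVersion: '<AMI release version obtained from the changelog>',
  forceUpdate: true, // proceed with the upgrade even if pods cannot be drained
});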