After upgrading my EKS cluster to 1.22 (first the control plane, then the nodes), the managed node group update finished successfully, but after some time EKS decided to provision a new node, and for whatever reason it came up with the old Kubernetes version.
If I run kubectl get nodes, I can see that one of the nodes is still running 1.21:
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-13-10-186.ap-northeast-1.compute.internal   Ready    <none>   4h39m   v1.22.12-eks-ba74326
ip-10-13-26-91.ap-northeast-1.compute.internal    Ready    <none>   3h3m    v1.21.14-eks-ba74326
ip-10-13-40-42.ap-northeast-1.compute.internal    Ready    <none>   4h33m   v1.22.12-eks-ba74326
If I check my managed node group, I can see that it's actually on version 1.22:
eksctl get nodegroup default-20220901053307980400000010 --cluster mycluster-dev -o yaml
- AutoScalingGroupName: eks-default-20220901053307980400000010-f8c17xc4-f750-d608-d166-113925c1g9c5
  Cluster: mycluster-dev
  CreationTime: "2022-09-01T05:33:11.484Z"
  DesiredCapacity: 3
  ImageID: AL2_x86_64
  InstanceType: t3.large
  MaxSize: 3
  MinSize: 3
  Name: default-20220901053307980400000010
  NodeInstanceRoleARN: arn:aws:iam::XXXXXXXXXXX:role/eks_node_group_dev_role-20220901052135822100000004
  StackName: ""
  Status: ACTIVE
  Type: managed
  Version: "1.22"
I can also see in the AWS Console that the version is 1.22.
I tried running the upgrade command again, to no avail.
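For reference, the upgrade command looked roughly like this (a sketch assuming eksctl, using the cluster and node group names from above; the exact invocation may have differed):

eksctl upgrade nodegroup \
  --cluster mycluster-dev \
  --name default-20220901053307980400000010 \
  --kubernetes-version 1.22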
On a previously deployed cluster, I also tried manually deleting a node from my managed group, but instead of EKS deploying a new node to replace it, I was somehow just left with 2 nodes.
My question is: how can I force the replacement of this node, in the hope that it will be launched with the correct kubelet version?
CodePudding user response:
Honestly, I sometimes see illogical behavior from EKS myself, so your case doesn't surprise me.
In my opinion, you should apply the change at the Auto Scaling group level: locate the Auto Scaling group associated with your target node group and remove the node from it manually, by detaching it first and terminating it afterwards. In my experience, changes applied to the node group sometimes take a while to be reflected at the Auto Scaling group level, and I believe the interaction between the two is occasionally affected by this.
Before detaching the node, make sure the minimum and the desired capacity are set to 3; when detaching it, AWS will ask whether you want to replace the node, so say Yes.
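A rough CLI equivalent of the detach-and-terminate flow (a sketch assuming the AWS CLI; the instance ID is a placeholder for the 1.21 node, and the ASG name is the one from your node group output):

# Detach the stale node without decrementing desired capacity,
# so the ASG launches a replacement instance in its place
aws autoscaling detach-instances \
  --auto-scaling-group-name eks-default-20220901053307980400000010-f8c17xc4-f750-d608-d166-113925c1g9c5 \
  --instance-ids i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity

# Once the replacement node has joined the cluster and is Ready,
# terminate the detached instance
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0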
Check the Launch Template version
It wasn't clear to me whether you set the instance type via the node group or via the launch template attached to it. Since you confirmed in the comments that it's the launch template, you should verify that the node group is always using its latest version so that the related changes are reflected.
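To confirm this (a sketch using the AWS CLI; the launch template ID is a placeholder), you can compare the version the node group is configured with against the template's latest version:

# Launch template name/ID and version the managed node group is using
aws eks describe-nodegroup \
  --cluster-name mycluster-dev \
  --nodegroup-name default-20220901053307980400000010 \
  --query 'nodegroup.launchTemplate'

# Latest version of the launch template itself
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 \
  --versions '$Latest'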