After upgrading my EKS cluster to 1.22 (first the control plane, then the nodes), the managed node group update finished successfully, but after some time EKS decided to provision a new node, and for whatever reason it came up with the old Kubernetes version.
If I run kubectl get nodes, I can see that one of the nodes is still running 1.21:
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-13-10-186.ap-northeast-1.compute.internal   Ready    <none>   4h39m   v1.22.12-eks-ba74326
ip-10-13-26-91.ap-northeast-1.compute.internal    Ready    <none>   3h3m    v1.21.14-eks-ba74326
ip-10-13-40-42.ap-northeast-1.compute.internal    Ready    <none>   4h33m   v1.22.12-eks-ba74326
If I check my managed node group, I can see that it's actually on version 1.22:
eksctl get nodegroup default-20220901053307980400000010 --cluster mycluster-dev -o yaml
- AutoScalingGroupName: eks-default-20220901053307980400000010-f8c17xc4-f750-d608-d166-113925c1g9c5
  Cluster: mycluster-dev
  CreationTime: "2022-09-01T05:33:11.484Z"
  DesiredCapacity: 3
  ImageID: AL2_x86_64
  InstanceType: t3.large
  MaxSize: 3
  MinSize: 3
  Name: default-20220901053307980400000010
  NodeInstanceRoleARN: arn:aws:iam::XXXXXXXXXXX:role/eks_node_group_dev_role-20220901052135822100000004
  StackName: ""
  Status: ACTIVE
  Type: managed
  Version: "1.22"
I can also see in the AWS Console that the version is 1.22.
I tried running the upgrade command again, to no avail.
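For reference, the upgrade command looked roughly like this (a sketch assuming eksctl, using the cluster and node group names from above; the exact invocation may have differed):

eksctl upgrade nodegroup \
  --cluster mycluster-dev \
  --name default-20220901053307980400000010 \
  --kubernetes-version 1.22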
On a previously deployed cluster, I also tried manually deleting a node from my managed group, but instead of EKS deploying a new node to replace it, I was somehow just left with 2 nodes.
My question is: how can I force the replacement of this node, in the hope that it will be launched with the correct kubelet version?
CodePudding user response:
Honestly, I sometimes see illogical behavior from EKS myself, so your case doesn't surprise me.
In my opinion, you should apply the change at the Auto Scaling group level: locate the Auto Scaling group associated with your target node group and remove the node from it manually, by detaching it first and terminating it afterwards. In my experience, changes applied to the node group sometimes take a while to be reflected at the Auto Scaling group level, and I believe the interaction between the two is occasionally affected by this.
Before detaching the node, make sure the minimum and the desired capacity are set to 3; when detaching it, AWS will ask whether you want to replace the node, so say Yes.
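A rough CLI equivalent of the detach-and-terminate flow (a sketch assuming the AWS CLI; the instance ID is a placeholder for the 1.21 node, and the ASG name is the one from your node group output):

# Detach the stale node without decrementing desired capacity,
# so the ASG launches a replacement instance in its place
aws autoscaling detach-instances \
  --auto-scaling-group-name eks-default-20220901053307980400000010-f8c17xc4-f750-d608-d166-113925c1g9c5 \
  --instance-ids i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity

# Once the replacement node has joined the cluster and is Ready,
# terminate the detached instance
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0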
Check the Launch Template version
It wasn't clear to me whether you set the instance type via the node group or via the launch template attached to it. Since you confirmed in the comments that it's the launch template, you should verify that the node group is always using its latest version so that the related changes are reflected.
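To confirm this (a sketch using the AWS CLI; the launch template ID is a placeholder), you can compare the version the node group is configured with against the template's latest version:

# Launch template name/ID and version the managed node group is using
aws eks describe-nodegroup \
  --cluster-name mycluster-dev \
  --nodegroup-name default-20220901053307980400000010 \
  --query 'nodegroup.launchTemplate'

# Latest version of the launch template itself
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 \
  --versions '$Latest'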