I'm upgrading some AKS clusters for an app and have been testing the --max-surge flag of az aks nodepool upgrade to speed up the process. Our prod environment has 50 nodes, and at the per-node speed I have seen on our lower environments, I estimate prod will take 9 hours to complete. On one of the lower-environment upgrades I ran with a max surge of 50%, which helped a little with speed, and all deployments kept at least 50% of their pods available.
For this latest upgrade I tried a max surge of 100%. That spun up 6 new nodes (the pool currently has 6 nodes) on the correct version, but then it migrated every deployment/pod at the same time and took everything down to 0/2 pods. Before I started, I made sure every single deployment had a pod disruption budget with min available set to 50%. This has worked on all of my other upgrades except this one, which to me means the 100% surge is the cause.
I just can't figure out why my minimum available percentage was ignored. Below are the descriptions of an example PDB, and the corresponding deployment.
Pod disruption budget:
Name: myapp-admin
Namespace: front-svc
Min available: 50%
Selector: role=admin
Status:
    Allowed disruptions: 1
    Current: 2
    Desired: 1
    Total: 2
Events:
Deployment (snippet):
Name: myapp-admin
Namespace: front-svc
CreationTimestamp: Wed, 26 May 2021 16:17:00 -0500
Labels: <none>
Annotations: deployment.kubernetes.io/revision: 104
Selector: agency=myorg,app=myapp,env=uat,organization=myorg,role=admin
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 15
RollingUpdateStrategy: 25% max unavailable, 1 max surge
Pod Template:
  Labels: agency=myorg
          app=myapp
          buildnumber=1234
          env=uat
          organization=myorg
          role=admin
  Annotations: kubectl.kubernetes.io/restartedAt: 2022-03-12T09:00:11Z
  Containers:
   myapp-admin-ctr:
Is there something obvious I am doing wrong here?
CodePudding user response:
... a max surge value of 100% provides the fastest possible upgrade process (doubling the node count) but also causes all nodes in the node pool to be drained simultaneously.
That is from the official documentation. You may want to consider lowering your max surge.
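
To get a feel for what a lower value would mean, here is a rough Python sketch of the surge arithmetic. It assumes AKS rounds a fractional percentage up to a whole node, and that the number of nodes drained together matches the surge count, as the quoted sentence implies. The function name surge_nodes is made up for this example and is not part of any Azure tooling.

```python
def surge_nodes(node_count: int, max_surge: str) -> int:
    """Estimate how many extra nodes a max-surge setting adds to a pool.

    Hypothetical helper for illustration only. Assumes a percentage is
    taken of the current node count and rounded up to a whole node.
    """
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        # Integer ceiling division, so no floating-point rounding is involved.
        return (node_count * percent + 99) // 100
    # A plain integer is treated as an absolute node count.
    return int(max_surge)


if __name__ == "__main__":
    # Compare a few settings for the 6-node pool and the 50-node prod pool.
    for pool_size in (6, 50):
        for setting in ("100%", "50%", "33%", "1"):
            extra = surge_nodes(pool_size, setting)
            print(f"{pool_size} nodes, max surge {setting}: {extra} surge node(s)")
```

On the 6-node pool, 100% gives 6 surge nodes (the whole pool at once), while 33% gives 2, so under these assumptions only a third of the pool would be drained at any one time.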