We upgraded our Kubernetes cluster (running on GKE) from version 1.19 to 1.21 and since then we have not been able to deploy one of our deployments. The relevant parts of the deployment are defined like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    name: my-deployment
spec:
  replicas: 2
  revisionHistoryLimit: 10
  strategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      name: "my-deployment"
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - my-deployment
                - my-other-deployment
            topologyKey: "kubernetes.io/hostname"
      nodeSelector:
        cloud.google.com/gke-nodepool: somenodepool
      ...
We're running a 5-node cluster, and "my-other-deployment" has only one replica. So before the roll-out starts, the two existing "my-deployment" pods and the single "my-other-deployment" pod occupy three nodes, leaving two nodes available to schedule the new "my-deployment" surge pod. This has worked fine for years, but after upgrading the cluster to v1.21.10-gke.2000, the rollout now fails with:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 50s (x2 over 52s) default-scheduler 0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 4 node(s) didn't match pod affinity/anti-affinity rules.
Normal NotTriggerScaleUp 50s cluster-autoscaler pod didn't trigger scale-up:
Normal Scheduled 20s default-scheduler Successfully assigned default/my-deployment-7f66984b9f-bqs8l to gke-v1-21-10-gke-2000-n1-standar-9b2c965a-lz4j
Normal Pulled 19s kubelet Container image "somerepo/something/my-deployment:589" already present on machine
Normal Created 19s kubelet Created container my-deployment
Normal Started 19s kubelet Started container my-deployment
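For reference, this is roughly how we check where the pods matched by the anti-affinity term are currently running (label selector taken from the manifest above):
# Show the pods the anti-affinity term matches and the node each one runs on
kubectl get pods -o wide -l 'name in (my-deployment,my-other-deployment)'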
What could be the reason for this and how do we fix it?
CodePudding user response:
I don't know of anything that changed around (anti-)affinity between 1.19 and 1.21. Maybe check the following (a couple of kubectl commands for these checks are sketched after the list):
- Are there other pods carrying the same name label, triggering the anti-affinity?
- Is the nodepool name correct?
- Are all the nodes in your nodepool schedulable?
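A rough sketch of how to check the last two points (the nodepool label value is taken from your manifest, <node-name> is a placeholder):
# Nodes that actually carry the nodepool label used in the nodeSelector
kubectl get nodes -l cloud.google.com/gke-nodepool=somenodepool

# Cordon status and taints that could make a node unschedulable for this pod
kubectl describe node <node-name> | grep -E 'Unschedulable|Taints'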
CodePudding user response:
The problem was that there was not enough CPU available on the remaining nodes to fulfil the pod's CPU resource request (the "1 Insufficient cpu" in the scheduler event). The way this is enforced, or at least how the scheduler reports it, has probably changed in 1.20 or 1.21, since it had not been an issue before.
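A quick way to see this is to compare each node's allocatable CPU with the CPU requests already placed on it, for example (the exact output layout may vary slightly between versions):
# Allocatable capacity and the per-node "Allocated resources" summary;
# a node whose CPU requests are close to allocatable cannot take the surge pod
kubectl describe nodes | grep -E -A 8 'Allocatable:|Allocated resources:'
Lowering the pod's CPU request or adding capacity to the node pool should then let the rollout proceed.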