We upgraded our Kubernetes cluster (running on GKE) from version 1.19 to 1.21 and since then we have not been able to deploy one of our deployments. The relevant parts of the deployment are defined like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    name: my-deployment
spec:
  replicas: 2
  revisionHistoryLimit: 10
  strategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      name: "my-deployment"
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - my-deployment
                - my-other-deployment
            topologyKey: "kubernetes.io/hostname"
      nodeSelector:
        cloud.google.com/gke-nodepool: somenodepool
      ...
We're running a 5-node cluster, and "my-other-deployment" has only one replica. So before the roll-out starts, the two existing "my-deployment" pods and the single "my-other-deployment" pod occupy three nodes, leaving two nodes available to schedule the new "my-deployment" surge pod. This has worked fine for years, but after upgrading the cluster to v1.21.10-gke.2000, the rollout now fails with:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 50s (x2 over 52s) default-scheduler 0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 4 node(s) didn't match pod affinity/anti-affinity rules.
Normal NotTriggerScaleUp 50s cluster-autoscaler pod didn't trigger scale-up:
Normal Scheduled 20s default-scheduler Successfully assigned default/my-deployment-7f66984b9f-bqs8l to gke-v1-21-10-gke-2000-n1-standar-9b2c965a-lz4j
Normal Pulled 19s kubelet Container image "somerepo/something/my-deployment:589" already present on machine
Normal Created 19s kubelet Created container my-deployment
Normal Started 19s kubelet Started container my-deployment
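For reference, this is roughly how we check where the pods matched by the anti-affinity term are currently running (label selector taken from the manifest above):
# Show the pods the anti-affinity term matches and the node each one runs on
kubectl get pods -o wide -l 'name in (my-deployment,my-other-deployment)'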
What could be the reason for this and how do we fix it?
CodePudding user response:
I don't know of anything that changed around (anti-)affinity between 1.19 and 1.21. Maybe check the following (a couple of kubectl commands for these checks are sketched after the list):
- Are there other pods carrying the same name label, triggering the anti-affinity?
- Is the nodepool name correct?
- Are all the nodes in your nodepool schedulable?
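A rough sketch of how to check the last two points (the nodepool label value is taken from your manifest, <node-name> is a placeholder):
# Nodes that actually carry the nodepool label used in the nodeSelector
kubectl get nodes -l cloud.google.com/gke-nodepool=somenodepool

# Cordon status and taints that could make a node unschedulable for this pod
kubectl describe node <node-name> | grep -E 'Unschedulable|Taints'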
CodePudding user response:
The problem was that there was not enough CPU available on the remaining nodes to fulfil the pod's CPU resource request (the "1 Insufficient cpu" in the scheduler event). The way this is enforced, or at least how the scheduler reports it, has probably changed in 1.20 or 1.21, since it had not been an issue before.
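A quick way to see this is to compare each node's allocatable CPU with the CPU requests already placed on it, for example (the exact output layout may vary slightly between versions):
# Allocatable capacity and the per-node "Allocated resources" summary;
# a node whose CPU requests are close to allocatable cannot take the surge pod
kubectl describe nodes | grep -E -A 8 'Allocatable:|Allocated resources:'
Lowering the pod's CPU request or adding capacity to the node pool should then let the rollout proceed.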