I am currently using the following to attempt to spread Kubernetes pods in a given deployment across all Kubernetes nodes evenly:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - api
However, I noticed that a new attribute, topologySpreadConstraints, was recently added to Kubernetes. What's the advantage of switching from affinity.podAntiAffinity to topologySpreadConstraints in Kubernetes deployments? Any reasons to switch? What would that syntax look like to match what I am currently doing above?
CodePudding user response:
Summary
TL;DR - The advantage of switching to topologySpreadConstraints is that you can be more expressive about the topology, or structure, of your underlying infrastructure when scheduling pods. Think of it as a superset of what affinity can do; a rough equivalent of your current spec is sketched just below.
One concept is not a replacement for the other, and both can be used for different purposes. You can combine pod/node affinity with topologySpreadConstraints, and the Kubernetes scheduler will AND them when scheduling pods. In short, pod/node affinity suits flat topologies (all nodes on the same level), while topologySpreadConstraints is meant for hierarchical topologies (nodes spread across logical topology domains). When combined, the scheduler ensures that both are respected, and both can be used to meet criteria such as high availability of your applications.
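To answer the syntax question: here is a minimal sketch of a spread constraint that mirrors your current soft anti-affinity rule, assuming your pods carry the label app: api (as in your matchExpressions) and using whenUnsatisfiable: ScheduleAnyway to keep it a preference rather than a hard requirement:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
This goes in the same place as affinity, i.e. the pod template spec of the Deployment. maxSkew: 1 asks the scheduler to keep the count of matching pods within 1 between any two nodes; with ScheduleAnyway this is a scoring preference, while with DoNotSchedule it becomes a hard rule.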
Keep on reading for more details!
Affinities vs Topology Spread
With affinities you can decide which nodes your pods are scheduled onto, based on a node label (in your case kubernetes.io/hostname).
With topologySpreadConstraints you can decide how your pods are spread across nodes using a wider set of labels that define your topology domains. This is a generalisation of the plain affinity concept, in which all your nodes are, logically speaking, "on the same topology level"; at smaller scales this simplified view of pod scheduling is often all you need.
An Example
A topology domain is simply a logical unit of your infrastructure. Imagine you have a cluster with 10 nodes that are all logically on the same level; your topology is then flat, a single domain in which every node sits at the same level.
node1, node2, node3 ... node10
Now, imagine your cluster grows to 20 nodes: 10 nodes in one Availability Zone (of your cloud provider) and 10 in another AZ. Your topology domain can now be an Availability Zone, so the nodes are no longer all at the same level. Your topology has become "multi-zonal": instead of 20 nodes in one flat topology, you have 20 nodes split 10 and 10 across two topology domains (AZs).
AZ1 => node1, node2, node3 ... node10
AZ2 => node11, node12, node13 ... node20
Imagine it grows further to 40 nodes, 20 in each region, where each region has 2 AZs (10 nodes each). That is a "multi-regional" topology with two kinds of topology domains, AZs and regions. It now looks something like:
Region1: => AZ1 => node1, node2, node3 ... node10
=> AZ2 => node11, node12, node13 ... node20
Region2: => AZ1 => node21, node22, node23 ... node30
=> AZ2 => node31, node32, node33 ... node40
Now, here's an idea. When scheduling your workload pods, you would like the scheduler to be aware of the topology of your underlying infrastructure that provides your Kubernetes nodes. This can be your own data center, a cloud provider etc. This is because you would like to ensure, for instance, that:
- You get an equal number of pods across regions, so your multi-regional application has similar capacities.
- You can have an equal number of pods across AZs within a region, so AZs are not overloaded.
- You have scaling constraints where you prefer to scale an application equally across regions, and so on.
In order for the Kubernetes scheduler to be aware of this underlying topology that you have set up, you can use topologySpreadConstraints to tell the scheduler how to interpret the "list of nodes" that it sees. Remember, to the scheduler all nodes are just a flat list; there is no built-in concept of a topology. You build one by attaching labels to your nodes, whose keys you then reference as the topologyKey of a constraint. For example, you would label each node in your cluster so the Kubernetes scheduler understands what kind of underlying "topology" you have. Like,
node1 => az: 1, region: 1
...
node11 => az: 2, region: 1
...
node21 => az: 1, region: 2
...
node31 => az: 2, region: 2
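On most cloud providers these labels are set for you under the well-known keys topology.kubernetes.io/zone and topology.kubernetes.io/region; on your own hardware you label the nodes yourself. As a sketch, the metadata of one of the nodes above (say node11, with illustrative label values) might look like:
apiVersion: v1
kind: Node
metadata:
  name: node11
  labels:
    kubernetes.io/hostname: node11
    topology.kubernetes.io/region: region-1
    topology.kubernetes.io/zone: region-1-az-2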
Now each node has been configured to be part of "two" topological domains: each node is in an AZ and in a Region. So, you can start configuring your topologySpreadConstraints to make the scheduler spread pods across regions, AZs, etc. (your topology domains) and meet your requirements.
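For the api Deployment from the question, a sketch of such a multi-level constraint, assuming the well-known topology labels above (the maxSkew values and whenUnsatisfiable choices are illustrative and should be tuned to your availability requirements):
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
Multiple constraints are ANDed: here a pod must not increase the zone skew beyond 1 and should, preferably, also keep regions balanced.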
This is a very common pattern that many organisations implement with their workloads to ensure high availability of applications as they grow very large and become multi-regional, for instance. You can read more about topologySpreadConstraints in the Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/