I am currently using the following to attempt to spread Kubernetes pods in a given deployment across all Kubernetes nodes evenly:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - api
However, I noticed that a new attribute, topologySpreadConstraints, was recently added to Kubernetes. What's the advantage of switching from affinity.podAntiAffinity to topologySpreadConstraints in Kubernetes deployments? Any reasons to switch? What would that syntax look like to match what I am currently doing above?
CodePudding user response:
Summary
TL;DR - The advantage of switching to topologySpreadConstraints is that you can be more expressive about the topology, or structure, of your underlying infrastructure when scheduling pods. Think of it as a superset of what affinity can do; a rough equivalent of your current spec is sketched just below.
One concept is not a replacement for the other, and both can be used for different purposes. You can combine pod/node affinity with topologySpreadConstraints, and the Kubernetes scheduler will AND them when scheduling pods. In short, pod/node affinity suits flat topologies (all nodes on the same level), while topologySpreadConstraints is meant for hierarchical topologies (nodes spread across logical topology domains). When combined, the scheduler ensures that both are respected, and both can be used to meet criteria such as high availability of your applications.
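To answer the syntax question: here is a minimal sketch of a spread constraint that mirrors your current soft anti-affinity rule, assuming your pods carry the label app: api (as in your matchExpressions) and using whenUnsatisfiable: ScheduleAnyway to keep it a preference rather than a hard requirement:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
This goes in the same place as affinity, i.e. the pod template spec of the Deployment. maxSkew: 1 asks the scheduler to keep the count of matching pods within 1 between any two nodes; with ScheduleAnyway this is a scoring preference, while with DoNotSchedule it becomes a hard rule.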
Keep on reading for more details!
Affinities vs Topology Spread
With affinities you can decide which nodes your pods are scheduled onto, based on a node label (in your case kubernetes.io/hostname).
With topologySpreadConstraints you can decide how your pods are spread across nodes using a wider set of labels that define your topology domains. This is a generalisation of the plain affinity concept, in which all your nodes are, logically speaking, "on the same topology level"; at smaller scales this simplified view of pod scheduling is often all you need.
An Example
A topology domain is simply a logical unit of your infrastructure. Imagine you have a cluster with 10 nodes that are all logically on the same level; your topology is then flat, a single domain in which every node sits at the same level.
node1, node2, node3 ... node10
Now, imagine your cluster grows to 20 nodes: 10 nodes in one Availability Zone (of your cloud provider) and 10 in another AZ. Your topology domain can now be an Availability Zone, so the nodes are no longer all at the same level. Your topology has become "multi-zonal": instead of 20 nodes in one flat topology, you have 20 nodes split 10 and 10 across two topology domains (AZs).
AZ1 => node1, node2, node3 ... node10
AZ2 => node11, node12, node13 ... node20
Imagine it grows further to 40 nodes, 20 in each region, where each region has 2 AZs (10 nodes each). That is a "multi-regional" topology with two kinds of topology domains, AZs and regions. It now looks something like:
Region1: => AZ1 => node1, node2, node3 ... node10
=> AZ2 => node11, node12, node13 ... node20
Region2: => AZ1 => node21, node22, node23 ... node30
=> AZ2 => node31, node32, node33 ... node40
Now, here's an idea. When scheduling your workload pods, you would like the scheduler to be aware of the topology of your underlying infrastructure that provides your Kubernetes nodes. This can be your own data center, a cloud provider etc. This is because you would like to ensure, for instance, that:
- You get an equal number of pods across regions, so your multi-regional application has similar capacities.
- You can have an equal number of pods across AZs within a region, so AZs are not overloaded.
- You have scaling constraints where you prefer to scale an application equally across regions, and so on.
In order for the Kubernetes scheduler to be aware of this underlying topology that you have set up, you can use topologySpreadConstraints to tell the scheduler how to interpret the "list of nodes" that it sees. Remember, to the scheduler all nodes are just a flat list; there is no built-in concept of a topology. You build one by attaching labels to your nodes, whose keys you then reference as the topologyKey of a constraint. For example, you would label each node in your cluster so the Kubernetes scheduler understands what kind of underlying "topology" you have. Like,
node1 => az: 1, region: 1
...
node11 => az: 2, region: 1
...
node21 => az: 1, region: 2
...
node31 => az: 2, region: 2
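On most cloud providers these labels are set for you under the well-known keys topology.kubernetes.io/zone and topology.kubernetes.io/region; on your own hardware you label the nodes yourself. As a sketch, the metadata of one of the nodes above (say node11, with illustrative label values) might look like:
apiVersion: v1
kind: Node
metadata:
  name: node11
  labels:
    kubernetes.io/hostname: node11
    topology.kubernetes.io/region: region-1
    topology.kubernetes.io/zone: region-1-az-2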
Now each node has been configured to be part of "two" topological domains: each node is in an AZ and in a Region. So, you can start configuring your topologySpreadConstraints to make the scheduler spread pods across regions, AZs, etc. (your topology domains) and meet your requirements.
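For the api Deployment from the question, a sketch of such a multi-level constraint, assuming the well-known topology labels above (the maxSkew values and whenUnsatisfiable choices are illustrative and should be tuned to your availability requirements):
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
Multiple constraints are ANDed: here a pod must not increase the zone skew beyond 1 and should, preferably, also keep regions balanced.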
This is a very common pattern that many organisations implement with their workloads to ensure high availability of applications as they grow very large and become multi-regional, for instance. You can read more about topologySpreadConstraints in the Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/