How to avoid downtime during scheduled maintenance window


I'm experiencing downtime whenever the GKE cluster gets upgraded during the maintenance window. My services (APIs) become unreachable for roughly 5 minutes.

The cluster Location type is set to "Zonal", and all my pods have 2 replicas. The only affected pods seem to be the ones using the nginx ingress controller.

Is there anything I can do to prevent this? I read that using Regional clusters should prevent downtime in the control plane, but I'm not sure whether that's related to my case. Any hints would be appreciated!

CodePudding user response:

You mention "downtime", but is this downtime in your use of the control plane (i.e. kubectl stops working), or downtime in the sense that end users of your services stop seeing them respond?

A GKE upgrade covers two parts of the cluster: the control plane (master) and the worker nodes. These are two separate upgrades, although they can happen at the same time depending on how the cluster is configured.

Regional clusters can help with that. They cost more because you run more nodes, but the upside is that the cluster is more resilient: the control plane and the node pools are replicated across multiple zones.
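
If you do decide to go regional, the cluster has to be created as regional from the start (the location type of an existing cluster can't be changed). A minimal sketch with gcloud, where the cluster name and region are placeholders; note that --num-nodes is per zone on a regional cluster, so this gives one node in each of the region's default three zones:

```bash
# Hypothetical example: create a regional cluster (name and region are placeholders).
gcloud container clusters create my-cluster \
  --region us-central1 \
  --num-nodes 1
```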

Going back to the earlier point about control plane vs. node upgrades: the control plane upgrade does NOT affect the end-user/customer perspective. The services will remain running (on a zonal cluster you may briefly lose access to the Kubernetes API, i.e. kubectl, while the master is upgraded).

The node upgrade WILL affect the customer, so you should consider various techniques to ensure high availability and resiliency in your services. During a node upgrade, each node is cordoned and drained in turn, so its pods are evicted and rescheduled onto other nodes; the goal is to keep enough replicas serving while that happens.
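
One building block for that is a PodDisruptionBudget, which GKE respects while draining a node during the upgrade. A minimal sketch, assuming your API pods carry an app: my-api label (the name and selector are placeholders for your own Deployment):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb            # hypothetical name
spec:
  minAvailable: 1             # keep at least one replica serving while a node is drained
  selector:
    matchLabels:
      app: my-api             # must match the labels on your API pods
```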

A common technique is to increase the number of replicas and to add pod anti-affinity. This ensures the replicas are scheduled on different nodes, so when the node upgrade comes around it doesn't take the entire service out just because the cluster happened to schedule all the replicas on the same node; a sketch follows below.
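
Here is a hedged sketch of what that can look like on a Deployment (all names, labels and the image are placeholders; preferredDuringSchedulingIgnoredDuringExecution is used so scheduling still succeeds if there are temporarily fewer nodes than replicas):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                 # hypothetical Deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      affinity:
        podAntiAffinity:
          # Prefer not to place two replicas of this app on the same node.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: my-api
      containers:
      - name: api
        image: gcr.io/my-project/my-api:latest   # placeholder image
        ports:
        - containerPort: 8080
```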

You mention the nginx ingress controller in your question. If you are using Helm to install it into your cluster, then out of the box it is not set up to use anti-affinity, so it is liable to be taken out of service if all of its replicas get scheduled onto the same node and that node then gets marked for upgrade; see the values sketch below.
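
As a sketch of how you might harden it, assuming the community ingress-nginx chart (the exact value keys can differ between chart versions, so check them with helm show values):

```yaml
# values.yaml (sketch; keys follow the community ingress-nginx chart)
controller:
  replicaCount: 2
  # PodDisruptionBudget: keep at least one controller pod up while nodes drain.
  minAvailable: 1
  # Prefer spreading the controller replicas across different nodes.
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
              app.kubernetes.io/component: controller
```

You would then apply it with something like helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx -f values.yaml.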
