How to scale up Kubernetes cluster with Terraform avoiding downtime?-CodePudding

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.

Our backend APIs app is running on three pods, one on each node. I was told I will have need to increase resources soon (I'm thinking more memory or even replacing the VMs of the nodes with better ones).

We structured everything Kubernetes related using Terraform and I know that replacing VMs in a node is a destructive action, meaning the cluster will have to be replaces, new config and all deployments, services and etc will have to be reapplied.

I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running but I would like to learn what is the best practice when it comes to scaling and performance. How can I perform such increase in resources without having any downtime of our services?

I'm wondering if having an extra Nodepool would help while I replace the VM's of the other one (I might be absolutely wrong here)

If there's any link, course, tutorial you can point me to it's highly appreciated.

CodePudding user response：

(Moved from comments)

In Azure, when you're performing cluster upgrade, there's a parameter called "max surge count" which is equal to 1 by default. What it means is when you update your cluster or node configuration, it will first create one extra node with the updated configuration - and only then it will safely drain and remove one of old ones. More on this here: Azure - Node Surge Upgrade