I have a GKE cluster with autoscaling pods, using Autopilot. Suddenly they stopped autoscaling. I'm new to Kubernetes and I don't know exactly what to do or which output I'm supposed to paste here to get help.
The pods are marked Unschedulable and sit in the Pending state instead of Running, and I can't exec into them or interact with them.
I also can't delete or stop them from the GCP Console. I don't think memory or CPU is the problem, because there isn't much running on the cluster.
The cluster was working as expected before this issue appeared.
Here is the kubectl describe output for one of the Pending pods:
Namespace:        default
Priority:         0
Node:             <none>
Labels:           app=odoo-service
                  pod-template-hash=5bd88899d7
Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/odoo-cluster-dev-5bd88899d7
Containers:
  odoo-service:
    Image:      us-central1-docker.pkg.dev/adams-dev/adams-odoo/odoo-service:v58
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                2
      ephemeral-storage:  1Gi
      memory:             8Gi
    Requests:
      cpu:                2
      ephemeral-storage:  1Gi
      memory:             8Gi
    Environment:
      ODOO_HTTP_SOCKET_TIMEOUT:  30
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zqh5r (ro)
  cloud-sql-proxy:
    Image:      gcr.io/cloudsql-docker/gce-proxy:1.17
    Port:       <none>
    Host Port:  <none>
    Command:
      /cloud_sql_proxy
      -instances=adams-dev:us-central1:odoo-test=tcp:5432
    Limits:
      cpu:                1
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                1
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zqh5r (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-zqh5r:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 28m (x248 over 3h53m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 in backoff after failed scale-up, 2 Insufficient cpu, 2 Insufficient memory
Normal NotTriggerScaleUp 8m1s (x261 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 4 in backoff after failed scale-up, 2 Insufficient cpu
Normal NotTriggerScaleUp 3m (x1646 over 3h56m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 2 Insufficient memory, 4 in backoff after failed scale-up
Warning FailedScheduling 20s (x168 over 3h56m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 28m (x250 over 3h56m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 4 in backoff after failed scale-up, 2 Insufficient cpu
Normal NotTriggerScaleUp 8m2s (x300 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 in backoff after failed scale-up, 2 Insufficient cpu, 2 Insufficient memory
Warning FailedScheduling 5m21s (x164 over 3h56m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Normal NotTriggerScaleUp 3m1s (x1616 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 2 Insufficient memory, 4 in backoff after failed scale-up
I don't know how far I can debug this or how to fix it.
CodePudding user response:
The pods fail to schedule because none of the existing nodes has enough free CPU or memory for their requests: the scheduler reports "0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory", and each pod requests 3 CPU and 10Gi of memory in total across its two containers.
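To confirm this, you can compare what the nodes can allocate with what is already reserved on them. A quick sketch using standard kubectl commands:

# How much CPU/memory each node can offer to pods
kubectl describe nodes | grep -A 7 Allocatable:
# How much is already requested/limited by the pods running on each node
kubectl describe nodes | grep -A 10 "Allocated resources:"

If the pod's requests (3 CPU / 10Gi here) don't fit into the remaining allocatable capacity of any node, it stays Pending until the autoscaler adds a node.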
The cluster autoscaler did try to scale up, but it is backing off after failed scale-up attempts ("4 in backoff after failed scale-up"), which points to a problem scaling up the managed instance groups that back the node pool. The usual cause is that a quota limit has been reached, so no new nodes can be added.
Keep in mind that with Autopilot you can't see the GKE-managed VMs, but they still count against your project's Compute Engine quota.
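You can check the regional quotas (current usage vs. limit) from the command line; the project ID and region below are taken from your pod description, so adjust them if they differ:

# Lists CPU, memory and other quotas for the region, with usage and limits
gcloud compute regions describe us-central1 --project adams-dev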
Try creating the Autopilot cluster in another region. If an Autopilot cluster no longer fulfills your needs, go for a Standard cluster instead.
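If you want to try the first option, a minimal sketch of creating a new Autopilot cluster in a different region (the cluster name and region here are placeholders, not taken from your setup):

# Hypothetical cluster name and region; replace with your own values
gcloud container clusters create-auto odoo-cluster-dev-2 --region us-east1 --project adams-dev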