pod readinessprobe issue with database and container


I have an application deployed to Kubernetes. Here is the tech stack: Java 11, Spring Boot 2.3.x or 2.5.x, HikariCP 3.x or 4.x.

I'm using Spring Boot Actuator for health checks. Here is the liveness and readiness configuration in application.yaml:

management:
  endpoint:
    health:
      group:
        liveness:
          include: '*'
          exclude:
            - db
            - readinessState
        readiness:
          include: '*'

What this configuration is intended to do if the DB is down:

  1. The liveness probe is not impacted, so the application container keeps running even during a DB outage.
  2. The readiness probe is impacted, so no traffic is routed to the container.
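In other words, while the DB is down I expect the two Actuator endpoints to behave roughly like this (expected behaviour of the health groups, not captured output):

# Expected behaviour while the DB is down (illustrative, not captured output):
# GET /actuator/health/liveness   -> HTTP 200, {"status":"UP"}    (db excluded from this group)
# GET /actuator/health/readiness  -> HTTP 503, {"status":"DOWN"}  (db included via '*')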

Liveness and readiness probe configuration in the container spec:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 20
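
For reference, here is the same readinessProbe with the fields I did not set filled in with their Kubernetes defaults (the last two values are the documented defaults, not something I configured explicitly):

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 20
  successThreshold: 1    # default
  failureThreshold: 3    # default: 3 consecutive failures mark the pod NotReady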

My application started up and ran fine for a few hours.

What I did:

I brought the DB down.

Issue Noticed:

When the DB is down, after about 90 seconds I see 3 more pods being spun up. When I describe one of these pods, I see status and conditions like below:

Status:       Running
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

When I list all running pods:

NAME                                                  READY   STATUS    RESTARTS   AGE
application-a-dev-deployment-success-5d86b4bcf4-7lsqx    0/1     Running   0          6h48m
application-a-dev-deployment-success-5d86b4bcf4-cmwd7    0/1     Running   0          49m
application-a-dev-deployment-success-5d86b4bcf4-flf7r    0/1     Running   0          48m
application-a-dev-deployment-success-5d86b4bcf4-m5nk7    0/1     Running   0          6h48m
application-a-dev-deployment-success-5d86b4bcf4-tx4rl    0/1     Running   0          49m

My Analysis/Findings:

Per the readinessProbe configuration, periodSeconds is set to 30 seconds and failureThreshold defaults to 3 per the Kubernetes documentation.

Per application.yaml, the readiness group includes the db check, meaning the readiness check fails every 30 seconds once the DB is down. When it fails 3 times, failureThreshold is met and new pods get spun up.

The restart policy defaults to Always.
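
Putting rough numbers on that (assuming the first failed check happens on the next scheduled probe after the DB goes down):

# periodSeconds = 30, failureThreshold = 3 (default), successThreshold = 1 (default)
# t ≈ 0s    DB goes down
# t ≈ 30s   1st failed readiness check
# t ≈ 60s   2nd failed readiness check
# t ≈ 90s   3rd failed readiness check -> failureThreshold met -> pod condition Ready goes False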

Questions:

  1. Why did it spin up new pods?
  2. Why did it spin up exactly 3 pods, and not 1, 2, 4, or any other number?
  3. Does this have anything to do with restartPolicy?

CodePudding user response:

  1. As you answered yourself, it spun up new pods after 3 failed tries, according to failureThreshold. You can change your restartPolicy to OnFailure, which restarts the container only when it fails, or to Never if you don't want it restarted at all. The difference between the policies is described here. Note this:

The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.

  2. Share your full Deployment file; I suppose you've set the replicas number to 3 (see the Deployment sketch after this list).

  3. Answered in the answer to the 1st question.
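
Purely as an illustration of point 2, a Deployment with replicas: 3 looks like this (the names and image below are hypothetical, not taken from your cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-a-dev-deployment-success   # hypothetical, matches the pod name prefix above
spec:
  replicas: 3                                   # the ReplicaSet will always try to keep 3 pods running
  selector:
    matchLabels:
      app: application-a
  template:
    metadata:
      labels:
        app: application-a
    spec:
      containers:
        - name: application-a
          image: example/application-a:latest   # hypothetical image
          ports:
            - containerPort: 8443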

Also note this, in case it helps you:

Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.

If your container usually starts in more than initialDelaySeconds + failureThreshold × periodSeconds, you should specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10s. You should then set its failureThreshold high enough to allow the container to start, without changing the default values of the liveness probe. This helps to protect against deadlocks.
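
Applied to your manifest, such a startup probe could look roughly like this (just a sketch; the failureThreshold value is an arbitrary example you would tune to your real startup time):

startupProbe:
  httpGet:
    path: /actuator/health/liveness   # same endpoint as the liveness probe
    port: 8443
    scheme: HTTPS
  periodSeconds: 10                   # the default period
  failureThreshold: 30                # example value: allows up to ~300s for startup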

CodePudding user response:

The crux lay in the HPA. CPU utilization of the pod jumped up after the readiness failure, and as it went above 70% the HPA was triggered and started those 3 pods.
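
For completeness, an HPA with a 70% CPU target looks roughly like this (the name and the min/max replica counts below are assumptions for illustration, not copied from the actual cluster):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-a-dev-hpa                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-a-dev-deployment-success
  minReplicas: 2                                  # assumption: 2 pods were running before the outage
  maxReplicas: 5                                  # assumption: matches the 5 pods listed above
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70                  # scale out when average CPU goes above 70%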
