Why does readiness probe start late?


I have a pod with the following probes defined:

Liveness:   exec [/liveness-probe-script] delay=0s timeout=10s period=30s #success=1 #failure=3
Readiness:  exec [/readiness-probe-script] delay=0s timeout=10s period=30s #success=1 #failure=3
Startup:    exec [/startup-probe-script] delay=0s timeout=140s period=10s #success=1 #failure=1
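
For reference, in the pod spec these probes are defined roughly like this (only the probe fields are shown; the values mirror the describe output above):

    startupProbe:
      exec:
        command: ["/startup-probe-script"]
      timeoutSeconds: 140
      periodSeconds: 10
      failureThreshold: 1
    readinessProbe:
      exec:
        command: ["/readiness-probe-script"]
      timeoutSeconds: 10
      periodSeconds: 30
      failureThreshold: 3
    livenessProbe:
      exec:
        command: ["/liveness-probe-script"]
      timeoutSeconds: 10
      periodSeconds: 30
      failureThreshold: 3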

I added some logging to the scripts:

startup probe starting at: 14:57:34
finished at: 14:57:37

readiness probe starting at: 14:57:47
finished at: 14:57:47

liveness probe starting at: 14:57:48
finished at: 14:57:48

As you can see, the readiness and liveness probes start about 10s after the startup probe has already finished.

Can something be done about this? Is this normal behaviour?

EDIT:

For clarification, I understand that the liveness/readiness probes only start AFTER the startup probe has succeeded, plus any initialDelaySeconds if configured. My question is: why is there a roughly 10s gap, sometimes more, sometimes less, between the startup probe finishing at 14:57:37 and the readiness probe starting at 14:57:47, when no delays are configured? Why does the readiness probe not start at 14:57:38, immediately after the startup probe?

CodePudding user response:

By definition, readiness and liveness probes take place after the container has started. This means after the startup probe is finished.

At the beginning of the doc:

The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup.

Then at the initialDelaySeconds parameter definition:

initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.

If you want to avoid this behaviour, you may be able to do so by not using a startup probe at all, provided your use case doesn't actually need one.

CodePudding user response:

I tried to reproduce your issue. I used OpenShift 4.10.3, which is K8s 1.23.3. I built a simple service that responds to a few simple endpoints, which served as the probe endpoints.

My pod looks like this:

    Liveness:     http-get http://:8080/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:8080/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Startup:      http-get http://:8080/startz delay=1s timeout=1s period=1s #success=1 #failure=30

Note that, unlike you, I added a 1s delay on the startup probe and changed my startup period to 1s. I also used a timeout of 1s for all probes, which is the default. More on this in a second.
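
In spec form, those probes correspond roughly to the following (the liveness probe is analogous, pointing at /livez):

    startupProbe:
      httpGet:
        path: /startz
        port: 8080
      initialDelaySeconds: 1
      timeoutSeconds: 1
      periodSeconds: 1
      failureThreshold: 30
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8080
      timeoutSeconds: 1
      periodSeconds: 10
      failureThreshold: 3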

My application logs look like this:

2022-03-23 16:23:24,008 INFO (executor-thread-0) Startup probe
2022-03-23 16:23:24,222 INFO (executor-thread-0) Ready probe
2022-03-23 16:23:31,860 INFO  (executor-thread-0) Liveness probe

So we see that my readiness probe is called roughly 200 ms after the startup probe. There was some variation here, but it was always sub-second.

I did try with your exact parameters, but a 0 second delay almost always got a connection-refused failure the first time, because even though it was a tiny service, it still wasn't up fast enough when the startup probe was called immediately, which caused a 10 second backoff. That's why I changed to a 1 second startup delay and a 1 second period.

So, to answer your direct question: no, this isn't normal. But you don't really show any of your pod information (such as the events), nor do you list what you have already tried in order to troubleshoot this, so I'm not sure I can tell you what the problem is.

I have a few possibilities.

  1. Your 0s startup probe is failing, and the 10s delay you are seeing before the readiness probe is related to the timeout for that probe. Your application logs dispute this, but I find the 10s delay just too coincidental given that all of your probes have 10s timeouts. I'd get the events for the pods and see whether you are seeing probe failures. I'd also change that timeout to 1s on general principles, unless you have a reason to do otherwise.

  2. As perhaps a related issue, you have defined your probes as exec scripts rather than directly as http-get in K8s, which makes me suspicious that something weird is going on in the scripts. Again, because the timeouts seem so coincidental. I'd try calling the HTTP endpoints directly and see if that resolves the issue (see the sketch after this list).

  3. You say that the gap is "sometimes more, sometimes less". I suppose the problem could just be an overloaded/unhealthy kubelet that isn't scheduling probes promptly. That seems unlikely, because you would have bigger problems if you had 10s kubelet delays, but getting the kubelet logs could be helpful.
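
To make point 2 concrete, here is a rough sketch of a readiness probe defined directly against an HTTP endpoint, with the shorter timeout from point 1. The /readyz path and port 8080 are placeholders for whatever endpoint your script actually checks:

    readinessProbe:
      # instead of exec: command: ["/readiness-probe-script"]
      httpGet:
        path: /readyz     # placeholder; use the endpoint your script hits
        port: 8080        # placeholder; use your service port
      timeoutSeconds: 1
      periodSeconds: 30
      failureThreshold: 3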

Regardless, I'd definitely consider adjusting your probe timeouts and delays. I'd also consider removing the startup probe. Startup probes were really designed for legacy applications with weird startup states and long initialization phases. For your application, is there really a startup state where the other probes are not meaningful?
