systemd watchdog for frozen application

I was checking how the watchdog is implemented in systemd and looked at some examples. In those examples, I couldn't see the application sending any feedback to systemd to show that it is alive.

How will systemd know that the application is frozen, and restart it? In this case the application has not crashed: the process is still there, but frozen.

CodePudding user response：

Setting up the watchdog in a service file consists of setting the WatchdogSec key to a refresh period. For example, the udev service defines a watchdog timeout of 3 minutes:

$ cat /usr/lib/systemd/system/udev.service | grep -Ei "(watchdog|execstart)"
ExecStart=/lib/systemd/systemd-udevd
WatchdogSec=3min

When this key is set, the application is expected to send at least one watchdog refresh message to systemd every 3 minutes. This is done with systemd's sd_notify() API, passing it the string "WATCHDOG=1".
Hence, systemd will kill an application that does not send its refresh message in time. It may then restart it, depending on the other configuration keys in the service file, as described in the manual:

WatchdogSec=
Configures the watchdog timeout for a service. The watchdog is activated when the start-up is completed. The service must call sd_notify(3) regularly with "WATCHDOG=1" (i.e. the "keep-alive ping"). If the time between two such calls is larger than the configured time, then the service is placed in a failed state and it will be terminated with SIGABRT (or the signal specified by WatchdogSignal=). By setting Restart= to on-failure, on-watchdog, on-abnormal or always, the service will be automatically restarted. The time configured here will be passed to the executed service process in the WATCHDOG_USEC= environment variable. This allows daemons to automatically enable the keep-alive pinging logic if watchdog support is enabled for the service. If this option is used, NotifyAccess= (see below) should be set to open access to the notification socket provided by systemd. If NotifyAccess= is not set, it will be implicitly set to main. Defaults to 0, which disables this feature. The service can check whether the service manager expects watchdog keep-alive notifications. See sd_watchdog_enabled(3) for details. sd_event_set_watchdog(3) may be used to enable automatic watchdog notification support.
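
To make the keep-alive ping concrete, here is a minimal Python sketch. It does not call sd_notify(3) itself; as a stand-in it writes the same "WATCHDOG=1" datagram to the Unix socket that systemd names in the NOTIFY_SOCKET environment variable, which is my understanding of what sd_notify() does internally. The helper names (sd_notify, watchdog_interval, run) and the max_steps parameter are made up for this illustration, and pinging after every completed step is just one possible design.

```python
import os
import socket
from typing import Callable, Optional


def sd_notify(state: str) -> bool:
    """Send a notification string to systemd. Return False when not supervised."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd, or notifications not enabled
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract socket (leading NUL byte).
        address = b"\0" + path[1:].encode()
    else:
        address = path.encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


def watchdog_interval() -> Optional[float]:
    """Return half of the configured timeout in seconds, or None if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def run(step: Callable[[], None], max_steps: Optional[int] = None) -> int:
    """Call step() repeatedly and ping the watchdog after each completed step.

    If step() freezes, no further ping is sent and the timeout expires.
    max_steps is a hypothetical knob so the loop can be bounded in tests.
    """
    done = 0
    while max_steps is None or done < max_steps:
        step()
        sd_notify("WATCHDOG=1")
        done += 1
    return done
```

Because the ping comes from the same loop that does the real work, a frozen application stops pinging even though its process still exists, and that is how systemd notices the freeze. This assumes each step finishes well within WatchdogSec=; watchdog_interval() gives a value a daemon could use to pace its pings, and Restart= must be set as described in the manual excerpt for the service to come back.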

Tip: The watchdog refresh messages sent to systemd can be observed with a tool like strace.
systemd is normally process 1:

$ ls -l /sbin/init
lrwxrwxrwx 1 root root 20 avril  21 14:54 /sbin/init -> /lib/systemd/systemd
$ pidof init      
1

Here is how to watch the refresh messages sent by the systemd-udevd process, by grepping the messages received by systemd for the PID of the application process (the -tt option adds timestamps):

$ sudo strace -tt -p1 2>&1 | grep pid=`pidof systemd-udevd`
09:01:17.348636 recvmsg(16, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="WATCHDOG=1", iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=328, uid=0, gid=0}}], msg_controllen=32, msg_flags=MSG_CMSG_CLOEXEC}, MSG_TRUNC|MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 10
09:02:53.345089 recvmsg(16, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="WATCHDOG=1", iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=328, uid=0, gid=0}}], msg_controllen=32, msg_flags=MSG_CMSG_CLOEXEC}, MSG_TRUNC|MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 10
[...]

CodePudding user response：

By default (when WatchdogSec= is not set), systemd only checks whether the PID still exists. To my knowledge, there is no standard way to implement "custom watchdog checks" that verify, for example, that a process is not stuck on a mutex. You can, for instance, implement your own watchdog as a separate systemd unit that performs a custom check on the supervised process.
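
As a rough illustration of such a separate check, here is a sketch that could be run periodically from its own unit. Everything specific in it is an assumption: it supposes the supervised process touches a heartbeat file whenever it makes progress, and the function names, the file path, and the unit name passed in are hypothetical. The only external command used is systemctl restart.

```python
import os
import subprocess
import time
from typing import Optional


def heartbeat_is_stale(path: str, max_age: float,
                       now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file is missing or older than max_age seconds."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True  # no heartbeat at all counts as stale
    if now is None:
        now = time.time()
    return now - mtime > max_age


def check_and_restart(path: str, max_age: float, unit: str) -> bool:
    """Restart the given unit when its heartbeat is stale. Return True if restarted."""
    if heartbeat_is_stale(path, max_age):
        # Assumes the checker runs with enough privileges to restart the unit.
        subprocess.run(["systemctl", "restart", unit], check=True)
        return True
    return False
```

A call such as check_and_restart("/run/myapp/heartbeat", 60, "myapp.service") would then restart the hypothetical myapp.service when the file has not been touched for more than a minute.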