Reaping children in a pre-forking server-CodePudding

In the Programming Language Examples Alike Cookbook's chapter on Sockets, the "Pre-Forking Servers" section uses a SIGCHLD handler like this:

module PidSet = Set.Make(struct type t = int let compare = compare end)
let children = ref PidSet.empty

(* takes care of dead children *)
let rec reaper _ =
  Sys.set_signal Sys.sigchld (Sys.Signal_handle reaper);
  match Unix.wait ()
  with (pid, _) -> children := PidSet.remove pid !children

(* ... *)

let () =
  (* ... *)
  Sys.set_signal Sys.sigchld (Sys.Signal_handle reaper);
  (* ... *)

The reaper function looks wrong, but I am not really sure. I am concerned about the case when two or more children happen to exit near-simultaneously. The reaper function might only run once because signals are not queued in Unix. The result is that only one child is reaped, while the other children remain zombies.

I looked at an alternative reaper function from the "Forking Servers" section:

let rec reaper signal =
  try while true do ignore (Unix.waitpid [Unix.WNOHANG] (-1)) done
  with Unix.Unix_error (Unix.ECHILD, _, _) -> ();
  Sys.set_signal Sys.sigchld (Sys.Signal_handle reaper)

However, this also seems wrong because Unix.waitpid [Unix.WNOHANG] (-1) might never cause any Unix.Unix_error, resulting in an a huge number of loops where Unix.waitpid [Unix.WNOHANG] (-1) always returns a pid value of 0.

This is my attempt at writing a correct SIGCHLD handler for reaping all terminated children for the "Pre-Forking Servers" section:

let rec reaper _ =
  try
    while true do
      let (pid, _) = Unix.waitpid [Unix.WNOHANG] (-1) in
      if pid > 0 then
        children := PidSet.remove pid !children
      else
        raise Not_found  (* Exit loop. *)
    done
  with Not_found -> ();
  Sys.set_signal Sys.sigchld (Sys.Signal_handle reaper)

Is this correct?

CodePudding user response：

Here's the version that I am using for many years. This function is set as the Sys.sigchld handler with Sys.set_signal Sys.sigchld (Sys.Signal_handle reap).

let rec reap n = match Unix.waitpid [Unix.WNOHANG] (-1) with
  | 0,_ -> ()
  | _   -> reap n
  | exception Unix.Unix_error (Unix.ECHILD,_,_) -> ()
  | exception _ -> (* log it *) ()

Note, that the function calls recursively itself when waitpid returns a child up until it exhausts all finished children, and then it waits again for the signal.

The ECHILD exception is hushed, as it commonly happens when you try to ripe an already reaped child (which happens in OCaml, see the P.P.S note below). In the real application, I log the rest of the exceptions, just to be sure that the system behaves as I expect (various Unices have different behavior, the code above is thoroughly tested on Linux and macOS (Darwin) only).

Concerning your implementation, it is quite similar, except that you're not explicitly capturing exceptions, so they could arise in the random parts of your code (signal handlers are called asynchronously during memory allocations, so it is better not to allow the exceptions to jump out of them). I don't know why there's a need for a children set, probably it is application-specific, but you don't need this to keep your process table free of zombies, as such set already exists in the kernel.

P.S., as an important aside, for OCaml versions that are less than 4.13, if your server application is not performing any allocations or blocking calls, e.g., if it is doing while true; do () done or let rec loop () = loop (), then no signal handlers will be called. For more modern versions of OCaml (anything starting from 4.13 and above), this is no longer true, thanks to @octachron for the clarification.

P.P.S., and another caveat wrt the OCaml signal handling,

The reaper function might only run once because signals are not queued in Unix.

This is true in general, except that in OCaml signals are indeed queued. When a signal arrives it is queued and waits until the next synchronization point, i.e., the next allocation or blocking system call. Once it happens, OCaml calls signal handlers for each signal pending in the queue. Of course, there's no guarantee that no signals will be lost, so the conservative assumption that you have still works - we can't be sure that we receive one signal per child. But what can happen is that we can receive a signal for a child that we already reaped in the previous trigger of the reap signal handler. This is why I have ECHILD explicitly hushed.

P.P.P.S.,

In my reaper function, I call Sys.set_signal Sys.sigchld (Sys.Signal_handle reaper) again. Is that necessary? I notice that your implementation does not have it.

You don't need to reinstall the signal handler. Once it is installed it will be there until reinstalled.¹ Though, reinstalling it won't really hurt I would advise against having a recursive function that doesn't really call itself recursively but via a signal handler. It violates the single function single responsibility principle, as your function now has two orthogonal responsibilities - one is to reap the children and another is to install itself as the signal handler. It is much better to have the signal handler that does its work and install it separately and once in the server initialization procedure.

¹⁾ In a POSIX-compliant system, or in fact if sigaction and sigprocmask are available, OCaml uses sigaction to install signals and ensures that the signal handler is kept attached. Therefore it is guaranteed that on Linux and macOS you don't need to reinstall the signals. When BSD semantics is detected (and POSIX signals are not detected), the signal function will be used, but it will still keep the handler attached (as it is the BSD semantics). There is still a possibility of a system that lacks both POSIX-compliant signal system and BSD signals, i.e., pure System V, but most likely for such a system, our code will have many other problems, like most of the Unix functions will be unimplemented, including waitpid. With that said, you can reinstall the signal, it won't hurt, but I wouldn't do it.