How to make a celery worker stop receiving new tasks (Kubernetes)


So we have a Kubernetes cluster running some pods with celery workers. The workers run on Python 3.6, and the Celery version is 3.1.2 (I know, really old; we are working on upgrading it). We have also set up an autoscaling mechanism to add more celery workers on the fly.

The problem is the following. Let's say we have 5 workers at any given time. Then a lot of tasks come in, increasing the CPU/RAM usage of the pods. That triggers a scale-up event, adding, say, two more celery worker pods. Those two new workers pick up some long-running tasks. Before they finish running those tasks, Kubernetes triggers a scale-down event, killing those two workers and the long-running tasks along with them.

Also, for legacy reasons, we do not have a retry mechanism if a task is not completed (and we cannot implement one right now).

So my question is: is there a way to tell Kubernetes to wait for the celery worker to finish all of its pending tasks? I suppose the solution must also include some way to tell the celery worker to stop accepting new tasks. I know Kubernetes has hooks for handling this kind of situation, but I do not know what to put in those hooks, because I do not know how to make the celery worker stop receiving tasks.

Any idea?

CodePudding user response:

I wrote a blog post exactly on that topic - check it out.

When Kubernetes decides to kill a pod, it first sends a SIGTERM signal so your application has time to shut down gracefully; if the application has not exited by the end of the grace period, Kubernetes kills it with SIGKILL.

This period between SIGTERM and SIGKILL can be tuned with terminationGracePeriodSeconds (more about it here).

In other words, if your longest task takes 5 minutes, make sure to set this value to something higher than 300 seconds.
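
For illustration, a worker Deployment might look like the sketch below; the names, image, and the 360-second figure are placeholders, assuming your longest task runs about 5 minutes:

```yaml
# Sketch of a worker Deployment; names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      # Give the worker longer than its longest task (5 min = 300 s)
      # to drain before Kubernetes escalates from SIGTERM to SIGKILL.
      terminationGracePeriodSeconds: 360
      containers:
        - name: worker
          image: registry.example.com/celery-worker:latest  # placeholder
          command: ["celery", "worker", "-A", "myapp", "--loglevel=INFO"]
```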

Celery handles those signals for you, as you can see here (I assume this holds for your version as well):

Shutdown should be accomplished using the TERM signal.

When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates. If these tasks are important, you should wait for it to finish before doing anything drastic, like sending the KILL signal.

As explained in the docs, you can set acks_late=True so that a task which was stopped accidentally is redelivered and runs again.
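
A minimal sketch of that option; the module name, app name, and broker URL here are assumptions:

```python
# tasks.py -- a minimal sketch; app name and broker URL are made up.
from celery import Celery

app = Celery("myapp", broker="amqp://guest@localhost//")

# With acks_late the message is acknowledged only after the task body
# finishes, so a task killed mid-run stays unacked on the broker and
# gets redelivered to another worker.
@app.task(acks_late=True)
def long_running(job_id):
    ...
```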

Another thing that I didn't find documentation for (though I'm almost sure I saw it somewhere): a Celery worker won't receive new tasks after getting a SIGTERM, so you should be safe to terminate the worker (it might require setting worker_prefetch_multiplier = 1 as well; see the sketch below).
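
Since you are on Celery 3.1, note that the setting names differ between major versions; a sketch, assuming a classic celeryconfig.py module:

```python
# celeryconfig.py -- sketch using the old-style uppercase names that
# apply on Celery 3.x (Celery 4+ renamed them, noted per line).
CELERY_ACKS_LATE = True          # Celery 4+: task_acks_late
CELERYD_PREFETCH_MULTIPLIER = 1  # Celery 4+: worker_prefetch_multiplier
# With a prefetch multiplier of 1, each worker process reserves only the
# task it is currently executing, so no extra messages sit unacknowledged
# in the worker when SIGTERM arrives.
```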
