Home > Software engineering >  Server shutdown never finishes
Server shutdown never finishes

Time:09-23

i have asynchronous gRPC server (gRPC version 1.40.X, Windows 10 x64).

When i try to shutdown the server, sometime is shutdowns gracefully and other times the thread responsible for running the gRPC gets stuck while shutdown procedure has began. the more clients spam the server the more likely it's to happen.

My shutdown procedure:

  1. Wait for the current Que->AsyncNext to finish
  2. I call TryToCancel on any calls that has received the AsyncNotifyWhenDone, this could be causing the issue as i can't really call this on all calls as for some the AsyncNotifyWhenDone has not yet been recived, not sure how to approach this as calling finish() is done after GOT_EVENT
  3. I call Server->Shutdown(), here it gets stuck and the thread hangs forever.
  4. Then Que->Shutdown()
  5. Then DrainQue() sync function that goes trough all the remaining calls with Que->Next
  6. Clears all call data

Here is the last trace when it happens:

I0921 08:44:56.768000000 17588 init.cc:167] grpc_init(void)
I0921 08:44:56.768000000 17588 completion_queue.cc:522] grpc_completion_queue_create_internal(completion_type=0, polling_type=0)
I0921 08:44:56.768000000 17588 server.cc:1536] grpc_server_shutdown_and_notify(server=000001538016BEE8, cq=0000015415E29A90, tag=0000004D816FF208)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800F7040, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133F40, storage=0000015380133F68)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800EFE00, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133E80, storage=0000015380133EA8)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800EFC00, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133DC0, storage=0000015380133DE8)
I0921 08:44:56.768000000 17588 chttp2_transport.cc:1752] ipv4:127.0.0.1:56702: Sending goaway err={"created":"@1632206696.768000000","description":"Server shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":480,"grpc_status":0}
I0921 08:44:56.768000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=0000015415E29A90)
I0921 08:44:56.768000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 0 }, reserved=0000000000000000)

(beginning part of paths has been removed to respect my privacy)

When clients has fully disconnects then it unfreezes and continues with:

I0921 08:47:02.540000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=0000015415E29A90, tag=0000004D816FF208, error="No Error", done=00007FFD47DE9450, done_arg=0000015380127F00, storage=0000015380137F90)
I0921 08:47:02.540000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=0000015421DDF220, error="No Error", done=00007FFD47DE07B0, done_arg=0000015421DDF3C0, storage=0000015421DDF408)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: OP_COMPLETE: tag:0x4d816ff208 OK
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 0 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 completion_queue.cc:1425] grpc_completion_queue_destroy(cq=0000015415E29A90)
I0921 08:47:02.540000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=0000015415E29A90)
I0921 08:47:02.540000000 17588 init.cc:213] grpc_shutdown(void)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800f7040 ERROR
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800F7308)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800efe00 ERROR
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800efc00 ERROR
I0921 08:47:02.540000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=00000153AE87C900)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x15421ddf220 OK
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FB660)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800F96E0)
I0921 08:47:02.540000000 17588 call.cc:590] grpc_call_unref(c=0000015421DDCAE0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=0000015421DDD5C8)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FABE0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FA160)
I0921 08:47:02.540000000 17588 call.cc:590] grpc_call_unref(c=00000154221F10E0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000154221F1BC8)
I0921 08:47:02.540000000 17588 init.cc:213] grpc_shutdown(void)
I0921 08:47:03.297000000 28004 server.cc:1550] grpc_server_destroy(server=000001538016BEE8)
I0921 08:47:03.297000000 28004 init.cc:213] grpc_shutdown(void)
I0921 08:47:03.297000000 28004 completion_queue.cc:1425] grpc_completion_queue_destroy(cq=00000153AE87C900)
I0921 08:47:03.297000000 28004 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=00000153AE87C900)
I0921 08:47:03.297000000 28004 init.cc:213] grpc_shutdown(void)

this is somewhat related to my previous question, that i thought was solved: gRPC Server shutdown hangs forever

i do not really understand the trace, it only given me more questions then answers, for example why it's calling grpc_init, why are grpc_shutdown called 3*, why it's doing next in the shutdown function it self.

Any help would be appreciated

Best, Jan

CodePudding user response:

After some digging i have found a solution, Shutdown() has two overloads, with deadline and without deadline, the one i have been using (without deadline) just waits forever, and the one with the deadline waits only the deadline time.

so my new shutdown looks like this:

const std::chrono::milliseconds WaitDuration = std::chrono::milliseconds(50);
const std::chrono::time_point<system_clock> Deadline = std::chrono::system_clock::now()   WaitDuration;

Server->Shutdown(Deadline);

50 ms is for now selected arbitrarily as in documentation there is again no recommendations or best practices, so it's subject to change.

  • Related