i have asynchronous gRPC server (gRPC version 1.40.X, Windows 10 x64).
When i try to shutdown the server, sometime is shutdowns gracefully and other times the thread responsible for running the gRPC gets stuck while shutdown procedure has began. the more clients spam the server the more likely it's to happen.
My shutdown procedure:
- Wait for the current
Que->AsyncNext
to finish - I call
TryToCancel
on any calls that has received theAsyncNotifyWhenDone
, this could be causing the issue as i can't really call this on all calls as for some theAsyncNotifyWhenDone
has not yet been recived, not sure how to approach this as callingfinish()
is done after GOT_EVENT - I call
Server->Shutdown()
, here it gets stuck and the thread hangs forever. - Then
Que->Shutdown()
- Then
DrainQue()
sync function that goes trough all the remaining calls withQue->Next
- Clears all call data
Here is the last trace when it happens:
I0921 08:44:56.768000000 17588 init.cc:167] grpc_init(void)
I0921 08:44:56.768000000 17588 completion_queue.cc:522] grpc_completion_queue_create_internal(completion_type=0, polling_type=0)
I0921 08:44:56.768000000 17588 server.cc:1536] grpc_server_shutdown_and_notify(server=000001538016BEE8, cq=0000015415E29A90, tag=0000004D816FF208)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800F7040, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133F40, storage=0000015380133F68)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800EFE00, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133E80, storage=0000015380133EA8)
I0921 08:44:56.768000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=00000153800EFC00, error={"created":"@1632206696.768000000","description":"Server Shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":832}, done=00007FFD47DE9440, done_arg=0000015380133DC0, storage=0000015380133DE8)
I0921 08:44:56.768000000 17588 chttp2_transport.cc:1752] ipv4:127.0.0.1:56702: Sending goaway err={"created":"@1632206696.768000000","description":"Server shutdown","file":"\grpc\src\core\lib\surface\server.cc","file_line":480,"grpc_status":0}
I0921 08:44:56.768000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=0000015415E29A90)
I0921 08:44:56.768000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 0 }, reserved=0000000000000000)
(beginning part of paths has been removed to respect my privacy)
When clients has fully disconnects then it unfreezes and continues with:
I0921 08:47:02.540000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=0000015415E29A90, tag=0000004D816FF208, error="No Error", done=00007FFD47DE9450, done_arg=0000015380127F00, storage=0000015380137F90)
I0921 08:47:02.540000000 17588 completion_queue.cc:701] cq_end_op_for_next(cq=00000153AE87C900, tag=0000015421DDF220, error="No Error", done=00007FFD47DE07B0, done_arg=0000015421DDF3C0, storage=0000015421DDF408)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: OP_COMPLETE: tag:0x4d816ff208 OK
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 0 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=0000015415E29A90, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[0000015415E29A90]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 completion_queue.cc:1425] grpc_completion_queue_destroy(cq=0000015415E29A90)
I0921 08:47:02.540000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=0000015415E29A90)
I0921 08:47:02.540000000 17588 init.cc:213] grpc_shutdown(void)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800f7040 ERROR
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800F7308)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800efe00 ERROR
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x153800efc00 ERROR
I0921 08:47:02.540000000 17588 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=00000153AE87C900)
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: OP_COMPLETE: tag:0x15421ddf220 OK
I0921 08:47:02.540000000 17588 completion_queue.cc:979] grpc_completion_queue_next(cq=00000153AE87C900, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=0000000000000000)
I0921 08:47:02.540000000 17588 completion_queue.cc:1083] RETURN_EVENT[00000153AE87C900]: QUEUE_SHUTDOWN
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FB660)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800F96E0)
I0921 08:47:02.540000000 17588 call.cc:590] grpc_call_unref(c=0000015421DDCAE0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=0000015421DDD5C8)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FABE0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000153800FA160)
I0921 08:47:02.540000000 17588 call.cc:590] grpc_call_unref(c=00000154221F10E0)
I0921 08:47:02.540000000 17588 metadata_array.cc:34] grpc_metadata_array_destroy(array=00000154221F1BC8)
I0921 08:47:02.540000000 17588 init.cc:213] grpc_shutdown(void)
I0921 08:47:03.297000000 28004 server.cc:1550] grpc_server_destroy(server=000001538016BEE8)
I0921 08:47:03.297000000 28004 init.cc:213] grpc_shutdown(void)
I0921 08:47:03.297000000 28004 completion_queue.cc:1425] grpc_completion_queue_destroy(cq=00000153AE87C900)
I0921 08:47:03.297000000 28004 completion_queue.cc:1419] grpc_completion_queue_shutdown(cq=00000153AE87C900)
I0921 08:47:03.297000000 28004 init.cc:213] grpc_shutdown(void)
this is somewhat related to my previous question, that i thought was solved: gRPC Server shutdown hangs forever
i do not really understand the trace, it only given me more questions then answers, for example why it's calling grpc_init, why are grpc_shutdown called 3*, why it's doing next in the shutdown function it self.
Any help would be appreciated
Best, Jan
CodePudding user response:
After some digging i have found a solution, Shutdown()
has two overloads, with deadline and without deadline, the one i have been using (without deadline) just waits forever, and the one with the deadline waits only the deadline time.
so my new shutdown looks like this:
const std::chrono::milliseconds WaitDuration = std::chrono::milliseconds(50);
const std::chrono::time_point<system_clock> Deadline = std::chrono::system_clock::now() WaitDuration;
Server->Shutdown(Deadline);
50 ms is for now selected arbitrarily as in documentation there is again no recommendations or best practices, so it's subject to change.