Webflux - hanging requests when using bounded elastic Scheduler-CodePudding

I have a service written with webflux that has high load (40 request per second) and I'm encountering a really bad latency and performance issues with behaviours I can't explain: at some point during peaks, the service hangs in random locations as if it doesn't have any threads to handle the request. The service does however have several calls to different service that aren't reactive - using WebClient, and another call to a main service that retrieves the main data through an sdk wrapped in Mono.fromCallable(..).publishOn(Schedulers.boundedElastic()).

So the flow is:

upon request such as Mono<Request>
convert to internal object Mono<RequestAggregator>
call GCP to get JWT token and then call some service to get data using webclient
call the main service using Mono.fromCallable(MainService.getData(RequestAggregator)).publishOn(Schedulers.boundedElastic())
call another service to get more data (same as 3)
call another service to get more data (same as 3)
do some manipulation with all the data and return a Mono<Response>

the webclient calls look something like that:

Mono.fromCallable(() -> GoogleService.getToken(account, clientId)
                .buildIapRequest(REQUEST_URL))
                .map(httpRequest -> httpRequest.getHeaders().getAuthorization())
                .flatMap(authToken -> webClient.post()
                        .uri("/call/some/endpoint")
                        .header(HttpHeaders.AUTHORIZATION, authToken)
                        .header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
                        .header(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
                        .body(BodyInserters.fromValue(countries))
                        .retrieve()
                        .onStatus(HttpStatus::isError, clientResponse -> {
                            log.error("{} got status code: {}",
                                    ERROR_MSG_ERROR, clientResponse.statusCode());
                            return Mono.error(new SomeWebClientException(STATE_ABBREVIATIONS_ERROR));
                        })
                        .bodyToMono(SomeData.class));

sometimes step 6 hangs for more than 11 minutes, and this service does not have any issues. It's not reactive but responses take ~400ms

Another thing worth mentioning is that MainService is a heavy IO operation that might take 1 minute or more.

I feel like a lot of request hangs on MainService and theren't any threads left for the other operations, does that make sense? if so, how does one solve something like that?

Can someone suggest any reason for this issue? I'm all out of ideas

CodePudding user response：

It's not possible to tell for sure without knowing the full application, but indeed the blocking IO operation is the most likely culprit.

Schedulers.boundedElastic(), as its name suggests, is bounded. By default the bound is "ten times the number of available CPU cores", so on a 2-core machine it would be 20. If you have more concurrent requests than the limit, the rest is put into a queue waiting for a free thread indefinitely. If you need more concurrency than that, you should consider setting up your own scheduler using Scheduler.fromExecutor with a higher limit.