Advice on Scaling OptaPlanner using Azure Functions


I am trying to lift an OptaPlanner project into the cloud as an Azure Function. My goal is to improve scaling so that our company can solve more problems in parallel.

Background: We currently have the project running in a Docker container using the optaplanner-spring-boot-starter Maven package. This has worked well while limited to solving one problem at a time. However, we need to scale the system dramatically so that a much higher number of problems can be solved in a limited time frame. Therefore, I'm looking for a cloud-based solution that provides the extra CPU resources needed.

As a proof of concept, I created an Azure Function using the optaplanner-core Maven package and the custom domain objects from our existing solution. The Azure Function uses an HTTP trigger; this works and returns a solution, but performance is seriously degraded. I expect we will need to upgrade from the Consumption plan so that we can specify CPU and memory requirements. However, it appears that Azure is not scaling out additional instances as expected, leaving OptaPlanner to block itself.

Here is the driver code:

@FunctionName("solve")
public HttpResponseMessage run(
    @HttpTrigger(name = "req", methods = {HttpMethod.POST },authLevel = AuthorizationLevel.FUNCTION) 
    HttpRequestMessage<Schedule> request,
    final ExecutionContext context) {

    SolverConfig config = SolverConfig.createFromXmlResource("solverConfig.xml");
    
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("2");
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("10");
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("400");
    SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("AUTO");
    
    SolverManager<Schedule, UUID> solverManager = SolverManager.create(config ,managerConfig);
    
    SolverJob<Schedule, UUID> solverJob = solverManager.solve(UUID.randomUUID(), problem);

    // This is a blocking call until the solving ends
    Schedule solution = solverJob.getFinalBestSolution();

    return request.createResponseBuilder(HttpStatus.OK)
        .header("Content-Type", "application/json")
        .body(solution)
        .build();
}

Question 1: Does anyone know how to set up Azure so that each HTTP call scales out a new instance? I would like this so that the solvers aren't competing with each other for resources. I have tried to configure this by setting FUNCTIONS_WORKER_PROCESS_COUNT=1 and maxConcurrentRequests=1. I have also tried changing OptaPlanner's parallelSolverCount and moveThreadCount to different values, without any noticeable difference.
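
For clarity, those two settings live in different places: FUNCTIONS_WORKER_PROCESS_COUNT=1 is an application setting on the Function App, while maxConcurrentRequests belongs in host.json. A minimal host.json along those lines (a sketch of what I mean, not a verified fix) looks like this:

{
  "version": "2.0",
  "extensions": {
    "http": {
      "maxConcurrentRequests": 1
    }
  }
}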

Question 2: Should I be using Quarkus with Azure instead of the core MVN package? I've read that Geoffrey De Smet answered, "As for AWS Lambda (serverless): Quarkus is your friend".

I'm out of my element here as I haven't coded with Java for over 20 years AND I'm new to both Azure Functions and OptaPlanner. Any advice would be greatly appreciated.

Thanks!

CodePudding user response:

Consider using OptaPlanner's Quarkus integration to compile natively. That is better for serverless deployments because it dramatically reduces startup time. The READMEs of the OptaPlanner quickstarts that use Quarkus explain how.

By switching from OptaPlanner in plain Java to OptaPlanner with Quarkus (which isn't a big difference), a few magical things will happen:

  1. The parsing of solverConfig.xml with an XML parser won't happen at runtime during bootstrap, but at build time. If it's in src/main/resources/solverConfig.xml, Quarkus will automatically pick it up to configure the SolverManager to inject (see the sketch after this list).
  2. No reflection at runtime
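
As a rough illustration, a Quarkus REST resource for the same Schedule domain could look like the sketch below. The class name, REST path and JAX-RS annotations are illustrative assumptions, not code from the quickstarts; the key point is that Quarkus builds the SolverManager at build time and injects it, instead of creating it inside the handler.

import java.util.UUID;
import java.util.concurrent.ExecutionException;

import javax.inject.Inject;
import javax.ws.rs.POST;
import javax.ws.rs.Path;

import org.optaplanner.core.api.solver.SolverJob;
import org.optaplanner.core.api.solver.SolverManager;

@Path("/solve")
public class SolveResource {

    // Configured at build time from src/main/resources/solverConfig.xml
    // by the optaplanner-quarkus extension; no XML parsing at runtime.
    @Inject
    SolverManager<Schedule, UUID> solverManager;

    @POST
    public Schedule solve(Schedule problem) throws InterruptedException, ExecutionException {
        SolverJob<Schedule, UUID> job = solverManager.solve(UUID.randomUUID(), problem);
        // Blocks until solving ends, same as the Azure Function above.
        return job.getFinalBestSolution();
    }
}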

You will want to start 1 run per dataset. So parallelSolverCount shouldn't be higher than 1, and no run should handle 2 datasets (not even sequentially). If a run gets 8000 cpuMillis, you can use moveThreadCount=4 so it gets better results faster. If it only gets 1000 cpuMillis (= 1 core), don't use move threads. Verify that a run gets enough memory.
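
In code, that advice looks roughly like the sketch below (the concrete values are assumptions tied to the cpuMillis figures above, not measurements of your workload):

// One dataset per run: keep parallelSolverCount at 1.
SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("1");

SolverConfig config = SolverConfig.createFromXmlResource("solverConfig.xml");
// With 8000 cpuMillis (= 8 cores) per run, 4 move threads fit comfortably.
// With 1000 cpuMillis (= 1 core), leave this at "NONE".
config.setMoveThreadCount("4");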

CodePudding user response:

As for your Question 1, unfortunately I don't have a solution for Azure Functions, but let me point you to a blog post about running (and scaling) OptaPlanner workloads on OpenShift, which could address some of your concerns at the architecture level.

Scaling is only static for now (the number of replicas is specified manually), but it can be paired with KEDA to scale based on the number of pending datasets.

Important to note: the optaplanner-operator is only experimental at this point.
