JMH - How to correctly benchmark Thread Pools?-CodePudding

Please, read the newest EDIT of this question.

Issue: I need to write a correct benchmark to compare a different work using different Thread Pool realizations (also from external libraries) using different methods of execution to other work using other Thread Pool realizations and to a work without any threading.

For example I have 24 tasks to complete and 10000 random Strings in benchmark state:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"24"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
    private String[] strs = new String[10000];

    @Setup
    public void setup() {
        for (int i = 0; i < strs.length; i  ) {
            strs[i] = String.valueOf(Math.random());
        }
    }
}

And two States as inner classes representing Work (string concat.) and ExecutorService setup and shutdown:

@State(Scope.Thread)
public static class Work {
    public String doWork(String[] strs) {
        StringBuilder conc = new StringBuilder();
        for (String str : strs) {
            conc.append(str);
        }
        return conc.toString();
    }
}

@State(Scope.Benchmark)
public static class ExecutorServiceState {
    ExecutorService service;

    @Setup(Level.Iteration)
    public void setupMethod() {
        service = Executors.newFixedThreadPool(tts);
    }

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }
}

More strict question is: How to write correct benchmark to measure average time of doWork(); first: without any threading, second: using .execute() method and third: using .submit() method getting results of futures later. Implementation that I tried to wrote:

@Benchmark
public void noThreading(Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i  ) {
        bh.consume(w.doWork(strs));
    }
}

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i  ) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
    String[] strss = new String[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i  ) {
        strss[i] = w.doWork(strs);
    }
    bh.consume(strss);
}

@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
    Future[] strss = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i  ) {
        strss[i] = e.service.submit(() -> {return w.doWork(strs);});
    }
    for (Future future : strss) {
        bh.consume(future.get());
    }
}

After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op
ThreadPoolSamples.executorServiceResult               24  avgt    3  19790020,180 ± 7676762,394  ns/op
ThreadPoolSamples.noThreading                         24  avgt    3  18881360,497 ±  340778,773  ns/op
ThreadPoolSamples.noThreadingResult                   24  avgt    3  19283976,445 ±  471788,642  ns/op

noThreading and executorService maybe correct (but i am still unsure) and noThreadingResult and executorServiceResult doesn't look correct at all.

EDIT:

I find out some new details, but i think the result is still incorrect: as answered user17280749 in this answer that the thread pool wasn't waiting for submitted tasks to complete, but there wasn't only one issue: javac also somehow optimises doWork() method in the Work class (prob the result of that operation was predictable by JVM), so for simplicity I used Thread.sleep() as "work" and also setted amountOfTasks new two params: "1" and "128" to demonstrate that on 1 task threading will be slower than noThreading, and 24 and 128 will be approx. four times faster than noThreading, also to the correctness of measurement I setted thread pools starting up and shutting down in benchmark:

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"1", "24", "128"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    @State(Scope.Thread)
    public static class Work {
        public void doWork() {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Benchmark
    public void noThreading(Work w) {
        for (int i = 0; i < amountOfTasks; i  ) {
            w.doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(tts);
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i  ) {
            futures[i] = service.submit(w::doWork);
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }

    @Benchmark
    public void cachedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newCachedThreadPool();
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i  ) {
            futures[i] = service.submit(() -> {
                w.doWork();
            });
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }
}

And the result of this benchmark is:

Benchmark                         (amountOfTasks)  Mode  Cnt          Score         Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    1169075,866 ±   47607,783  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3    5208437,498 ± 4516260,543  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   13112351,066 ± 1905089,389  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    1166087,665 ±   61193,085  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3    4721503,799 ±  313206,519  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   18337097,997 ± 5781847,191  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3    1066035,522 ±   83736,346  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3   25525744,055 ±   45422,015  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3  136126357,514 ±  200461,808  ns/op

We see that error doesn't really huge, and thread pools with task 1 are slower than noThreading, but if you compare 25525744,055 and 4721503,799 the speedup is: 5.406 and it is faster somehow than excpected ~4, and if you compare 136126357,514 and 18337097,997 the speedup is: 7.4, and this fake speedup is growing with amountOfTasks, and i think it is still incorrect. I think to look at this using PrintAssembly to find out is there are any JVM optimisations.

EDIT:

As mentioned user17294549 in this answer, I used Thread.sleep() as imitation of real work and it doesn't correct because:

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system

I remembered about Blackhole.consumeCPU(long tokens) JMH method that "burns cycles" and imitating a work, there is JMH example and docs for it. So I changed work to:

@State(Scope.Thread)
public static class Work {
    public void doWork() {
        Blackhole.consumeCPU(4096);
    }
}

And benchmarks for this change:

Benchmark                         (amountOfTasks)  Mode  Cnt         Score          Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    301187,897 ±    95819,153  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3   2421815,991 ±   545978,808  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   6648647,025 ±    30442,510  ns/op
ThreadPoolSamples.cachedThreadPool             2048  avgt    3  60229404,756 ± 21537786,512  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    293364,540 ±    10709,841  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3   1459852,773 ±   160912,520  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   2846790,222 ±    78929,182  ns/op
ThreadPoolSamples.fixedThreadPool              2048  avgt    3  25102603,592 ±  1825740,124  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3     10071,049 ±      407,519  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3    241561,416 ±    15326,274  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3   1300241,347 ±   148051,168  ns/op
ThreadPoolSamples.noThreading                  2048  avgt    3  20683253,408 ±  1433365,542  ns/op

We see that fixedThreadPool is somehow slower than the example without threading, and when amountOfTasks is bigger, then difference between fixedThreadPool and noThreading examples is smaller. What's happening in there? Same phenomenon I saw with String concatenation in the beginning of this question, but I didn't report it. (btw, thanks who read this novel and trying to answer this question you're really help me)

CodePudding user response：

See answers to this question to learn how to write benchmarks in java.

... executorService maybe correct (but i am still unsure) ...

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op

It doesn't look correct like a correct result: the error 4460279,056 is 17 times greater than the base value 255102,966.

Also you have an error in:

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i  ) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

You submit the tasks to the ExecutorService, but doesn't wait for them to complete.

CodePudding user response：

Look at this code:

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }

You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool in a previous benchmark.

so for simplicity I used Thread.sleep() as "work"

Are you sure?
There is a big difference between real work and Thread.sleep():

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system