Please, read the newest EDIT of this question.
Issue: I need to write a correct benchmark to compare a different work using different Thread Pool realizations (also from external libraries) using different methods of execution to other work using other Thread Pool realizations and to a work without any threading.
For example I have 24 tasks to complete and 10000 random Strings in benchmark state:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
@Param({"24"})
int amountOfTasks;
private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
private String[] strs = new String[10000];
@Setup
public void setup() {
for (int i = 0; i < strs.length; i ) {
strs[i] = String.valueOf(Math.random());
}
}
}
And two States as inner classes representing Work (string concat.) and ExecutorService setup and shutdown:
@State(Scope.Thread)
public static class Work {
public String doWork(String[] strs) {
StringBuilder conc = new StringBuilder();
for (String str : strs) {
conc.append(str);
}
return conc.toString();
}
}
@State(Scope.Benchmark)
public static class ExecutorServiceState {
ExecutorService service;
@Setup(Level.Iteration)
public void setupMethod() {
service = Executors.newFixedThreadPool(tts);
}
@TearDown(Level.Iteration)
public void downMethod() {
service.shutdownNow();
service = null;
}
}
More strict question is: How to write correct benchmark to measure average time of doWork(); first: without any threading, second: using .execute() method and third: using .submit() method getting results of futures later. Implementation that I tried to wrote:
@Benchmark
public void noThreading(Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i ) {
bh.consume(w.doWork(strs));
}
}
@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i ) {
e.service.execute(() -> bh.consume(w.doWork(strs)));
}
}
@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
String[] strss = new String[amountOfTasks];
for (int i = 0; i < amountOfTasks; i ) {
strss[i] = w.doWork(strs);
}
bh.consume(strss);
}
@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
Future[] strss = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i ) {
strss[i] = e.service.submit(() -> {return w.doWork(strs);});
}
for (Future future : strss) {
bh.consume(future.get());
}
}
After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
ThreadPoolSamples.executorServiceResult 24 avgt 3 19790020,180 ± 7676762,394 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 18881360,497 ± 340778,773 ns/op
ThreadPoolSamples.noThreadingResult 24 avgt 3 19283976,445 ± 471788,642 ns/op
noThreading and executorService maybe correct (but i am still unsure) and noThreadingResult and executorServiceResult doesn't look correct at all.
EDIT:
I find out some new details, but i think the result is still incorrect: as answered user17280749 in this answer that the thread pool wasn't waiting for submitted tasks to complete, but there wasn't only one issue: javac also somehow optimises doWork() method in the Work class (prob the result of that operation was predictable by JVM), so for simplicity I used Thread.sleep() as "work" and also setted amountOfTasks new two params: "1" and "128" to demonstrate that on 1 task threading will be slower than noThreading, and 24 and 128 will be approx. four times faster than noThreading, also to the correctness of measurement I setted thread pools starting up and shutting down in benchmark:
package io.denery;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.*;
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
@Param({"1", "24", "128"})
int amountOfTasks;
private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
@State(Scope.Thread)
public static class Work {
public void doWork() {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
@Benchmark
public void noThreading(Work w) {
for (int i = 0; i < amountOfTasks; i ) {
w.doWork();
}
}
@Benchmark
public void fixedThreadPool(Work w)
throws ExecutionException, InterruptedException {
ExecutorService service = Executors.newFixedThreadPool(tts);
Future[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i ) {
futures[i] = service.submit(w::doWork);
}
for (Future future : futures) {
future.get();
}
service.shutdown();
}
@Benchmark
public void cachedThreadPool(Work w)
throws ExecutionException, InterruptedException {
ExecutorService service = Executors.newCachedThreadPool();
Future[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i ) {
futures[i] = service.submit(() -> {
w.doWork();
});
}
for (Future future : futures) {
future.get();
}
service.shutdown();
}
}
And the result of this benchmark is:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 1169075,866 ± 47607,783 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 5208437,498 ± 4516260,543 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 13112351,066 ± 1905089,389 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 1166087,665 ± 61193,085 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 4721503,799 ± 313206,519 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 18337097,997 ± 5781847,191 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 1066035,522 ± 83736,346 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 25525744,055 ± 45422,015 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 136126357,514 ± 200461,808 ns/op
We see that error doesn't really huge, and thread pools with task 1 are slower than noThreading, but if you compare 25525744,055 and 4721503,799 the speedup is: 5.406 and it is faster somehow than excpected ~4, and if you compare 136126357,514 and 18337097,997 the speedup is: 7.4, and this fake speedup is growing with amountOfTasks, and i think it is still incorrect. I think to look at this using PrintAssembly to find out is there are any JVM optimisations.
EDIT:
As mentioned user17294549 in this answer, I used Thread.sleep() as imitation of real work and it doesn't correct because:
for real work: only 2 tasks can run simultaneously on a 2-core system for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system
I remembered about Blackhole.consumeCPU(long tokens) JMH method that "burns cycles" and imitating a work, there is JMH example and docs for it. So I changed work to:
@State(Scope.Thread)
public static class Work {
public void doWork() {
Blackhole.consumeCPU(4096);
}
}
And benchmarks for this change:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 301187,897 ± 95819,153 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 2421815,991 ± 545978,808 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 6648647,025 ± 30442,510 ns/op
ThreadPoolSamples.cachedThreadPool 2048 avgt 3 60229404,756 ± 21537786,512 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 293364,540 ± 10709,841 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 1459852,773 ± 160912,520 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 2846790,222 ± 78929,182 ns/op
ThreadPoolSamples.fixedThreadPool 2048 avgt 3 25102603,592 ± 1825740,124 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 10071,049 ± 407,519 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 241561,416 ± 15326,274 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 1300241,347 ± 148051,168 ns/op
ThreadPoolSamples.noThreading 2048 avgt 3 20683253,408 ± 1433365,542 ns/op
We see that fixedThreadPool is somehow slower than the example without threading, and when amountOfTasks is bigger, then difference between fixedThreadPool and noThreading examples is smaller. What's happening in there? Same phenomenon I saw with String concatenation in the beginning of this question, but I didn't report it. (btw, thanks who read this novel and trying to answer this question you're really help me)
CodePudding user response:
See answers to this question to learn how to write benchmarks in java.
... executorService maybe correct (but i am still unsure) ...
Benchmark (amountOfTasks) Mode Cnt Score Error Units ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
It doesn't look correct like a correct result:
the error 4460279,056
is 17 times greater than the base value 255102,966
.
Also you have an error in:
@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i ) {
e.service.execute(() -> bh.consume(w.doWork(strs)));
}
}
You submit the tasks to the ExecutorService
, but doesn't wait for them to complete.
CodePudding user response:
Look at this code:
@TearDown(Level.Iteration)
public void downMethod() {
service.shutdownNow();
service = null;
}
You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool
in a previous benchmark.
so for simplicity I used Thread.sleep() as "work"
Are you sure?
There is a big difference between real work and Thread.sleep()
:
- for real work: only 2 tasks can run simultaneously on a 2-core system
- for
Thread.sleep()
: any number of tasks can run simultaneously on a 2-core system