Summary
I am doing a million floating point divisions in different threads, these threads share nothing from programmer's point of view i.e. no explicit locks are involved.
Following are the perf numbers when I ran java -jar /tmp/my-exps-1.0-SNAPSHOT.jar 1000000 100
on machines having 8, 16 and 32 vcores (n/2 cores with 2 threads per core).
- 1000000 is number of floating point divisions that each thread does, 100 is number of threads.
- All the processors belonged to same family - Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz.
- Using htop, I was seeing a 100% usage for all the cores when the program was running.
====================================
runtime.availableProcessors() = 8
=====runBenchmark() FINISHED in 3156601 millis ======== average time to complete one thread = 31566.01
====================================
runtime.availableProcessors() = 16
=====runBenchmark() FINISHED in 3297807 millis ======== average time to complete one thread = 32978.07
====================================
runtime.availableProcessors() = 32
=====runBenchmark() FINISHED in 3448590 millis ======== average time to complete one thread = 34485.9
====================================
Expectation
I expected it to scale linearly with number of cores i.e execution time should decrease in proportion to increase in CPU cores. However the above numbers are totally opposite. It increases by a small fraction with increment in vcores :D
Not very sure but some guesses that I made
- Context switch time increases with number of cores ?
- Some implicit locks are taken on data and those become more expensive as the race increase with number of cores.
Code Details
Thread
private static class BenchmarkThread implements Callable<BenchmarkResult> {
private final long numOperationsPerThread;
private BenchmarkThread(long numOperationsPerThread) {
this.numOperationsPerThread = numOperationsPerThread;
}
@Override
public BenchmarkResult call() {
double sum = 0;
long start = System.currentTimeMillis();
for (long i = 0; i < numOperationsPerThread; i ) {
double numerator = RANDOM.nextDouble();
double denominator = RANDOM.nextDouble();
double result = numerator / denominator;
sum = result;
}
long end = System.currentTimeMillis();
return new BenchmarkResult(Thread.currentThread().getName(),
(end - start),
sum);
}
}
Driver (yes resultFuture.get()
blocks but we are not counting that time, we are doing a sum of individual thread times timeToComplete = benchmarkResult.timeToCompleteMillis
)
Complete runnable example(edited - see EDIT 1 below)
public class VerticalScalingExp {
private final long numOperationsPerThread;
private final int numThreads;
private final List<Future<BenchmarkResult>> benchMarkResultsFuture;
private final ExecutorService executorService;
public VerticalScalingExp(long numOperationsPerThread, int numThreads) {
this.numOperationsPerThread = numOperationsPerThread;
this.numThreads = numThreads;
this.benchMarkResultsFuture = new ArrayList<>(numThreads);
this.executorService = Executors.newFixedThreadPool(numThreads);
}
public static void main(String[] args) throws Exception {
long numOperationsPerThread;
int numThreads;
if (args.length != 2) {
numOperationsPerThread = 1000000;
numThreads = 50;
} else {
numOperationsPerThread = Long.parseLong(args[0]);
numThreads = Integer.parseInt(args[1]);
}
new VerticalScalingExp(numOperationsPerThread, numThreads).runBenchmark();
}
private void runBenchmark() throws Exception {
try {
System.out.println("[START]====VerticalScalingExp.runBenchmark====" );
System.out.println("numOperationsPerThread = " numOperationsPerThread ", numThreads = " numThreads);
Runtime runtime = Runtime.getRuntime();
System.out.println("runtime.maxMemory() = " runtime.maxMemory());
System.out.println("runtime.freeMemory() = " runtime.freeMemory());
System.out.println("runtime.availableProcessors() = " runtime.availableProcessors());
long timeToComplete = 0;
for (int i = 0; i < numThreads; i ) {
benchMarkResultsFuture.add(executorService.submit(new BenchmarkThread(numOperationsPerThread)));
}
for (Future<BenchmarkResult> resultFuture : benchMarkResultsFuture) {
BenchmarkResult benchmarkResult = resultFuture.get();
System.out.println("resultFuture.get() = " benchmarkResult);
timeToComplete = benchmarkResult.timeToCompleteMillis;
}
double avg = (double) timeToComplete / numThreads;
System.out.println("=====runBenchmark() FINISHED in " timeToComplete
" millis ======== average time to complete one thread = " avg );
} finally {
executorService.shutdown();
}
}
private static class BenchmarkThread implements Callable<BenchmarkResult> {
private final long numOperationsPerThread;
private BenchmarkThread(long numOperationsPerThread) {
this.numOperationsPerThread = numOperationsPerThread;
}
@Override
public BenchmarkResult call() {
double sum = 0;
long start = System.currentTimeMillis();
ThreadLocalRandom random = ThreadLocalRandom.current();
for (long i = 0; i < numOperationsPerThread; i ) {
double numerator = random.nextDouble();
double denominator = random.nextDouble();
double result = numerator / denominator;
sum = result;
}
long end = System.currentTimeMillis();
return new BenchmarkResult(Thread.currentThread().getName(),
(end - start),
sum);
}
}
private static class BenchmarkResult {
private final String threadName;
private final long timeToCompleteMillis;
private final double sum;
public BenchmarkResult(String threadName, long timeToCompleteMillis, double sum) {
this.threadName = threadName;
this.timeToCompleteMillis = timeToCompleteMillis;
this.sum = sum;
}
@Override
public String toString() {
return "BenchmarkResult{"
"threadName='" threadName '\''
", timeToCompleteMillis=" timeToCompleteMillis
", sum =" sum
'}';
}
}
}
Questions
- I found this blog - https://www.theserverside.com/feature/Why-Java-Applications-Fail-to-Scale-Linearly which explains very nicely about data collisions that happen with increase in number of cores, is the code above suffering from the same ? (But in my code I am not sharing anything among threads ?) If yes, at what level these collisions are happening, Heap ? CPU caches ? something else ?
- Is this a commonly observed pattern ? (perf does not scale linearly with cpu cores)
- What can be done to make it scale as expected ?
Apologies for the long post :) Thanks :)
EDIT 1:
As suggested in comments and answer, I tried using ThreadLocalRandom. Performance increases a lot as compared to previous cases but still, performance decreases with increase in cores.
====================================
runtime.availableProcessors() = 8
=====runBenchmark() FINISHED in 1683 millis ======== average time to complete one thread = 16.83
====================================
runtime.availableProcessors() = 16
=====runBenchmark() FINISHED in 6622 millis ======== average time to complete one thread = 66.22
====================================
runtime.availableProcessors() = 32
=====runBenchmark() FINISHED in 19924 millis ======== average time to complete one thread = 199.24
====================================
CodePudding user response:
This could be the cause of the problem:
private static final Random RANDOM = new Random();
Because this is contended between all threads.
Try a ThreadLocalRandom instead.
Also, I would use a more reliable benchmarking approach like JMH.
CodePudding user response:
I don't think you're measuring what you think you're measuring. You have 100 tasks and you measure how much time each task takes to finish. Suppose each takes 2sec. So if we execute them one after another it'll be 2sec * 100.
Now suppose you run them in 8 threads and 8 cores. This doesn't (ideally) change the amount of time each task takes, so you still have 2sec for each task. And you again have 2sec * 100 of summed time. But the overall execution time changes - it's (2sec * 100) / 8
because this summed time is now spread across 8 cores instead of 1.
So what you need to measure is the total time it takes for the program to run. Just measure it in runBenchmark()
method:
private void runBenchmark() throws Exception {
try {
long started = System.nanoTime();
for (int i = 0; i < numThreads; i )
benchMarkResultsFuture.add(executorService.submit(new BenchmarkThread(numOperationsPerThread)));
for (Future<BenchmarkResult> resultFuture : benchMarkResultsFuture)
resultFuture.get();
long timeToComplete = (System.nanoTime() - started) / 1000;
System.out.println("=====runBenchmark() FINISHED in " timeToComplete);
} finally {
executorService.shutdown();
}
}