I ran into this scenario in the process of developing a lightweight library for asynchronous messaging. Trying to get an idea of the cost of creating lots of medium sized objects with short lifetimes, I wrote the following test:
import java.nio.ByteBuffer;
import java.util.Random;
public class MemPressureTest {
static final int SIZE = 4096;
static final class Bigish {
final ByteBuffer b;
public Bigish() {
this(ByteBuffer.allocate(SIZE));
}
public Bigish(ByteBuffer b) {
this.b = b;
}
public void fill(byte bt) {
b.clear();
for (int i = 0; i < SIZE; i++) {
b.put(bt);
}
}
}
public static void main(String[] args) {
Random random = new Random(1);
Bigish tmp = new Bigish();
for (int i = 0; i < 3e7; i++) {
tmp.fill((byte)random.nextInt(255));
}
}
}
On my laptop, with default GC settings, it runs in about 95 seconds:
/tmp$ time java -Xlog:gc MemPressureTest
[0.006s][info][gc] Using G1
real 1m35.552s
user 1m33.658s
sys 0m0.428s
This is where things get strange. I tweaked the program to allocate a new object for each iteration:
...
Random random = new Random(1);
for (int i = 0; i < 3e7; i++) {
Bigish tmp = new Bigish();
tmp.fill((byte)random.nextInt(255));
}
...
In theory, this should add some small overhead, but none of the objects should ever be promoted out of Eden. At best, I'd expect the runtimes to be close to identical. However, this test completes in ~17 seconds:
/tmp$ time java -Xlog:gc MemPressureTest
[0.007s][info][gc] Using G1
[0.090s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 23M->1M(130M) 1.304ms
[0.181s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(130M) 0.870ms
[0.247s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 76M->0M(130M) 0.844ms
[0.317s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.793ms
[0.381s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.859ms
[lots of similar GC pauses, snipped for brevity]
[16.608s][info][gc] GC(482) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.765ms
[16.643s][info][gc] GC(483) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.580ms
[16.676s][info][gc] GC(484) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.841ms
real 0m16.766s
user 0m16.578s
sys 0m0.576s
I ran both versions several times, with near-identical results to the above. I feel like I must be missing something very obvious. Am I going insane? What could explain this difference in performance?
=== EDIT ===
I rewrote the test using JMH as per apangin and dan1st's suggestions:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.nio.ByteBuffer;
import java.util.Random;
public class MemPressureTest {
static final int SIZE = 4096;
@State(Scope.Benchmark)
public static class Bigish {
final ByteBuffer b;
private Blackhole blackhole;
@Setup(Level.Trial)
public void setup(Blackhole blackhole) {
this.blackhole = blackhole;
}
public Bigish() {
this.b = ByteBuffer.allocate(SIZE);
}
public void fill(byte bt) {
b.clear();
for (int i = 0; i < SIZE; i++) {
b.put(bt);
}
blackhole.consume(b);
}
}
static Random random = new Random(1);
@Benchmark
public static void test1(Blackhole blackhole) {
Bigish tmp = new Bigish();
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));
blackhole.consume(tmp);
}
@Benchmark
public static void test2(Bigish perm) {
perm.fill((byte) random.nextInt(255));
}
}
Still, the second test is much slower:
> Task :jmh
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test1
# Run progress: 0.00% complete, ETA 00:16:40
# Fork: 1 of 5
# Warmup Iteration 1: 2183998.556 ops/s
# Warmup Iteration 2: 2281885.941 ops/s
# Warmup Iteration 3: 2239644.018 ops/s
# Warmup Iteration 4: 1608047.994 ops/s
# Warmup Iteration 5: 1992314.001 ops/s
Iteration 1: 2053657.571 ops/s
Iteration 2: 2054957.773 ops/s
Iteration 3: 2051595.233 ops/s
Iteration 4: 2054878.239 ops/s
Iteration 5: 2031111.214 ops/s
# Run progress: 10.00% complete, ETA 00:15:04
# Fork: 2 of 5
# Warmup Iteration 1: 2228594.345 ops/s
# Warmup Iteration 2: 2257983.355 ops/s
# Warmup Iteration 3: 2063130.244 ops/s
# Warmup Iteration 4: 1629084.669 ops/s
# Warmup Iteration 5: 2063018.496 ops/s
Iteration 1: 1939260.937 ops/s
Iteration 2: 1791414.018 ops/s
Iteration 3: 1914987.221 ops/s
Iteration 4: 1969484.898 ops/s
Iteration 5: 1891440.624 ops/s
# Run progress: 20.00% complete, ETA 00:13:23
# Fork: 3 of 5
# Warmup Iteration 1: 2228664.719 ops/s
# Warmup Iteration 2: 2263677.403 ops/s
# Warmup Iteration 3: 2237032.464 ops/s
# Warmup Iteration 4: 2040040.243 ops/s
# Warmup Iteration 5: 2038848.530 ops/s
Iteration 1: 2023934.952 ops/s
Iteration 2: 2041874.437 ops/s
Iteration 3: 2002858.770 ops/s
Iteration 4: 2039727.607 ops/s
Iteration 5: 2045827.670 ops/s
# Run progress: 30.00% complete, ETA 00:11:43
# Fork: 4 of 5
# Warmup Iteration 1: 2105430.688 ops/s
# Warmup Iteration 2: 2279387.762 ops/s
# Warmup Iteration 3: 2228346.691 ops/s
# Warmup Iteration 4: 1438607.183 ops/s
# Warmup Iteration 5: 2059319.745 ops/s
Iteration 1: 1112543.932 ops/s
Iteration 2: 1977077.976 ops/s
Iteration 3: 2040147.355 ops/s
Iteration 4: 1975766.032 ops/s
Iteration 5: 2003532.092 ops/s
# Run progress: 40.00% complete, ETA 00:10:02
# Fork: 5 of 5
# Warmup Iteration 1: 2240203.848 ops/s
# Warmup Iteration 2: 2245673.994 ops/s
# Warmup Iteration 3: 2096257.864 ops/s
# Warmup Iteration 4: 2046527.740 ops/s
# Warmup Iteration 5: 2050379.941 ops/s
Iteration 1: 2050691.989 ops/s
Iteration 2: 2057803.100 ops/s
Iteration 3: 2058634.766 ops/s
Iteration 4: 2060596.595 ops/s
Iteration 5: 2061282.107 ops/s
Result "com.xxx.MemPressureTest.test1":
1972203.484 ±(99.9%) 142904.698 ops/s [Average]
(min, avg, max) = (1112543.932, 1972203.484, 2061282.107), stdev = 190773.683
CI (99.9%): [1829298.786, 2115108.182] (assumes normal distribution)
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test2
# Run progress: 50.00% complete, ETA 00:08:22
# Fork: 1 of 5
# Warmup Iteration 1: 282751.407 ops/s
# Warmup Iteration 2: 283333.984 ops/s
# Warmup Iteration 3: 293785.079 ops/s
# Warmup Iteration 4: 268403.105 ops/s
# Warmup Iteration 5: 280054.277 ops/s
Iteration 1: 279093.118 ops/s
Iteration 2: 282782.996 ops/s
Iteration 3: 282688.921 ops/s
Iteration 4: 291578.963 ops/s
Iteration 5: 294835.777 ops/s
# Run progress: 60.00% complete, ETA 00:06:41
# Fork: 2 of 5
# Warmup Iteration 1: 283735.550 ops/s
# Warmup Iteration 2: 283536.547 ops/s
# Warmup Iteration 3: 294403.173 ops/s
# Warmup Iteration 4: 284161.042 ops/s
# Warmup Iteration 5: 281719.077 ops/s
Iteration 1: 276838.416 ops/s
Iteration 2: 284063.117 ops/s
Iteration 3: 282361.985 ops/s
Iteration 4: 289125.092 ops/s
Iteration 5: 294236.625 ops/s
# Run progress: 70.00% complete, ETA 00:05:01
# Fork: 3 of 5
# Warmup Iteration 1: 284567.336 ops/s
# Warmup Iteration 2: 283548.713 ops/s
# Warmup Iteration 3: 294317.511 ops/s
# Warmup Iteration 4: 283501.873 ops/s
# Warmup Iteration 5: 283691.306 ops/s
Iteration 1: 283462.749 ops/s
Iteration 2: 284120.587 ops/s
Iteration 3: 264878.952 ops/s
Iteration 4: 292681.168 ops/s
Iteration 5: 295279.759 ops/s
# Run progress: 80.00% complete, ETA 00:03:20
# Fork: 4 of 5
# Warmup Iteration 1: 284823.519 ops/s
# Warmup Iteration 2: 283913.207 ops/s
# Warmup Iteration 3: 294401.483 ops/s
# Warmup Iteration 4: 283998.027 ops/s
# Warmup Iteration 5: 283987.408 ops/s
Iteration 1: 278014.618 ops/s
Iteration 2: 283431.491 ops/s
Iteration 3: 284465.945 ops/s
Iteration 4: 293202.934 ops/s
Iteration 5: 290059.807 ops/s
# Run progress: 90.00% complete, ETA 00:01:40
# Fork: 5 of 5
# Warmup Iteration 1: 285598.809 ops/s
# Warmup Iteration 2: 284434.916 ops/s
# Warmup Iteration 3: 294355.547 ops/s
# Warmup Iteration 4: 284307.860 ops/s
# Warmup Iteration 5: 284297.362 ops/s
Iteration 1: 283676.043 ops/s
Iteration 2: 283609.750 ops/s
Iteration 3: 284575.124 ops/s
Iteration 4: 293564.269 ops/s
Iteration 5: 216267.883 ops/s
Result "com.xxx.MemPressureTest.test2":
282755.844 ±(99.9%) 11599.112 ops/s [Average]
(min, avg, max) = (216267.883, 282755.844, 295279.759), stdev = 15484.483
CI (99.9%): [271156.731, 294354.956] (assumes normal distribution)
The JMH Blackhole should prevent code removal, and the fact that JMH is now in charge of running separate iterations should prevent parallelization, right? Shouldn't Blackhole also stop the object from being confined to the stack? Also, wouldn't there be more variation between warmup iterations if HotSpot were still doing a significant amount of optimization?
CodePudding user response:
Disclaimer
The following are just theories; they might be completely wrong. I am neither a JIT nor a GC expert.
Code removal
I think the JIT just optimized away (some of) your code. If that is the case, it detected that you are not actually using the stored values and simply removed the code that allocates/fills the object. Things like JMH's Blackhole might help you with that.
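If you want to rule this theory out without JMH, one option (a minimal sketch, reusing the Bigish class and the Random from the question) is to make each filled buffer observable, so the JIT cannot prove the work is unused:
// inside MemPressureTest.main, replacing the original loop
long sink = 0;
for (int i = 0; i < 3e7; i++) {
    Bigish tmp = new Bigish();
    tmp.fill((byte) random.nextInt(255));
    sink += tmp.b.get(0); // read something back from the filled buffer
}
System.out.println(sink); // print the accumulated value so it cannot be elided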
Parallelization
It could also be the case that it parallelized your code. Since different loop iterations are independent of each other, it is possible to execute multiple iterations in parallel.
Stack allocations
Another possibility is that it detected that the object is confined to a very narrow scope and discarded immediately. Because of that, it might have moved your object to the stack, where it can be allocated/pushed and deallocated/popped quickly.
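One way to test this theory (a sketch, assuming a HotSpot JVM where these C2 flags are available) is to rerun the per-iteration-allocation version with escape analysis or allocation elimination disabled and compare the timings:
/tmp$ time java -XX:-DoEscapeAnalysis -Xlog:gc MemPressureTest
/tmp$ time java -XX:-EliminateAllocations -Xlog:gc MemPressureTest
If scalar replacement were responsible for the speedup, disabling it should bring the runtime of the allocating version back toward the slow case.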
Closing thoughts
The JIT may always do unexpected things. Don't optimize prematurely and don't guess where your bottlenecks are. Measure your performance before and after any change you make. Performance might not be lost where you expect it to be. This is the case in other languages too, but especially in Java.
And, as apangin mentioned in the comments, you should really use JMH.
CodePudding user response:
Creating a new ByteBuffer right before filling it indeed helps the JIT compiler produce better-optimized code when you use the relative put methods, and here is why.
- The JIT compilation unit is a method. The HotSpot JVM does not perform whole-program optimization, which is quite hard even in theory due to the dynamic nature of Java and the open-world runtime environment.
- When the JVM compiles the test1 method, buffer instantiation appears in the same compilation scope as the filling:
Bigish tmp = new Bigish();
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));
The JVM knows everything about the created buffer: its exact size and its backing array; it knows the buffer has not been published yet and that no other thread sees it. So the JVM can aggressively optimize the fill loop: vectorize it using AVX instructions and unroll it to set 512 bytes at a time (a note after the listings shows one way to reproduce such disassembly):
0x000001cdf60c9ae0: mov %r9d,%r8d
0x000001cdf60c9ae3: movslq %r8d,%r9
0x000001cdf60c9ae6: add %r11,%r9
0x000001cdf60c9ae9: vmovdqu %ymm0,0x10(%rcx,%r9,1)
0x000001cdf60c9af0: vmovdqu %ymm0,0x30(%rcx,%r9,1)
0x000001cdf60c9af7: vmovdqu %ymm0,0x50(%rcx,%r9,1)
0x000001cdf60c9afe: vmovdqu %ymm0,0x70(%rcx,%r9,1)
0x000001cdf60c9b05: vmovdqu %ymm0,0x90(%rcx,%r9,1)
0x000001cdf60c9b0f: vmovdqu %ymm0,0xb0(%rcx,%r9,1)
0x000001cdf60c9b19: vmovdqu %ymm0,0xd0(%rcx,%r9,1)
0x000001cdf60c9b23: vmovdqu %ymm0,0xf0(%rcx,%r9,1)
0x000001cdf60c9b2d: vmovdqu %ymm0,0x110(%rcx,%r9,1)
0x000001cdf60c9b37: vmovdqu %ymm0,0x130(%rcx,%r9,1)
0x000001cdf60c9b41: vmovdqu %ymm0,0x150(%rcx,%r9,1)
0x000001cdf60c9b4b: vmovdqu %ymm0,0x170(%rcx,%r9,1)
0x000001cdf60c9b55: vmovdqu %ymm0,0x190(%rcx,%r9,1)
0x000001cdf60c9b5f: vmovdqu %ymm0,0x1b0(%rcx,%r9,1)
0x000001cdf60c9b69: vmovdqu %ymm0,0x1d0(%rcx,%r9,1)
0x000001cdf60c9b73: vmovdqu %ymm0,0x1f0(%rcx,%r9,1)
0x000001cdf60c9b7d: mov %r8d,%r9d
0x000001cdf60c9b80: add $0x200,%r9d
0x000001cdf60c9b87: cmp %r10d,%r9d
0x000001cdf60c9b8a: jl 0x000001cdf60c9ae0
- You use the relative put method. It not only sets a byte in the ByteBuffer, but also updates the position field. Note that the above vectorized loop does not update the position in memory: the JVM sets it just once after the loop, as it is allowed to do so as long as nobody can observe an inconsistent state of the buffer.
- Now try to publish the ByteBuffer before filling:
Bigish tmp = new Bigish();
volatileField = tmp; // publish
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));
The loop optimization breaks; now the array bytes are filled one by one, and the position field is incremented accordingly.
0x000001829b18ca5c: nopl 0x0(%rax)
0x000001829b18ca60: cmp %r11d,%esi
0x000001829b18ca63: jge 0x000001829b18ce34 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
; - java.nio.Buffer::nextPutIndex@10 (line 721)
; - java.nio.HeapByteBuffer::put@6 (line 209)
; - bench.MemPressureTest$Bigish::fill@22 (line 33)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca69: mov %esi,
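For reference, disassembly like the listings above can be reproduced (a sketch, assuming the usual JMH uber-jar produced by the archetype, here called benchmarks.jar) by running the benchmark under JMH's perfasm profiler on Linux, which needs the hsdis disassembler library:
java -jar benchmarks.jar MemPressureTest.test1 -prof perfasm
Alternatively, a plain run with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (again with hsdis installed) prints the generated code for every compiled method.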