I have a class which counts events. It looks like this:
public class Counter {
    private static final long BUCKET_SIZE_NS = Duration.ofMillis(100).toNanos();
    ...
    private long nextBucketNum() {
        return clock.getTime() / BUCKET_SIZE_NS;
    }
    public void count() {
        ...
        final long num = nextBucketNum();
        ...
    }
    ...
}
If I remove the static modifier from the field (intending to make it a class parameter), the counting throughput degrades by more than 25% according to the JMH report.
The generated bytecode for the static case:
INVOKEINTERFACE Clock.getTime ()J (itf)
GETSTATIC Counter.BUCKET_SIZE_NS : J
LDIV
And for the non-static one:
INVOKEINTERFACE Clock.getTime ()J (itf)
ALOAD 0
GETFIELD Counter.BUCKET_SIZE_NS : J
LDIV
Am I doing the performance test wrong and experiencing some sort of dead-code elimination, or is this a legitimate micro-optimization at some level such as the JIT or hyperthreading?
The difference exists in both single-threaded and multi-threaded benchmarks.
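For reference, the two variants being compared can be sketched as self-contained classes (the Clock interface below is a stand-in with an assumed single getTime() method, since its declaration isn't shown above):

```java
import java.time.Duration;

// Stand-in for the Clock dependency used in the question.
interface Clock {
    long getTime(); // current time in nanoseconds
}

// Variant 1: the divisor is a static final field the JIT can treat as a constant.
class StaticCounter {
    private static final long BUCKET_SIZE_NS = Duration.ofMillis(100).toNanos();
    private final Clock clock;
    StaticCounter(Clock clock) { this.clock = clock; }
    long nextBucketNum() { return clock.getTime() / BUCKET_SIZE_NS; }
}

// Variant 2: the divisor is a final instance field, loaded on every call.
class InstanceCounter {
    private final long bucketSizeNs;
    private final Clock clock;
    InstanceCounter(Clock clock, Duration bucketSize) {
        this.clock = clock;
        this.bucketSizeNs = bucketSize.toNanos();
    }
    long nextBucketNum() { return clock.getTime() / bucketSizeNs; }
}

public class CounterVariants {
    public static void main(String[] args) {
        Clock fixed = () -> 250_000_000L; // 250 ms expressed in nanoseconds
        System.out.println(new StaticCounter(fixed).nextBucketNum());   // 2
        System.out.println(new InstanceCounter(fixed, Duration.ofMillis(100)).nextBucketNum()); // 2
    }
}
```

Both variants compute the same bucket numbers; only the way the divisor is stored differs.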
Environment:
JMH version: 1.34
VM version: JDK 1.8.0_161, Java HotSpot(TM) 64-Bit Server VM, 25.161-b12
macOS Monterey 12.2.1
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
CodePudding user response:
The JVM optimizes static final fields as true constants, but it doesn't do the same for instance fields. In theory, the code could be analyzed to prove that the field always holds the same value, but that's more complicated. In addition, final instance fields aren't treated as truly final because of the reflection backdoor. There's a Jira item which tracks this issue, but I cannot find it right now. Internally, the JDK uses a special @Stable annotation to optimize accesses to final instance fields.
But even if you could use this annotation, extra analysis would still be required to prove that the field holds the same value for all instances. In most cases, the code which assigns the field needs to be fully inlined for the analysis to work. What if the Duration.ofMillis call were implemented to return a random number? Of course it isn't, but without that analysis, how could the compiler be certain?
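To illustrate the reflection backdoor: even a final instance field can be rewritten after construction, which is one reason the JIT cannot blindly treat instance finals as constants. (Holder is a hypothetical class made up for this sketch; note that on recent JDKs, record and hidden-class fields are exempt from this trick.)

```java
import java.lang.reflect.Field;

// A final instance field, assigned once in the constructor.
class Holder {
    final long divisor;
    Holder() { this.divisor = 100; }
}

public class ReflectionBackdoor {
    public static void main(String[] args) throws Exception {
        Holder h = new Holder();
        Field f = Holder.class.getDeclaredField("divisor");
        f.setAccessible(true);            // bypass language access checks
        f.setLong(h, 7);                  // mutate the "final" field
        System.out.println(f.getLong(h)); // prints 7, not 100
    }
}
```

If the JIT had baked 100 into compiled code as a constant, this mutation would silently be ignored, so the optimizer has to be conservative.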
CodePudding user response:
There are 2 optimizations at play here:
- Constant folding: the static final field is pre-computed and written directly into the code blob (the end result of JIT compilation). This translates into a performance win compared to a memory load (reading the field).
- Arithmetic simplification: when dividing by a potentially variable quantity, the compiler has to use a division instruction, which is very expensive. When dividing by a constant, the compiler can come up with a cheaper alternative. This is particularly true when dividing (and multiplying) by powers of 2, which can be simplified into shift instructions.
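The power-of-2 simplification can be sketched in plain Java (the values here are illustrative; the shift and multiply-by-reciprocal rewrites are what the JIT performs under the hood):

```java
public class DivByConstant {
    public static void main(String[] args) {
        long t = 123_456_789L;
        // For non-negative operands, dividing by a power of 2 is a right
        // shift; the JIT emits the shift (plus a sign fix-up for negative
        // inputs) instead of an expensive division instruction.
        System.out.println(t / 8 == (t >> 3));      // true
        // For other constants, such as BUCKET_SIZE_NS = 100_000_000, the
        // JIT replaces the division with a multiply-high-by-reciprocal
        // sequence, still far cheaper than a hardware divide.
        System.out.println(t / 100_000_000L);       // 1
    }
}
```

None of this is possible when the divisor is an ordinary instance field, because its value is not known at JIT-compile time.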
To look further into this, I would recommend running your benchmark with perfasm to see where the cycles went and what assembly code was generated.
Happy hunting!