I have a class which counts events. It looks like this:
public class Counter {
    private static final long BUCKET_SIZE_NS = Duration.ofMillis(100).toNanos();
    ...
    private long nextBucketNum() {
        return clock.getTime() / BUCKET_SIZE_NS;
    }
    public void count() {
        ...
        final long num = nextBucketNum();
        ...
    }
    ...
}
If I remove the static modifier from the field (intending to make it a class parameter), the counting throughput degrades by more than 25% according to the JMH report.
The generated bytecode for the static case:
INVOKEINTERFACE Clock.getTime ()J (itf)
GETSTATIC Counter.BUCKET_SIZE_NS : J
LDIV
And for the non-static one:
INVOKEINTERFACE Clock.getTime ()J (itf)
ALOAD 0
GETFIELD Counter.BUCKET_SIZE_NS : J
LDIV
Am I doing the performance test wrong and experiencing some sort of dead-code elimination, or is this a legitimate micro-optimization at some level such as the JIT or hyperthreading?
The difference exists in both single-threaded and multi-threaded benchmarks.
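For reference, the two variants being compared can be sketched as self-contained classes (the Clock interface below is a stand-in with an assumed single getTime() method, since its declaration isn't shown above):

```java
import java.time.Duration;

// Stand-in for the Clock dependency used in the question.
interface Clock {
    long getTime(); // current time in nanoseconds
}

// Variant 1: the divisor is a static final field the JIT can treat as a constant.
class StaticCounter {
    private static final long BUCKET_SIZE_NS = Duration.ofMillis(100).toNanos();
    private final Clock clock;
    StaticCounter(Clock clock) { this.clock = clock; }
    long nextBucketNum() { return clock.getTime() / BUCKET_SIZE_NS; }
}

// Variant 2: the divisor is a final instance field, loaded on every call.
class InstanceCounter {
    private final long bucketSizeNs;
    private final Clock clock;
    InstanceCounter(Clock clock, Duration bucketSize) {
        this.clock = clock;
        this.bucketSizeNs = bucketSize.toNanos();
    }
    long nextBucketNum() { return clock.getTime() / bucketSizeNs; }
}

public class CounterVariants {
    public static void main(String[] args) {
        Clock fixed = () -> 250_000_000L; // 250 ms expressed in nanoseconds
        System.out.println(new StaticCounter(fixed).nextBucketNum());   // 2
        System.out.println(new InstanceCounter(fixed, Duration.ofMillis(100)).nextBucketNum()); // 2
    }
}
```

Both variants compute the same bucket numbers; only the way the divisor is stored differs.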
Environment:
JMH version: 1.34
VM version: JDK 1.8.0_161, Java HotSpot(TM) 64-Bit Server VM, 25.161-b12
macOS Monterey 12.2.1
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
CodePudding user response:
The JVM optimizes static final fields as true constants, but it doesn't do the same for instance fields. In theory, the code could be analyzed to prove that the field always holds the same value, but that's more complicated. In addition, final instance fields aren't treated as truly final because of the reflection backdoor. There's a Jira item which tracks this issue, but I cannot find it right now. Internally, the JDK uses a special @Stable annotation to optimize accesses to final instance fields.
But even if you could use this annotation, extra analysis would still be required to prove that the field holds the same value for all instances. In most cases, the code which assigns the field needs to be fully inlined for the analysis to work. What if the Duration.ofMillis call were implemented to return a random number? Of course it isn't, but without that analysis, how could the compiler be certain?
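To illustrate the reflection backdoor: even a final instance field can be rewritten after construction, which is one reason the JIT cannot blindly treat instance finals as constants. (Holder is a hypothetical class made up for this sketch; note that on recent JDKs, record and hidden-class fields are exempt from this trick.)

```java
import java.lang.reflect.Field;

// A final instance field, assigned once in the constructor.
class Holder {
    final long divisor;
    Holder() { this.divisor = 100; }
}

public class ReflectionBackdoor {
    public static void main(String[] args) throws Exception {
        Holder h = new Holder();
        Field f = Holder.class.getDeclaredField("divisor");
        f.setAccessible(true);            // bypass language access checks
        f.setLong(h, 7);                  // mutate the "final" field
        System.out.println(f.getLong(h)); // prints 7, not 100
    }
}
```

If the JIT had baked 100 into compiled code as a constant, this mutation would silently be ignored, so the optimizer has to be conservative.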
CodePudding user response:
There are 2 optimizations at play here:
- Constant folding: the static final field is pre-computed and written directly into the code blob (the end result of JIT compilation). This translates into a performance win compared to a memory load (reading the field).
- Arithmetic simplification: when dividing by a potentially variable quantity, the compiler has to use a division instruction, which is very expensive. When dividing by a constant, the compiler can come up with a cheaper alternative. This is particularly true when dividing (and multiplying) by powers of 2, which can be simplified into shift instructions.
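The power-of-2 simplification can be sketched in plain Java (the values here are illustrative; the shift and multiply-by-reciprocal rewrites are what the JIT performs under the hood):

```java
public class DivByConstant {
    public static void main(String[] args) {
        long t = 123_456_789L;
        // For non-negative operands, dividing by a power of 2 is a right
        // shift; the JIT emits the shift (plus a sign fix-up for negative
        // inputs) instead of an expensive division instruction.
        System.out.println(t / 8 == (t >> 3));      // true
        // For other constants, such as BUCKET_SIZE_NS = 100_000_000, the
        // JIT replaces the division with a multiply-high-by-reciprocal
        // sequence, still far cheaper than a hardware divide.
        System.out.println(t / 100_000_000L);       // 1
    }
}
```

None of this is possible when the divisor is an ordinary instance field, because its value is not known at JIT-compile time.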
To look further into this, I would recommend running your benchmark with perfasm to see where the cycles went and what assembly code was generated.
Happy hunting!