I have a question relating to the pow() function in Java 17's new Vector API (incubator) feature. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty obtaining the same performance as the scalar implementation.
The code is as follows:
- I create an array of doubles (currently, just 5.0)
- I loop over elements of that array (different looping syntax for scalar and vector)
- I create DoubleVectors from the double arrays inside the loop and do the calculations (or just the calculations for the scalar version). I am trying to compute e^(value), and I believe that is the problem.
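For reference, here is a minimal scalar sketch of the call-price formula being vectorized. The CDF helper is an assumption on my part (the snippets below elide it), using the Abramowitz-Stegun polynomial approximation since java.lang.Math has no erf:

```java
public class BlackScholesScalarSketch {
    // Standard normal CDF via the Abramowitz-Stegun 26.2.17 polynomial
    // approximation (absolute error below ~7.5e-8).
    static double cdf(double x) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double tail = Math.exp(-x * x / 2.0) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0 ? 1.0 - tail : tail;
    }

    // Black-Scholes call price; k * e^(-r * t) is the discount term
    // that corresponds to the vE.pow(powerOperand) step in the vector code.
    static double call(double s, double k, double r, double sigma, double t) {
        double d1 = (Math.log(s / k) + (r + sigma * sigma / 2.0) * t)
                / (sigma * Math.sqrt(t));
        double d2 = d1 - sigma * Math.sqrt(t);
        return s * cdf(d1) - k * Math.exp(-r * t) * cdf(d2);
    }

    public static void main(String[] args) {
        // Textbook inputs: S=100, K=100, r=5%, sigma=20%, T=1y
        System.out.println(call(100, 100, 0.05, 0.2, 1.0)); // ~10.45
    }
}
```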
Here are some code snippets:
public static double[] createArray(int arrayLength)
{
    double[] array0 = new double[arrayLength];
    for (int i = 0; i < arrayLength; i++)
    {
        array0[i] = 2.0;
    }
    return array0;
}
@Param({"256000"})
int arraySize;
public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
DoubleVector vectorTwo = DoubleVector.broadcast(SPECIES,2);
DoubleVector vectorHundred = DoubleVector.broadcast(SPECIES,100);
double[] scalarTwo = new double[]{2,2,2,2};
double[] scalarHundred = new double[]{100,100,100,100};
@Setup
public void Setup()
{
javaSIMD = new JavaSIMD();
javaScalar = new JavaScalar();
spotPrices = createArray(arraySize);
timeToMaturity = createArray(arraySize);
strikePrice = createArray(arraySize);
interestRate = createArray(arraySize);
volatility = createArray(arraySize);
e = new double[arraySize];
for (int i = 0; i < arraySize; i++)
{
e[i] = Math.exp(1);
}
upperBound = SPECIES.loopBound(spotPrices.length);
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testVectorPerformance(Blackhole bh) {
var upperBound = SPECIES.loopBound(spotPrices.length);
for (var i = 0; i < upperBound; i += SPECIES.length())
{
bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices,timeToMaturity,strikePrice,
interestRate,volatility,e, i));
}
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testScalarPerformance(Blackhole bh) {
for (int i = 0; i < arraySize; i++)
{
bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices,timeToMaturity,strikePrice,
interestRate,volatility, i,normDist));
}
}
public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
                                                    double[] interestRate, double[] volatility, double[] e, int i) {
    ...(skip lines)
    DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
    ...(skip lines)
    DoubleVector powerOperand = vRateScaled
            .mul(vTime)
            .neg();
    DoubleVector call = (vSpot
            .mul(CDFVectorizedExcelOptimized(d1, vE)))
            .sub(vStrike
                    .mul(vE
                            .pow(powerOperand))
                    .mul(CDFVectorizedExcelOptimized(d2, vE)));
    return call;
}
Here are some JMH benchmarks (2 forks, 2 warmups, 2 iterations) on a Ryzen 5800X using WSL. Overall, the vector version seems ~2x slower than the scalar version. I also ran a simple before/after timing of the method separately, without JMH, and the results seem in line with these.
Result "blackScholes.TestJavaPerf.testScalarPerformance":
0.116 ±(99.9%) 0.002 ops/ms [Average]
89873915287 cycles:u # 4.238 GHz (40.43%)
242060738532 instructions:u # 2.69 insn per cycle
Result "blackScholes.TestJavaPerf.testVectorPerformance":
0.071 ±(99.9%) 0.001 ops/ms [Average]
90878787665 cycles:u # 4.072 GHz (39.25%)
254117779312 instructions:u # 2.80 insn per cycle
I also enabled diagnostic options for the JVM. I see the following:
"-XX: UnlockDiagnosticVMOptions", "-XX: PrintIntrinsics","-XX: PrintAssembly"
0x00007fe451959413: call 0x00007fe451239f00 ; ImmutableOopMap {rsi=Oop }
;*synchronization entry
; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
; {runtime_call counter_overflow Runtime1 stub}
0x00007fe451959418: jmp 0x00007fe4519593ce
0x00007fe45195941a: movabs $0x7fe4519593ee,%r10 ; {internal_word}
0x00007fe451959424: mov %r10,0x358(%r15)
0x00007fe45195942b: jmp 0x00007fe451193100 ; {runtime_call SafepointBlob}
0x00007fe451959430: nop
0x00007fe451959431: nop
0x00007fe451959432: mov 0x3d0(%r15),%rax
0x00007fe451959439: movq $0x0,0x3d0(%r15)
0x00007fe451959444: movq $0x0,0x3d8(%r15)
0x00007fe45195944f: add $0x40,%rsp
0x00007fe451959453: pop %rbp
0x00007fe451959454: jmp 0x00007fe451231e80 ; {runtime_call unwind_exception Runtime1 stub}
0x00007fe451959459: hlt
<More halts cut off>
[Exception Handler]
0x00007fe451959460: call 0x00007fe451234580 ; {no_reloc}
0x00007fe451959465: movabs $0x7fe46e76df9a,%rdi ; {external_word}
0x00007fe45195946f: and $0xfffffffffffffff0,%rsp
0x00007fe451959473: call 0x00007fe46e283d40 ; {runtime_call}
0x00007fe451959478: hlt
[Deopt Handler Code]
0x00007fe451959479: movabs $0x7fe451959479,%r10 ; {section_word}
0x00007fe451959483: push %r10
0x00007fe451959485: jmp 0x00007fe4511923a0 ; {runtime_call DeoptimizationBlob}
0x00007fe45195948a: hlt
<More halts cut off>
--------------------------------------------------------------------------------
============================= C2-compiled nmethod ==============================
** svml call failed for double_pow_32
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 2 java.lang.Math::pow (6 bytes) (intrinsic)
Investigations/Questions:
- I'm writing different implementations of the formula, so they are not 1:1; could this be the cause? Looking at the number of instructions according to JMH, there is roughly a 12 billion difference in instruction count, and with vectorization the processor also runs at a lower clock rate.
- Is the choice of input numbers a problem? I've tried i + 10/(array.length) as well.
- Is there a reason the SVML call fails for double_pow_32? I don't see this problem for smaller input array sizes, by the way.
- I changed the pow to mul (for both; obviously the equation is now very different), and it is much faster as a result, with scalar vs. vector results as expected.
Note: I believe it is using 256-bit-wide vectors (checked during debugging).
CodePudding user response:
This might be related to JDK-8262275, Math vector stubs are not called for double64 vectors
For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code.
But we do have svml double64 vectors.
You might try alternative operations: instead of vE.pow(powerOperand), with vE being a vector of e, you can use powerOperand.lanewise(VectorOperators.EXP) to compute e^x for all lanes.
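A quick scalar sanity check that the substitution is value-preserving, since pow(e, x) and exp(x) compute the same thing per lane; the vector call itself is shown only in a comment because compiling it needs --add-modules jdk.incubator.vector:

```java
public class ExpVsPowSketch {
    public static void main(String[] args) {
        // A discount exponent like the one in the formula, e.g. -r * t
        double x = -0.05 * 1.0;

        double viaPow = Math.pow(Math.E, x); // what vE.pow(powerOperand) computes per lane
        double viaExp = Math.exp(x);         // what lanewise(VectorOperators.EXP) computes per lane

        System.out.println(viaPow + " vs " + viaExp);

        // Vector equivalent (requires jdk.incubator.vector):
        // DoubleVector discount = powerOperand.lanewise(VectorOperators.EXP);
    }
}
```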
Keep in mind that this API is a work in progress in incubator state…