I wanted to compare the execution time of my implementation of calculating a square root to the native Math.sqrt() in Node.js.
So I made some functions to test it:
function getDuration(func, argumentGenerator, multiplier = 1, argumentOffset = 0) {
    let args = []
    for (let i = 0; i < multiplier; i++) {
        args.push(argumentGenerator(i + argumentOffset))
    }
    let result = []
    const start = performance.now()
    for (let i = 0; i < multiplier; i++) {
        result.push(func(args[i]))
    }
    const end = performance.now()
    return {
        time: end - start,
        result: result,
    }
}
function measureTime(func, repeat = 1, argumentGenerator, multiplier = 1) {
    let result = []
    for (let i = 0; i < repeat; i++) {
        result.push(
            getDuration(func, argumentGenerator, multiplier, i * multiplier)
        );
    }
    return result
}
and the test subjects:
const indexArg = (i) => i * i + 1;
let functionsToTest = [
    [
        function BOTH(x) {
            return Math.sqrt(x) - NewtonSquareRoot(x)
        },
        1e3, indexArg, 1e4
    ],
    [
        NewtonSquareRoot,
        1e3, indexArg, 1e4
    ],
    [
        Math.sqrt,
        1e3, indexArg, 1e4
    ],
];
let results = {}
for (const fArg of functionsToTest) {
    let result = measureTime(...fArg)
    results[fArg[0].name] = MMMM(result.map(x => x.time))
}
console.table(results);
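For reference, sketches of the two helpers the snippet assumes (the exact implementations shouldn't matter for the question):

// Newton-Raphson square root (the implementation under test).
function NewtonSquareRoot(x) {
    let guess = x;
    // iterate x_{n+1} = (x_n + x / x_n) / 2; cap the iterations, since
    // exact floating-point convergence is not guaranteed
    for (let i = 0; i < 50 && guess * guess !== x; i++) {
        guess = (guess + x / guess) / 2;
    }
    return guess;
}

// Aggregates an array of timings into the stats shown in the tables below.
function MMMM(times) {
    const sorted = [...times].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return {
        min: sorted[0],
        max: sorted[sorted.length - 1],
        mean: times.reduce((a, b) => a + b, 0) / times.length,
        median: sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2,
    };
}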
And the result is:
┌──────────────────┬────────┬────────┬────────┬────────┐
│ (index) │ min │ max │ mean │ median │
├──────────────────┼────────┼────────┼────────┼────────┤
│ BOTH │ 0.9142 │ 3.8853 │ 1.3225 │ 1.2812 │
│ NewtonSquareRoot │ 0.9435 │ 2.515 │ 1.6164 │ 1.6612 │
│ sqrt │ 0.1026 │ 0.9474 │ 0.1225 │ 0.107 │
└──────────────────┴────────┴────────┴────────┴────────┘
I would expect the mean execution time of the "BOTH" function to be close to the sum of the other two, but it is even lower than that of my implementation alone.
I noticed that rearranging the functions results in a lower time for whichever function runs first:
┌──────────────────┬────────┬────────┬────────┬────────┐
│ (index) │ min │ max │ mean │ median │
├──────────────────┼────────┼────────┼────────┼────────┤
│ NewtonSquareRoot │ 0.8823 │ 3.3975 │ 1.2753 │ 1.2644 │
│ sqrt │ 0.1025 │ 0.8317 │ 0.1325 │ 0.1067 │
│ BOTH │ 1.1295 │ 2.443 │ 1.55 │ 1.541 │
└──────────────────┴────────┴────────┴────────┴────────┘
┌──────────────────┬────────┬────────┬────────┬────────┐
│ (index) │ min │ max │ mean │ median │
├──────────────────┼────────┼────────┼────────┼────────┤
│ sqrt │ 0.0294 │ 1.7835 │ 0.062 │ 0.0329 │
│ NewtonSquareRoot │ 0.9475 │ 3.354 │ 1.6375 │ 1.6593 │
│ BOTH │ 1.1293 │ 2.4975 │ 1.5394 │ 1.5409 │
└──────────────────┴────────┴────────┴────────┴────────┘
("BOTH" is still lower in the last result for some reason) But it seems to me that if the function isn't first, the results are consistent.
Why does that happen? How can I make it consistent for the first function as well?
I was thinking about adding a dummy run to skip the first execution:
measureTime((x) => x, 1, (i) => i, 32);
With that, the results are consistent, but it doesn't look like an elegant solution.
This only works if the dummy's iteration count is 32 or higher, and I am lost as to why.
Answer:
(V8 developer here.)
"Why does that happen?"
Because the func(args[i]) call gets inlined as long as it always calls the same function. Inlining avoids the call overhead, which is quite measurable for tiny functions.
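To illustrate with a contrived sketch (not your code; the name harness is made up): a shared call site that only ever sees one callee can be inlined, but once a second callee flows through it, the engine has to fall back to a real call:

function harness(f, x) {
    return f(x);  // one call site, shared by everything ever passed as `f`
}

harness(Math.sqrt, 2);        // call site is monomorphic: Math.sqrt can be inlined
harness(NewtonSquareRoot, 2); // a second target shows up: inlining no longer applies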
Of course, real code would have exactly one sqrt function, so inlining those calls is absolutely the right strategy for engines. It only becomes a "problem" in the artificial scenario that's typically created by simple benchmarking frameworks.
This is one of the most commonly encountered pitfalls of microbenchmarking.
"How can I make it consistent for the first function as well? I was thinking about adding a dummy run to skip the first execution."
Yes, that's the obvious solution. It's what we do in some of our own microbenchmarks (example). Keep in mind that this intentionally disables a particular optimization; whether results generated this way are meaningful at all depends on what you're interested in.
(Note: for the record, microbenchmarking pitfalls apply to our own tests just as much as to everyone else's. Care must be taken when drawing conclusions from the results. We always check generated machine code to make sure those microbenchmarks measure what we want them to measure! And they're not our primary tools for making decisions, they're just one of several techniques we employ to keep an eye on performance and catch unexpected regressions.)
As an alternative, you could copy the getDuration function (and maybe measureTime as well) for each function that you want to measure. That may feel even less elegant, but it's actually closer to what real-world code would do, because it doesn't need to circumvent engine heuristics.
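One hacky way to get such copies without copy-pasting source by hand (a sketch; it assumes, as is the case here, that getDuration doesn't close over any local state):

// Re-parse getDuration's source text so each candidate gets a textually
// distinct function, and with it its own call site and its own type feedback.
function freshGetDuration() {
    return eval('(' + getDuration.toString() + ')');
}

measureTime would then have to accept the copy as a parameter, or be copied the same way.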
The best solution would be to not rely on microbenchmarks at all. Instead, take a real application that performs lots of sqrt operations, measure its performance with Math.sqrt, then plug in your custom replacement and measure again. That way, you'd see the actual impact in practice, and there wouldn't be a risk of microbenchmarking-specific artifacts polluting your results (and possibly leading you to the wrong conclusion about which of two options is faster).
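A sketch of such a swap (crude, but enough for a quick comparison):

const nativeSqrt = Math.sqrt;
Math.sqrt = NewtonSquareRoot;  // every Math.sqrt call in the app now hits the replacement
// ... run and measure the application's real workload ...
Math.sqrt = nativeSqrt;        // restore the builtin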
"This only works if the dummy's iteration count is 32 or higher."
Another heuristic: V8 only starts collecting type/call feedback for code that has been running for a while. This improves performance and reduces memory consumption for functions that don't run much. Needless to say, I wouldn't rely on the threshold always being 32; it may change with versions and circumstances.
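If you want to experiment with warm-up behavior, Node can expose V8's test intrinsics. A sketch, with the caveat that these are unstable internals that change between versions:

// Run with: node --allow-natives-syntax bench.js
%PrepareFunctionForOptimization(NewtonSquareRoot);
NewtonSquareRoot(2);  // collect feedback at least once
%OptimizeFunctionOnNextCall(NewtonSquareRoot);
NewtonSquareRoot(2);  // now runs optimized code, regardless of warm-up counters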