How do you reason about fluctuations in benchmarking data?

Time:05-20

Suppose you're trying to optimize a function and using some benchmarking framework (like Google Benchmark) for measurement. You run the benchmarks on the original function 3 times and see average wall clock time/CPU times of 100 ms, 110 ms, 90 ms. Then you run the benchmarks on the "optimized" function 3 times and see 80 ms, 95 ms, 105 ms. (I made these numbers up). Do you conclude that your optimizations were successful?

Another problem I often run into is that I'll go do something else and run the benchmarks later in the day and get numbers that are further away than the delta between the original and optimized earlier in the day (say, 80 ms, 85 ms, 75 ms for the original function).

I know there are statistical methods to determine whether the improvement is "significant". Do software engineers actually use these formal calculations in practice?

I'm looking for some kind of process to follow when optimizing code.

CodePudding user response:

Rule of Thumb

  1. Minimum(!) of each series => 90ms vs 80ms
  2. Estimate noise => ~ 10ms
  3. Pessimism => It probably didn't get any slower.
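The rule of thumb above can be sketched in a few lines, using the made-up numbers from the question (the noise estimate here is just the spread within each series, which is one crude way to do it):

```python
# The question's made-up timings, in ms.
original = [100, 110, 90]
optimized = [80, 95, 105]

# 1. Minimum of each series.
best_orig = min(original)    # 90
best_opt = min(optimized)    # 80

# 2. Estimate noise: here, the larger within-series spread.
noise = max(max(original) - min(original),
            max(optimized) - min(optimized))

# 3. Pessimism: only claim a win if the gap clearly beats the noise.
if best_orig - best_opt > noise:
    print("probably faster")
else:
    print("can't tell yet -- take more measurements")
```

With these numbers the 10 ms gap does not beat the spread, so the pessimistic answer is "take more measurements".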

Not happy yet?

  1. Take more measurements. (~13 runs each)

  2. Interleave the runs. (Don't measure 13x A followed by 13x B.)

    Ideally you always randomize whether you run A or B next (scientific: randomized trial), but it's probably overkill. Any source of error should affect each variant with the same probability. (Like the CPU building up heat over time, or a background task starting after run 11.)

  3. Go back to step 1.
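A minimal interleaving harness might look like the sketch below. The workloads and run count are hypothetical; a real setup would use your benchmarking framework, but the scheduling idea is the same:

```python
import random
import time

def bench(fn):
    """Time one call of fn in seconds (crude; a real harness would
    do per-iteration timing via a benchmarking framework)."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def compare(fn_a, fn_b, runs=13):
    """Interleave runs of A and B in random order so that drift
    (CPU heat, a background task starting mid-session) hits both
    variants with the same probability."""
    schedule = ["A"] * runs + ["B"] * runs
    random.shuffle(schedule)              # randomized trial
    samples = {"A": [], "B": []}
    for which in schedule:
        fn = fn_a if which == "A" else fn_b
        samples[which].append(bench(fn))
    # Compare the minimum of each series, per the rule of thumb.
    return min(samples["A"]), min(samples["B"])

# Hypothetical example workloads:
slow = lambda: sum(range(100_000))
fast = lambda: sum(range(1_000))
a_min, b_min = compare(slow, fast)
```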

Still not happy? Time to admit that you've been nerd-sniped. The difference, if it exists, is so small that you can't even measure it. Pick the more readable variant and move on. (Or alternatively: lock your CPU frequency, isolate a core just for the test, quiet down your system...)

Explanation

  1. Minimum: Many people (and tools, even) take the average, but the minimum is statistically more stable. There is a lower limit on how fast your benchmark can run on given hardware, but no upper limit on how much it can be slowed down by other programs. Also, taking the minimum automatically drops the initial "warm-up" run.

  2. Noise: Apply common sense and just glance over the numbers. If you look at the standard deviation, be very skeptical of it! A single outlier will influence it so much that it becomes nearly useless. (It's usually not a normal distribution.)

  3. Pessimism: You were really clever to find this optimization, you really want the optimized version to be faster! If it looks better just by chance, you will believe it. (You knew it!) So if you care about being correct, you must counter this tendency.
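Points 1 and 2 are easy to demonstrate with illustrative numbers: one slow run (say, a background task firing) leaves the minimum untouched but wrecks the mean and standard deviation.

```python
import statistics

# Two identical series except for one outlier run. Illustrative numbers.
clean   = [90, 91, 90, 92, 91]
spoiled = [90, 91, 90, 92, 500]   # one run hit by a background task

# The minimum is unaffected by the outlier...
assert min(clean) == min(spoiled) == 90

# ...while the mean and standard deviation are dominated by it.
print(statistics.mean(clean), statistics.mean(spoiled))    # 90.8 vs 172.6
print(statistics.stdev(clean), statistics.stdev(spoiled))  # ~0.84 vs ~183
```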

CodePudding user response:

Do you conclude that your optimizations were successful?

No. Three runs are not enough, especially given the huge variation and the fact that the timings of the two groups overlap once merged and sorted.

For small timings like this, the first run should be removed and at least dozens of runs should be performed. I would personally use at least hundreds of runs.

Do software engineers actually use these formal calculations in practice?

Only very few developers do advanced statistical analysis. A formal analysis is often not needed when the gap between before and after the target optimization is huge and the variation within each group is small.

For example, if your program is twice as fast as before with a min-max variation of <5%, then you can quite safely say that the optimization is successful. That being said, even then the conclusion can occasionally be wrong due to unexpected external factors (though this is very rare when the gap is so big).

If the result is not obvious, then you need to do some basic statistics. You need to compute the standard deviation, the mean and the median time, remove the first run, interleave runs and use many runs (at least dozens). The distribution of the timings almost always follows a normal distribution due to the central limit theorem. It is sometimes a mixture distribution due to threshold effects (eg. caching). If you see some outliers in the timings, you can plot the values to check this easily.

If there are threshold effects, then you need to apply a more advanced statistical analysis, but this is complex to do and generally not an expected behaviour. It is usually a sign that the benchmark is biased, that there is a bug, or that there is a complex effect you would have to account for during the analysis of the results anyway. Thus, I strongly advise you to fix/mitigate the problem before analysing the results in that case.

Assuming the timings follow a normal distribution, you can just check whether the median is close to the mean and whether the standard deviation is small compared to the gap between the means.
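As a small sketch of that naive check, with made-up timing samples (ms, warm-up run already dropped):

```python
import statistics

# Hypothetical samples, in ms, first (warm-up) run already removed.
before = [101, 99, 103, 100, 98, 102, 100, 101, 99, 100, 102, 98]
after  = [80, 82, 79, 81, 80, 83, 78, 80, 81, 79, 82, 80]

for xs in (before, after):
    # Roughly normal: the median should sit close to the mean.
    assert abs(statistics.mean(xs) - statistics.median(xs)) < statistics.stdev(xs)

# Naive significance check: the gap between the means dwarfs the noise.
gap = statistics.mean(before) - statistics.mean(after)
noise = max(statistics.stdev(before), statistics.stdev(after))
looks_significant = gap > 3 * noise
```

The `3 *` factor is an arbitrary safety margin for this sketch, not a formal threshold.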

A more formal way to do that is to compute a Student's t-test and its associated p-value, and check the significance of the p-value (eg. <5%). If there are more than two groups, an ANOVA can be used. If you are unsure about the distribution, you can apply non-parametric statistical tests like the Wilcoxon and Kruskal-Wallis tests (note that the statistical power of these tests is not the same). In practice, doing such a formal analysis is time-consuming and generally not much more useful than a naive basic check (using the mean and standard deviation), unless your modification impacts a lot of users or you plan to write research papers.
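For reference, here is a stdlib-only sketch of Welch's (unequal-variance) t-test on the same kind of made-up samples. The p-value uses a normal approximation to the t-distribution, which is reasonable for dozens of runs; in practice you would just call `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics
from statistics import NormalDist

def welch_t_test(a, b):
    """Welch's t statistic and an approximate two-sided p-value
    (normal approximation instead of the exact t-distribution)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))   # standard error of the gap
    t = (ma - mb) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))      # two-sided
    return t, p

# Hypothetical samples, in ms:
before = [101, 99, 103, 100, 98, 102, 100, 101, 99, 100, 102, 98]
after  = [80, 82, 79, 81, 80, 83, 78, 80, 81, 79, 82, 80]
t, p = welch_t_test(before, after)
# p < 0.05 => significant at the 5% level.
```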

Keep in mind that using a good statistical analysis does not prevent biased benchmarks. You need to minimize the external factors that can cause biased results. One frequent bias is frequency scaling: the first benchmark can be faster than the second because of turbo-boost, or it can be slower because the processor takes some time to reach a high frequency. Caches also play a huge role in benchmark biases. There are many other factors that can cause biases in practice: compiler/runtime versions, environment variables, configuration files, OS/driver updates, memory alignment, OS paging (especially on NUMA systems), the hardware (eg. thermal throttling), software bugs (it is not rare to find bugs by analysing strange performance behaviours), etc.

As a result, it is critical to make benchmarks as reproducible as possible, by fixing versions and reporting the environment parameters (and possibly running the benchmarks in a sandbox if you are paranoid and it does not affect the timings too much). Software like Nix/Spack helps with packaging, and containers like LXD or Docker can help provide a more reproducible environment.

Many big software teams use automated benchmarking to check for performance regressions. Tools can run the benchmarks properly and do the statistical analysis for you regularly. A good example is the Numpy team, which uses a package called Airspeed Velocity (see the results). The PyPy team also designed their own benchmarking tool. The Linux kernel also has benchmark suites to check for regressions (eg. PTS), and many companies focused on performance have such automated benchmarking tools (often home-made). There are many existing tools for that.

For more information about this topic, take a look at the great Performance Matters presentation by Emery Berger.
