Towards Robust Performance Measurement
When measuring the performance impact of changes to Swift, the CI bots run the benchmark suite twice: once to get a baseline on the current tip of the tree and again on the branch from the PR. For this, the Benchmark_Driver runs a set of 334 performance tests (some are excluded because they are considered unstable) and gathers 20 samples per benchmark for the optimized binary and 5 samples per benchmark for the unoptimized binary. At roughly one second per sample, this takes a minimum of 4.6 hours to execute the benchmarks (not counting project compilation). As a result, full benchmarks are requested rather rarely, presumably to avoid overburdening the CI infrastructure. Recently, reviewers have switched to asking Swift CI to "please smoke benchmark", which gathers only 3 samples per optimization level and reports benchmark results in about an hour.

Performance test results reported by CI on GitHub often show false regressions and improvements, forcing reviewers to make frequent judgment calls. Even though 137 benchmarks are excluded from the pre-commit test because they were considered unstable, the Swift benchmark suite does not appear to be exactly stable…
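For a rough sense of where the 4.6-hour and one-hour figures come from, here is the back-of-the-envelope arithmetic, assuming ~1 second per sample, both the baseline and the PR branch being measured, and two optimization levels in the smoke run:

```swift
// Back-of-the-envelope estimate of CI benchmark run time.
// Assumptions: ~1 second per sample, suite run twice (baseline + PR branch),
// two optimization levels in the smoke run.
let benchmarks = 334
let runs = 2                              // tip of tree + PR branch

let fullSamples = 20 + 5                  // optimized + unoptimized samples per test
let fullSeconds = runs * benchmarks * fullSamples
print(Double(fullSeconds) / 3600)         // 16,700 s ≈ 4.6 hours

let smokeSamples = 3 + 3                  // 3 samples per optimization level
let smokeSeconds = runs * benchmarks * smokeSamples
print(Double(smokeSeconds) / 3600)        // 4,008 s ≈ 1.1 hours
```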
Sometimes improved compiler optimizations kick in and eliminate the main workload of an incorrectly written benchmark, but nobody notices for long stretches of time. Usually the non-zero setup work masks the problem, because it prevents the measured time from dropping to zero. There is no publicly visible tracking of performance results from the benchmark suite over time that would help prevent this issue, either.
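To illustrate how a benchmark dies this way, here is a minimal sketch (not an actual test from the suite) of a loop the optimizer is free to remove because its result is never observed, next to a variant that feeds the result into an opaque sink; the consume(_:) helper is a hypothetical stand-in for the sink helpers the benchmark suite provides for this purpose:

```swift
// Hypothetical micro-benchmark sketch, not taken from the Swift benchmark suite.

// Opaque sink: @inline(never) is meant to keep the call from being folded away,
// so the work that produces its argument has to be performed.
@inline(never)
func consume<T>(_ value: T) { _ = value }

// Broken: the sum is never used, so the optimizer may delete the loop entirely.
// Only the non-zero setup cost remains, which masks the fact that the measured
// time no longer reflects the intended workload.
func runBroken(iterations: Int) {
    let data = Array(0..<10_000)          // setup work
    for _ in 0..<iterations {
        _ = data.reduce(0, &+)            // result discarded: candidate for elimination
    }
}

// Fixed: routing the result through the opaque sink keeps the workload alive.
func runFixed(iterations: Int) {
    let data = Array(0..<10_000)
    for _ in 0..<iterations {
        consume(data.reduce(0, &+))
    }
}
```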
Pavol Vaskovic (via Erica Sadun):
Given the fact that the Swift Benchmark Suite is a set of microbenchmarks, we are measuring effects that manifest in microseconds. We can significantly increase the robustness of our measurement process using statistical methods. A necessary prerequisite is having a representative sample population of reasonable size. From the experiment analyzed in the previous sections, it is apparent that we can make the measurement process resilient to the effects of varying system load if the benchmarked workload stays in the range of hundreds of milliseconds, up to a few thousand. Above that, it becomes impossible to separate the signal from the noise on a heavily contested CPU.
By keeping the run time small, it takes less time to gather enough samples, and their quality is better. By staying well under the 10-millisecond time slice, we get more pristine samples, and the samples that were interrupted by context switching are easier to identify. Excluding outliers makes our measurement more robust.
After these resilience preconditions are met, we can speed up the whole measurement process by running it in parallel on all available CPU cores. If we gather 10 independent one-second measurements on a 10-core machine, we can run the whole Benchmark Suite in 500 seconds, while having much better confidence in the statistical significance of the reported results!
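One concrete way to make the reported result resistant to interference, along the lines of the outlier exclusion mentioned above, is to discard samples that land far outside the interquartile range before aggregating. A minimal sketch follows; the 1.5 × IQR fence and the quartile computation are conventional choices, not necessarily what Benchmark_Driver actually does:

```swift
// Sketch of outlier filtering on raw per-sample times (e.g. in microseconds).
// The 1.5 × IQR fence is a common convention, assumed here for illustration.
func excludeOutliers(_ samples: [Double], fence: Double = 1.5) -> [Double] {
    let sorted = samples.sorted()
    guard sorted.count >= 4 else { return sorted }
    let q1 = sorted[sorted.count / 4]
    let q3 = sorted[(sorted.count * 3) / 4]
    let upper = q3 + fence * (q3 - q1)
    // Samples interrupted by context switching show up as high outliers,
    // so only the upper fence matters for run times.
    return sorted.filter { $0 <= upper }
}

let raw = [1021.0, 1018, 1025, 1019, 2840, 1022, 1017]   // one preempted sample
let clean = excludeOutliers(raw)
print(clean.min() ?? 0, clean.count)                      // 1017.0 6
```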
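And a sketch of the parallelization idea from the last paragraph, assuming each benchmark's measurement is independent and only about a second long; the in-process scheduling below is illustrative, not how Benchmark_Driver actually distributes work:

```swift
import Foundation
import Dispatch

// Illustrative sketch: run independent, short benchmark measurements concurrently.
// The `name`/`measure` pairs are hypothetical placeholders for individual benchmarks.
func measureAll(_ benchmarks: [(name: String, measure: () -> Double)]) -> [String: Double] {
    var results = [String: Double](minimumCapacity: benchmarks.count)
    let lock = NSLock()
    // concurrentPerform spreads the iterations across the available CPU cores.
    DispatchQueue.concurrentPerform(iterations: benchmarks.count) { index in
        let benchmark = benchmarks[index]
        let time = benchmark.measure()
        lock.lock()
        results[benchmark.name] = time
        lock.unlock()
    }
    return results
}
```

This only pays off once the earlier preconditions hold: with short workloads and outlier exclusion, the cross-core contention that parallel runs introduce can be filtered out instead of corrupting the results.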