Proper benchmarking is extremely difficult and time-consuming. Unfortunately, the incentives favour mass-producing benchmarks for ad views with no focus on quality.
The PTS benchmarks come in three styles, and comparisons mix between them:
- Benchmarks that use distro provided packages.
- Benchmarks that use pre-compiled binaries.
- Benchmarks that download and compile a program (at a fixed version) locally to run, ignoring the benefits of a rolling release.
Each has its own challenges and is useful for different purposes. One would expect a distro comparison to use type 1 benchmarks, and possibly type 2 only for gaming (where games are generally shipped as binaries via Steam). Many of the tests in the PTS are type 3.
Some simple issues with benchmarks include:
- Relying solely on a single test number without validation. A lack of validation also makes it difficult to compare different versions of the software.
- Interpreting results requires understanding the underlying benchmark to draw accurate conclusions. Oddities in the results also need to be investigated further.
- Bias in, and weighting of, the overall takeaway results.
A few examples (not all were in this particular set, but they're ones I'm familiar with, and the issues aren't uncommon):
The zstd benchmark actually uses the system-provided zstd. It does no validation of differences in compressed size, only the time taken. So the best way to improve 'performance' on the test would be to patch zstd so that the default compression level is the lowest and the maximum number of threads is used (both reduce the compression ratio; CL patches zstd to use maximum threads by default). Neither of these is an optimization in the normal sense; they are changes to default behaviour that users can easily control on the command line. The commentary usually puts the difference down to CL's higher compiler optimizations.
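As a sketch of what validation could look like (the corpus file here is hypothetical, and this assumes the standard `zstd` CLI): record the compressed size alongside the time, so a change in default level or threading shows up as a validation difference rather than a 'win'.

```python
import os
import subprocess
import tempfile
import time

def bench_zstd(input_file, extra_args=()):
    """Time a zstd run but also record the output size for validation.

    PTS keeps only the elapsed time; keeping the size too means a
    patched default level or thread count can't masquerade as a
    compiler optimization.
    """
    out = tempfile.NamedTemporaryFile(suffix=".zst", delete=False).name
    start = time.perf_counter()
    subprocess.run(["zstd", "-f", *extra_args, input_file, "-o", out], check=True)
    elapsed = time.perf_counter() - start
    size = os.path.getsize(out)
    os.unlink(out)
    return elapsed, size

# Hypothetical corpus: compare the stock defaults with the 'gamed' settings.
t_def, s_def = bench_zstd("corpus.tar")
t_fast, s_fast = bench_zstd("corpus.tar", ["-1", "-T0"])  # lowest level, all threads
print(f"default: {t_def:.2f}s, {s_def} bytes; gamed: {t_fast:.2f}s, {s_fast} bytes")
```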
The lame benchmark downloads and compiles lame, then encodes a file. The issue is that the build system provides no default flags, so the benchmark not only compiles the program with flags that don't represent the distribution, it compiles at the equivalent of `-O0` (no optimization). CL avoids this by exporting its CFLAGS to the environment (including `-fassociative-math`, which I've not seen used in the distribution itself). So the benchmark compares an unoptimized build on other distros to an optimized one on CL, yet it has been used in comparison tests with CL for ages. It makes no sense as a benchmark because 1. lame isn't even included in CL, and 2. the compared builds bear no relation to the performance of the distro. And as I said before, that one result adds 10% to the overall performance improvement.
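To get a feel for the size of that gap, here's a minimal sketch (a toy hot loop standing in for lame, with `cc` assumed to be GCC or clang) comparing a build that gets no flags from the build system with one that gets optimization via exported CFLAGS:

```python
import subprocess
import tempfile
import textwrap
import time

# A toy hot loop standing in for the encoder; real CL CFLAGS also include
# math flags such as -fassociative-math.
SRC = textwrap.dedent("""
    #include <stdio.h>
    int main(void) {
        double acc = 0.0;
        for (long i = 1; i < 200000000; i++)
            acc += 1.0 / (double)i;
        printf("%f\\n", acc);
        return 0;
    }
""")

def build_and_time(cflags):
    """Compile the toy program with the given flags and time one run."""
    src = tempfile.NamedTemporaryFile(suffix=".c", delete=False)
    src.write(SRC.encode())
    src.close()
    exe = src.name + ".bin"
    subprocess.run(["cc", *cflags, src.name, "-o", exe], check=True)
    start = time.perf_counter()
    subprocess.run([exe], check=True, capture_output=True)
    return time.perf_counter() - start

print("no flags (-O0 equivalent):", build_and_time([]))
print("exported CFLAGS (-O3)    :", build_and_time(["-O3"]))
```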
The numpy benchmark includes the line

```python
shelllines = ['#!/bin/sh', 'export OMP_NUM_THREADS=1', 'cd `dirname $0`'] + shelllines
```

which fixes the thread count to one when OpenMP is used. To get better results in the benchmark, not using OpenMP in BLAS can actually be beneficial (the setting penalizes OpenMP use). So, once again, working around the benchmark is what improves PTS results.
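A quick sketch of that effect, assuming a numpy linked against an OpenMP-threaded BLAS (the matrix size is arbitrary); the variable has to be set before the child process imports numpy, since the threading layer reads it at initialization:

```python
import os
import subprocess
import sys
import textwrap

# The timed workload runs in a child process because OMP_NUM_THREADS is
# read when the BLAS threading layer initializes, not afterwards.
SNIPPET = textwrap.dedent("""
    import time
    import numpy as np
    a = np.random.rand(2000, 2000)
    t = time.perf_counter()
    a @ a
    print(f"{time.perf_counter() - t:.3f}s")
""")

for threads in ("1", None):  # "1" mimics the benchmark; None is the normal default
    env = dict(os.environ)
    env.pop("OMP_NUM_THREADS", None)
    if threads:
        env["OMP_NUM_THREADS"] = threads
    result = subprocess.run([sys.executable, "-c", SNIPPET], env=env,
                            capture_output=True, text=True, check=True)
    print(f"OMP_NUM_THREADS={threads or 'unset'}: {result.stdout.strip()}")
```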
Most of the encoder/compression tests download and compile the programs locally. This means that on all distributions they are compiled with `-march=native`, and the defaults are usually `-O2`/`-O3`. Under valgrind this can show less than 2% of the benchmark time spent in libraries shipped with the distribution. These cases bypass much of the optimization work done in CL, which provides AVX2/AVX512-optimized versions and PGO-built packages. Most of these tests fell out of favour because they didn't show much difference between distros, despite there being one!
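The kind of check I mean can be sketched like this (assumes valgrind with callgrind is installed; the callgrind-format parsing is deliberately naive and only approximates self-cost per object, so treat the shares as rough):

```python
import collections
import subprocess

def object_shares(cmd, outfile="callgrind.out.sketch"):
    """Run a command under callgrind and approximate what share of the
    executed instructions each object (binary/shared library) accounts for."""
    subprocess.run(["valgrind", "--tool=callgrind", "--compress-strings=no",
                    f"--callgrind-out-file={outfile}", *cmd], check=True)
    costs = collections.Counter()
    current_ob = None
    skip_next = False  # the cost line after 'calls=' is the callee's inclusive cost
    for line in open(outfile):
        if line.startswith("ob="):
            current_ob = line[3:].strip()
        elif line.startswith("calls="):
            skip_next = True
        elif line[:1].isdigit() or line[:1] in "+-*":  # a position/cost line
            if skip_next:
                skip_next = False
            elif current_ob:
                costs[current_ob] += int(line.split()[1])
    total = sum(costs.values()) or 1
    return {ob: cost / total
            for ob, cost in sorted(costs.items(), key=lambda kv: -kv[1])}

# Illustrative: how much time does a locally built encoder spend in distro libs?
for ob, share in object_shares(["./flac", "--best", "input.wav"]).items():
    print(f"{share:6.1%}  {ob}")
```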
For improvement: better benchmarks that actually reflect distro performance, validation (something I don't think the PTS can do), and actual analysis of the results. I previously threw together some really ugly scripts that could run the benchmarks, store one value for the result and one for validation (say, file size for compression), and do a run under valgrind for later analysis. When doing a flac benchmark with GCC vs clang, the valgrind output made it obvious that clang couldn't hit the SSE4- and AVX-optimized functions that GCC could (due to the style in which they were written). I believe it has been fixed in git, but it showed why GCC gave better results, beyond simply 'optimizing' better.
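For what it's worth, a minimal sketch of that style of harness (the zstd example and file names are illustrative, not my original scripts):

```python
import json
import os
import subprocess
import time

def run_benchmark(name, cmd, validate, runs=3, valgrind=False):
    """Run a benchmark several times, storing both the timings and a
    validation value (e.g. output size for compression), plus an optional
    callgrind run for later analysis."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    if valgrind:
        subprocess.run(["valgrind", "--tool=callgrind",
                        f"--callgrind-out-file={name}.callgrind", *cmd],
                       check=True, capture_output=True)
    record = {"name": name, "times": times, "validation": validate()}
    with open(f"{name}.json", "w") as f:
        json.dump(record, f, indent=2)
    return record

# Two results are only comparable if the validation values match.
run_benchmark("zstd", ["zstd", "-f", "corpus.tar", "-o", "corpus.tar.zst"],
              validate=lambda: os.path.getsize("corpus.tar.zst"),
              valgrind=True)
```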