Phoronix - updated benchmarks

https://www.phoronix.com/review/intel-ubuntu2404-fedora40
Here are some fresh benchmarks looking at how Ubuntu 24.04 LTS and Fedora Workstation 40 are competing with Intel’s in-house Clear Linux distribution that offers aggressive x86_64 Linux performance defaults and the best possible out-of-the-box Linux performance on modern x86_64 hardware.

2 Likes

Congratulations to the CL team! :smiley:

2 Likes

Thanks for posting… Copied your link to the CL performance thread :wink:

1 Like

Something not mentioned by reviewers is that Clear Linux defaults to CGROUPSv1 whereas most other Linux distributions default to CGROUPSv2. Some tests perform better with CGROUPSv1.

For a fairer comparison with Clear Linux, systemd.unified_cgroup_hierarchy=0 is the kernel command-line option that selects CGROUPSv1. Moreover, CONFIG_CGROUP_RDMA is disabled in Clear, hence the cgroup_disable=rdma argument below:

systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma
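
For anyone wanting to try this, here is a minimal sketch of applying the boot arguments on a GRUB-based Ubuntu install (adapt the flag line to your own /etc/default/grub):

$ grep CONFIG_CGROUP_RDMA /boot/config-$(uname -r)   # is the RDMA cgroup controller built in?

# append the options to GRUB_CMDLINE_LINUX_DEFAULT, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma"
$ sudoedit /etc/default/grub
$ sudo update-grub
$ sudo reboot

# after rebooting, verify the arguments took effect and which hierarchy is mounted
$ cat /proc/cmdline
$ stat -fc %T /sys/fs/cgroup     # cgroup2fs = v2 unified, tmpfs = v1/hybrid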

I ran schbench five times while computing prime numbers on the CPU. Notice the difference between CGROUPSv2 and CGROUPSv1.
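
The background load is not shown here; as a rough stand-in, something like the following reproduces the idea of running schbench while the CPU is busy with prime-number work (the factor loop is only an illustrative CPU hog, not the actual workload used):

$ while :; do factor 2305843009213693951 > /dev/null; done &   # keep a core busy factoring a large prime
$ for i in 1 2 3 4 5; do ./schbench; done                      # five schbench runs with defaults
$ kill %1                                                      # stop the background load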

Ubuntu 24.04 CGROUPSv2 (stock kernel, no tuning)

$ ./schbench
Latency percentiles (usec)
        50.0th:   1001     931     855     947     899
        75.0th:   2828    2420    2428    2564    2468
        90.0th:   4648    3924    3956    4036    3940
        95.0th:   6552    5352    5496    5560    5240
       *99.0th:  10448    9136    8688    9264    8208
        99.5th:  11792   10832    9616   11024    9680
        99.9th:  14704   14096   13232   14640   12976
           max:  18364   20105   17489   19906   20140

Ubuntu 24.04 CGROUPSv2 (stock kernel, with tuning)

$ ./schbench
Latency percentiles (usec)
        50.0th:    651     639     719     749     619
        75.0th:   2212    2124    2212    2276    2156
        90.0th:   3404    3172    3492    3700    3236
        95.0th:   4824    4520    4856    5000    4520
       *99.0th:   7336    6664    7272    7496    6888
        99.5th:   8304    7592    8656    8464    7976
        99.9th:  11888   10928   11792   10672   10832
           max:  14881   14081   16048   12195   13832

Ubuntu 24.04 CGROUPSv1 (stock kernel, with tuning including boot args: systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma)

$ ./schbench
Latency percentiles (usec)
        50.0th:     31      31      30      31      31
        75.0th:    803     851     785     779     825
        90.0th:   2140    2164    2132    2132    2148
        95.0th:   2692    2732    2684    2676    2676
       *99.0th:   4092    4184    4184    3988    4020
        99.5th:   4920    4872    4904    4776    4020
        99.9th:   5976    5800    5832    5864    5848
           max:   8832    8388    7650    8824    7505

Ubuntu 24.04 CGROUPSv1 (rebuild kernel with config changes and -march=x86-64-v3 plus tuning including boot args: systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma)

$ ./schbench
Latency percentiles (usec)
        50.0th:     25      25      25      26      26
        75.0th:    675     683     575     657     657
        90.0th:   1938    2034    1934    1914    1950
        95.0th:   2532    2572    2524    2532    2556
       *99.0th:   3940    3988    3972    3956    3956
        99.5th:   4680    4824    4664    4824    4728
        99.9th:   5560    5592    5528    5560    5512
           max:   7647    7376    7749    7062    6914

Clear Linux CGROUPSv1 (ClearMod kernel, with tuning)

$ ./schbench
Latency percentiles (usec)
        50.0th:     28      29      29      29      29
        75.0th:    941     955     879     955     959
        90.0th:   1738    1726    1710    1738    1758
        95.0th:   2364    2380    2324    2340    2356
       *99.0th:   3444    3460    3348    3228    3348
        99.5th:   3764    3772    3708    3668    3668
        99.9th:   4984    4840    4552    4568    4552
           max:   7022    7223    6458    7061    6447

Interesting!

How did you tweak Ubuntu to get such improvements? Is the parameter you were talking about a kernel argument?
I would like to try it on Fedora.

Thanks !

It’s surprising to see that the Powersave option can compete with the Performance one. I’d like more information about this (watt usage, default power profile, etc.) but I cannot find anything precise.

Yes, they are the kernel arguments mentioned earlier. BTW, I added two more sections to the list.

  • Ubuntu stock kernel, no tuning
  • Rebuild kernel with config changes and -march=x86-64-v3

Great! Thanks!

Probably too technical for me, but very interesting. Compiling the kernel seems too hard for me.

Any way for you to benchmark this on Fedora?

If I want to test it myself, do I only have to edit the kernel parameters with:
systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma

and then rebuild my kernel? It seems so “easy” that I am afraid I have not understood correctly!

Greetings from France!

Ogu

When building your kernel, only change the argument from -march=native to -march=x86-64-v3. More info here: Kernel/Traditional compilation - ArchWiki
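
Roughly, a traditional rebuild looks like the sketch below. Passing the flag through KCFLAGS is just one common way to do it and is my assumption here, not necessarily how the kernels above were built; the other config changes are not shown either.

$ cp /boot/config-$(uname -r) .config
$ make olddefconfig
$ make menuconfig                                # apply any desired config changes here
# on Ubuntu-derived configs you may also need:
#   scripts/config --disable SYSTEM_TRUSTED_KEYS --disable SYSTEM_REVOCATION_KEYS
$ make -j"$(nproc)" KCFLAGS="-march=x86-64-v3"
$ sudo make modules_install
$ sudo make install
$ sudo reboot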

Yup, it really does sound easy. I don’t know how kernel compilation is on Fedora, though.

Is it possible to run the same benchmarks on a cgroups v2 system? Add these kernel parameters to fully block cgroups v1 operation:

systemd.unified_cgroup_hierarchy=true
systemd.legacy_systemd_cgroup_controller=false
cgroup_no_v1=all

There may be resource control configuration (cgroup-related) differences between Ubuntu 24.04 and CL that become evident when comparing cgroups v1 vs cgroups v2. But for a clearer view we need a test condition for CL under cgroups v2.
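
After rebooting with those parameters, something like this should confirm that only the unified (v2) hierarchy is active:

$ cat /proc/cmdline                      # the three options should appear here
$ mount | grep cgroup                    # only a single cgroup2 mount should remain
$ cat /sys/fs/cgroup/cgroup.controllers  # controllers available in the unified hierarchy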

There is a peer-reviewed paper showing that, at least in terms of network latency, cgroups v2 is better than cgroups v1, and that this can be attributed to a better implementation:

We verify the claim that cgroups v2 has a more efficient implementation by measuring the number of instructions executed with the Linux performance analysis tool perf [18]. The measurement includes the container startup, 60 s packet forwarding at 1.52 Mpkt/s, and the shutdown. The experiment is repeated three times, taking the average value. The resulting data is presented in Table 1 and in the reproduction collection. We observe that for cgroups v1, the number of instructions executed is about 2.4 % higher, and about 2.2 % more conditional branches are executed compared to v2. The difference in process migrations of 148 in v2 and 297 in v1 is noteworthy. Given that we disabled scheduler load balancing, this finding is unexpected.

Links: https://www.perplexity.ai/search/is-cgroup-v2-slower-than-cgrou-UQTPxNw.SbOt5KCdQ6VjnQ
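
For anyone who wants to repeat that kind of comparison locally, a rough approximation with perf stat would be the following (just a sketch of the idea, not the paper's methodology):

$ perf stat -e instructions,branches,branch-misses,context-switches ./schbench

Run it once booted with cgroups v1 and once with cgroups v2, then compare the counts.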

Clear Linux (ClearMod kernel, with tuning, HZ_800). Results are from algorithm3 and schbench running simultaneously.

CGROUPS v1

          BORE:  v5.2.0    v5.2.4

$ ./algorithm3.pl 2e12
       Seconds:  36.968    36.489

$ ./schbench
Latency percentiles (usec)
        50.0th:      29        26
        75.0th:     841       601
        90.0th:    1698      1374
        95.0th:    2260      1942
       *99.0th:    3380      2772
        99.5th:    3764      3036
        99.9th:    4936      3628
           max:    8812      4438

Lagscope 4 million pings
9.0 seconds

CGROUPS v2

          BORE:  v5.2.0    v5.2.4

$ ./algorithm3.pl 2e12
       Seconds:  34.550    34.302

$ ./schbench
Latency percentiles (usec)
        50.0th:     929       821
        75.0th:    2196      2042
        90.0th:    3740      3316
        95.0th:    4920      4248
       *99.0th:    7512      6136
        99.5th:    8496      7000
        99.9th:   10736      8976
           max:   14690     12235

Lagscope 4 million pings
9.2 seconds
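
The lagscope invocation is not shown above; a loopback run of that size might look roughly like this (the options are my guess, not necessarily what was used):

$ lagscope -r &                      # receiver
$ lagscope -s127.0.0.1 -n4000000     # sender, 4 million pings
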
1 Like

Thank you. Coincidentally, I found this post of yours (Phoronix - updated benchmarks - #4 by marioroy) referring to cgroups because I started looking more deeply into them last week. Considering that Clear Linux still uses cgroups v1 in systemd’s hybrid mode, I wondered whether the CL staff had anything to say about this choice on the forums or in the GitHub repositories, especially after reading that cgroups v2, with its unified architecture, is supposedly superior in some respects. And that’s how I found your post.

These results are interesting. Do you know about systemd-cgls --no-pager and systemd-cgtop --depth=20? Can you look into their output to see how algorithm3 and schbench are being launched with respect to their designated cgroups and whether they are being resource limited in any way?
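
For reference, a quick way to check that (a sketch; the grep pattern and process name are just examples):

$ systemd-cgls --no-pager | grep -B2 -A2 schbench   # where the process sits in the cgroup tree
$ systemd-cgtop --depth=20 -n 3                     # per-cgroup CPU/memory/IO usage over 3 iterations
$ cat /proc/$(pgrep -n schbench)/cgroup             # cgroup membership of the running process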

This tool (below) seems to be very good for viewing the cgroup hierarchy and data.

The rpm from Fedora 41 runs fine on CL. Trying to build it from source is dependency hell with a clang-16 dependency. below-0.8.1-1.fc41.x86_64 RPM

In the CPU tab, it shows the number of throttled events. This will only be useful for monitoring the cgroups v2 case because below only works with cgroups v2.
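
For the record, a minimal way to try it, assuming the Fedora 41 rpm mentioned above (the service name comes from that package and may differ):

$ sudo rpm -ivh below-0.8.1-1.fc41.x86_64.rpm
$ sudo systemctl enable --now below     # starts the background recording daemon
$ sudo below live                       # live TUI; the cpu tab includes throttled events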

This is what Sonnet 3.5 said about the results:

The schbench results are particularly interesting because they show a significant performance difference favoring cgroups v1, which contradicts the supposed superiority of cgroups v2.

Let’s consider some possible explanations:
a) Workload characteristics: The performance benefits of cgroups v2 might be more pronounced in certain types of workloads. It’s possible that schbench exercises the system in a way that doesn’t benefit from cgroups v2’s improvements.
b) System configuration: The specific hardware and software configuration of your system could influence the results. For example, the kernel version, CPU architecture, and other system parameters might interact differently with cgroups v1 and v2.
c) Resource isolation: cgroups v2 provides better resource isolation, which could potentially lead to more conservative scheduling decisions in some cases. This might result in higher latencies but potentially better overall system stability or fairness.
d) Scheduler behavior changes: You specifically asked about changes in CFS (Completely Fair Scheduler) behavior with cgroups v2. Indeed, there are some differences:

  • Unified hierarchy: cgroups v2 uses a unified hierarchy, which can change how resources are allocated and managed across different controllers.
  • Pressure stall information (PSI): cgroups v2 introduces PSI, which provides more detailed information about resource contention. This could influence scheduling decisions.
  • Improved throttling: cgroups v2 implements more sophisticated CPU throttling mechanisms, which might affect latency in certain scenarios.

CFS behavior with cgroups v2:

While the core principles of CFS remain the same with cgroups v2, there are indeed some changes that could affect scheduling behavior:
a) Group scheduling: cgroups v2 changes how group scheduling works, potentially affecting how tasks are prioritized across different cgroups.
b) Bandwidth control: The new cpu.weight and cpu.weight.nice parameters in cgroups v2 replace the older cpu.shares, which could lead to different CPU allocation behaviors.
c) Latency optimization: In some cases, the CFS might make different trade-offs between latency and throughput when using cgroups v2, due to the more comprehensive resource management capabilities.
d) Interaction with other subsystems: The unified hierarchy in cgroups v2 means that CPU scheduling decisions might be more influenced by the state of other resources (memory, I/O) than in cgroups v1.

Possible explanations for your results:

a) Scheduler tuning: Your system might be optimized for low-latency operations with cgroups v1, and these optimizations might not translate directly to cgroups v2.
b) Workload sensitivity: The schbench workload might be particularly sensitive to the changes in group scheduling or bandwidth control implemented in cgroups v2.
c) Resource contention: If your system was under different levels of resource contention during the two tests, it could explain the latency differences, as cgroups v2 might handle contention differently.
d) Measurement methodology: Ensure that the benchmarking methodology is consistent between the two tests and that no external factors are influencing the results.

Recommendations:

a) Run more diverse benchmarks to see if the pattern holds across different types of workloads.
b) Analyze system metrics (CPU usage, context switches, memory usage) during the benchmarks to identify any significant differences between cgroups v1 and v2 runs.
c) Experiment with different cgroups v2 configurations to see if you can improve the latency performance.
d) Consider consulting with kernel developers or cgroups maintainers to get insights into why this specific workload might perform differently than expected.
In conclusion, while your results are indeed puzzling given the general consensus about cgroups v2 performance, they highlight the complexity of system performance and the importance of benchmarking specific workloads. The interaction between the CFS, cgroups, and specific workloads can lead to unexpected results, and further investigation would be needed to fully understand the cause of the performance differences you’re observing.
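
One concrete way to see the cpu.weight vs cpu.shares difference mentioned above on a systemd machine (treat this as a sketch; the exact paths depend on how the hierarchy is mounted and which controllers are enabled for the slice):

$ cat /sys/fs/cgroup/system.slice/cpu.weight      # cgroups v2, unified hierarchy
$ cat /sys/fs/cgroup/cpu/system.slice/cpu.shares  # cgroups v1, separate cpu controller mount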

When taking the geometric mean of all the benchmarks conducted on each of the tested Linux distributions, Intel’s Clear Linux was around 27% faster than Ubuntu 24.04 and Arch Linux out-of-the-box. Switching to the Intel P-State “performance” governor from the default did help increase the performance as expected, but even then Clear Linux was still faster by 14%.

https://www.phoronix.com/review/intel-xeon-6e-clear-linux