https://www.phoronix.com/review/intel-ubuntu2404-fedora40
Here are some fresh benchmarks looking at how Ubuntu 24.04 LTS and Fedora Workstation 40 are competing with Intel’s in-house Clear Linux distribution that offers aggressive x86_64 Linux performance defaults and the best possible out-of-the-box Linux performance on modern x86_64 hardware.
Congratulations to the CL team!
Thanks for posting… Copied your link to the CL performance thread
Something not mentioned by reviewers is that Clear Linux defaults to CGROUPSv1 whereas most other Linux distributions default to CGROUPSv2. Some tests perform better with CGROUPSv1.
For a fairer comparison with Clear Linux, systemd.unified_cgroup_hierarchy=0 is the kernel command-line option that selects CGROUPSv1. Moreover, CONFIG_CGROUP_RDMA is disabled in Clear, so the boot arguments to pass are:
systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma
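A minimal sketch of applying these on Ubuntu, assuming the default GRUB setup (adjust the existing GRUB_CMDLINE_LINUX_DEFAULT contents to match your install):
# /etc/default/grub -- append the two options to the existing defaults
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma"
$ sudo update-grub
$ sudo reboot
$ cat /proc/cmdline   # verify the options are active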
I ran schbench five times while computing prime numbers on the CPU. Notice the difference between CGROUPSv2 and CGROUPSv1.
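Roughly, the runs look like this; the stress-ng invocation is only a stand-in for the prime-number workload, not the exact program used here:
$ stress-ng --cpu "$(nproc)" --cpu-method prime --timeout 600 &   # background CPU load (stand-in)
$ for i in $(seq 5); do ./schbench; done
$ kill %1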
Ubuntu 24.04 CGROUPSv2 (stock kernel, no tuning)
$ ./schbench
Latency percentiles (usec)
50.0th: 1001 931 855 947 899
75.0th: 2828 2420 2428 2564 2468
90.0th: 4648 3924 3956 4036 3940
95.0th: 6552 5352 5496 5560 5240
*99.0th: 10448 9136 8688 9264 8208
99.5th: 11792 10832 9616 11024 9680
99.9th: 14704 14096 13232 14640 12976
max: 18364 20105 17489 19906 20140
Ubuntu 24.04 CGROUPSv2 (stock kernel, with tuning)
$ ./schbench
Latency percentiles (usec)
50.0th: 651 639 719 749 619
75.0th: 2212 2124 2212 2276 2156
90.0th: 3404 3172 3492 3700 3236
95.0th: 4824 4520 4856 5000 4520
*99.0th: 7336 6664 7272 7496 6888
99.5th: 8304 7592 8656 8464 7976
99.9th: 11888 10928 11792 10672 10832
max: 14881 14081 16048 12195 13832
Ubuntu 24.04 CGROUPSv1 (stock kernel, with tuning including boot args: systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma)
$ ./schbench
Latency percentiles (usec)
50.0th: 31 31 30 31 31
75.0th: 803 851 785 779 825
90.0th: 2140 2164 2132 2132 2148
95.0th: 2692 2732 2684 2676 2676
*99.0th: 4092 4184 4184 3988 4020
99.5th: 4920 4872 4904 4776 4020
99.9th: 5976 5800 5832 5864 5848
max: 8832 8388 7650 8824 7505
Ubuntu 24.04 CGROUPSv1 (rebuilt kernel with config changes and -march=x86-64-v3, plus tuning including boot args: systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma)
$ ./schbench
Latency percentiles (usec)
50.0th: 25 25 25 26 26
75.0th: 675 683 575 657 657
90.0th: 1938 2034 1934 1914 1950
95.0th: 2532 2572 2524 2532 2556
*99.0th: 3940 3988 3972 3956 3956
99.5th: 4680 4824 4664 4824 4728
99.9th: 5560 5592 5528 5560 5512
max: 7647 7376 7749 7062 6914
Clear Linux CGROUPSv1 (ClearMod kernel, with tuning)
$ ./schbench
Latency percentiles (usec)
50.0th: 28 29 29 29 29
75.0th: 941 955 879 955 959
90.0th: 1738 1726 1710 1738 1758
95.0th: 2364 2380 2324 2340 2356
*99.0th: 3444 3460 3348 3228 3348
99.5th: 3764 3772 3708 3668 3668
99.9th: 4984 4840 4552 4568 4552
max: 7022 7223 6458 7061 6447
Interesting!
How did you tweak Ubuntu to get such improvements? Is the parameter you were talking about a kernel argument?
I'd like to try it on Fedora.
Thanks!
It’s surprising to see that the Powersave option can compete with the Performance one. I’d like more information about this (watt usage, default power profile, etc.), but I can’t find anything precise.
Yes, the kernel arguments I mentioned. BTW, I added two more sections to the list.
- Ubuntu stock kernel, no tuning
- Rebuilt kernel with config changes and -march=x86-64-v3
Great! Thanks!
Probably too technical for me, but very interesting. Compiling the kernel seems too hard for me.
Any way for you to benchmark this on Fedora?
If I want to test by myself, I only have to edit the kernel parameters with:
systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma
and then rebuild my kernel? It seems so “easy” that I am afraid I’m not understanding correctly!
Greetings from France!
Ogu
When building your kernel, only change the argument from -march=native to -march=x86-64-v3. More info here: Kernel/Traditional compilation - ArchWiki
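If you pass the flag on the make command line rather than patching the kernel Makefile, one possible way is via KCFLAGS, the standard kbuild hook for extra compiler flags (a sketch only; your distro's packaging workflow may differ):
$ make KCFLAGS="-march=x86-64-v3" -j"$(nproc)"
$ sudo make modules_install install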
Yup, it really does sound easy. I don’t know how kernel compilation is on Fedora, though.
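For Fedora, note that the boot arguments alone don't require a rebuild; grubby is the usual way to add them (the rebuild is only needed for the -march and config changes):
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0 cgroup_disable=rdma"
$ sudo reboot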
Is it possible to run the same benchmarks on a cgroupsv2 activated system? Add these to the kernel parameters to really block cgroupsv1 operation:
systemd.unified_cgroup_hierarchy=true
systemd.legacy_systemd_cgroup_controller=false
cgroup_no_v1=all
There may be resource control configuration (cgroup-related) differences between Ubuntu 24.04 and CL that may become evident when comparing cgroups v1 vs cgroups v2. But for a clearer view, we need a test condition for CL under cgroups v2.
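After rebooting, one way to confirm which mode is actually active (standard commands, nothing distro-specific):
$ stat -fc %T /sys/fs/cgroup    # cgroup2fs = unified v2, tmpfs = legacy/hybrid v1
$ mount | grep cgroup           # shows which hierarchies are mounted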
There is a peer-reviewed paper showing that at least in terms of network latency, cgroups v2 is better than cgroups v1 and that can be related to better implementation:
We verify the claim that cgroups v2 has a more efficient implementation by measuring the number of instructions executed with the Linux performance analysis tool perf [18]. The measurement includes the container startup, 60 s packet forwarding at 1.52 Mpkt/s, and the shutdown. The experiment is repeated three times, taking the average value. The resulting data is presented in Table 1 and in the reproduction collection. We observe that for cgroups v1, the number of instructions executed is about 2.4 % higher, and about 2.2 % more conditional branches are executed compared to v2. The difference in process migrations of 148 in v2 and 297 in v1 is noteworthy. Given that we disabled scheduler load balancing, this finding is unexpected.
Links: https://www.perplexity.ai/search/is-cgroup-v2-slower-than-cgrou-UQTPxNw.SbOt5KCdQ6VjnQ
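For reference, the kind of measurement the paper describes can be approximated with stock perf events; the workload below is just schbench as an example, not the paper's packet-forwarding setup:
$ perf stat -e instructions,branches,cpu-migrations -- ./schbench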
Clear Linux (ClearMod kernel, with tuning, HZ_800). Results of algorithm3 and schbench running simultaneously.
CGROUPS v1
BORE: v5.2.0 v5.2.4
$ ./algorithm3.pl 2e12
Seconds: 36.968 36.489
$ ./schbench
Latency percentiles (usec)
50.0th: 29 26
75.0th: 841 601
90.0th: 1698 1374
95.0th: 2260 1942
*99.0th: 3380 2772
99.5th: 3764 3036
99.9th: 4936 3628
max: 8812 4438
Lagscope 4 million pings
9.0 seconds
CGROUPS v2
BORE: v5.2.0 v5.2.4
$ ./algorithm3.pl 2e12
Seconds: 34.550 34.302
$ ./schbench
Latency percentiles (usec)
50.0th: 929 821
75.0th: 2196 2042
90.0th: 3740 3316
95.0th: 4920 4248
*99.0th: 7512 6136
99.5th: 8496 7000
99.9th: 10736 8976
max: 14690 12235
Lagscope 4 million pings
9.2 seconds
Thank you. Coincidentally, I found this post of yours (Phoronix - updated benchmarks - #4 by marioroy) referring to cgroups because I started looking more deeply into it last week. Considering that Clear Linux still uses cgroups v1 in systemd’s hybrid mode, I wondered if the CL staff had anything to say about this choice on the forums or in the GitHub repositories, especially after reading that cgroups v2, with its unified architecture, is supposedly superior in some respects. And that’s how I found your post.
These results are interesting. Do you know about systemd-cgls --no-pager and systemd-cgtop --depth=20? Can you look into their output to see how algorithm3 and schbench are being launched with respect to their designated cgroups and whether they are being resource limited in any way?
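In case it helps, a quick way to see where the two processes land (the stock systemd tools named above; the grep pattern is only illustrative):
$ systemd-cgls --no-pager | grep -B 3 -E 'schbench|algorithm3'
$ systemd-cgtop --depth=20 --batch --iterations=1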
The below tool seems to be very good for viewing the cgroups hierarchy and data.
The below-0.8.1-1.fc41.x86_64 RPM from Fedora 41 runs fine on CL. Trying to build it from source is dependency hell because of a clang-16 dependency.
In the CPU tab, it shows the number of throttled events. This will only be useful for monitoring the cgroups v2 case because below only works with cgroups v2.
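A rough sketch of the RPM route as I'd expect it to look (the service unit name and the live subcommand are assumptions from below's packaging and upstream docs; verify locally):
$ sudo rpm -i below-0.8.1-1.fc41.x86_64.rpm
$ sudo systemctl enable --now below     # background recorder (unit name assumed)
$ sudo below live                       # interactive view; the CPU tab shows throttled events under cgroups v2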
This is what Sonnet 3.5 said about the results:
The schbench results are particularly interesting because they show a significant performance difference favoring cgroups v1, which contradicts the supposed superiority of cgroups v2.
Let’s consider some possible explanations:
a) Workload characteristics: The performance benefits of cgroups v2 might be more pronounced in certain types of workloads. It’s possible that schbench exercises the system in a way that doesn’t benefit from cgroups v2’s improvements.
b) System configuration: The specific hardware and software configuration of your system could influence the results. For example, the kernel version, CPU architecture, and other system parameters might interact differently with cgroups v1 and v2.
c) Resource isolation: cgroups v2 provides better resource isolation, which could potentially lead to more conservative scheduling decisions in some cases. This might result in higher latencies but potentially better overall system stability or fairness.
d) Scheduler behavior changes: You specifically asked about changes in CFS (Completely Fair Scheduler) behavior with cgroups v2. Indeed, there are some differences:
Unified hierarchy: cgroups v2 uses a unified hierarchy, which can change how resources are allocated and managed across different controllers.
Pressure stall information (PSI): cgroups v2 introduces PSI, which provides more detailed information about resource contention. This could influence scheduling decisions.
Improved throttling: cgroups v2 implements more sophisticated CPU throttling mechanisms, which might affect latency in certain scenarios.
CFS behavior with cgroups v2:
While the core principles of CFS remain the same with cgroups v2, there are indeed some changes that could affect scheduling behavior:
a) Group scheduling: cgroups v2 changes how group scheduling works, potentially affecting how tasks are prioritized across different cgroups.
b) Bandwidth control: The new cpu.weight and cpu.weight.nice parameters in cgroups v2 replace the older cpu.shares, which could lead to different CPU allocation behaviors.
c) Latency optimization: In some cases, the CFS might make different trade-offs between latency and throughput when using cgroups v2, due to the more comprehensive resource management capabilities.
d) Interaction with other subsystems: The unified hierarchy in cgroups v2 means that CPU scheduling decisions might be more influenced by the state of other resources (memory, I/O) than in cgroups v1.
Possible explanations for your results:
a) Scheduler tuning: Your system might be optimized for low-latency operations with cgroups v1, and these optimizations might not translate directly to cgroups v2.
b) Workload sensitivity: The schbench workload might be particularly sensitive to the changes in group scheduling or bandwidth control implemented in cgroups v2.
c) Resource contention: If your system was under different levels of resource contention during the two tests, it could explain the latency differences, as cgroups v2 might handle contention differently.
d) Measurement methodology: Ensure that the benchmarking methodology is consistent between the two tests and that no external factors are influencing the results.
Recommendations:
a) Run more diverse benchmarks to see if the pattern holds across different types of workloads.
b) Analyze system metrics (CPU usage, context switches, memory usage) during the benchmarks to identify any significant differences between cgroups v1 and v2 runs.
c) Experiment with different cgroups v2 configurations to see if you can improve the latency performance.
d) Consider consulting with kernel developers or cgroups maintainers to get insights into why this specific workload might perform differently than expected.
In conclusion, while your results are indeed puzzling given the general consensus about cgroups v2 performance, they highlight the complexity of system performance and the importance of benchmarking specific workloads. The interaction between the CFS, cgroups, and specific workloads can lead to unexpected results, and further investigation would be needed to fully understand the cause of the performance differences you’re observing.
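To make the cpu.weight vs cpu.shares point concrete, a minimal sketch using standard systemd knobs (the values are arbitrary examples):
$ systemd-run --scope -p CPUWeight=200 ./schbench    # cgroups v2 (unified): weight, default 100, range 1-10000
$ systemd-run --scope -p CPUShares=2048 ./schbench   # cgroups v1 (legacy/hybrid): cpu.shares, default 1024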
When taking the geometric mean of all the benchmarks conducted on each of the tested Linux distributions, Intel’s Clear Linux was around 27% faster than Ubuntu 24.04 out-of-the-box and Arch Linux. Switching to the Intel P-State “performance” governor from the default did help increase the performance as expected, but even then Clear Linux was still faster by 14%.