NVIDIA and XanMod CL updates

Latency results: The XanMod kernels have PREEMPT preemption enabled. Notice the average latency (microseconds) and completion time when comparing kernels.

Unsure which kernel to select? The XanMod 6.1.x kernel is quite fast. However, the XanMod 6.6.x kernel is snappier for the desktop environment. Applications launch faster. The BORE CPU scheduler is amazing. Of course, the Clear patches help too.

Thank you for your work. 550 on CL is working great.

The ClearMod project supports building Clear’s native kernel with the BORE (Burst-Oriented Response Enhancer) CPU scheduler and with preemption enabled.

LOCALMODCONFIG=1 ./xm-build clear-preempt
./xm-install clear-preempt
sync

How did this come about? Above, I saw a regression when running Clear’s native kernel for latency testing. The max latency was the lowest, but the average latency under load and the time to compute prime numbers were the highest. It turns out the Clear native kernel performs similarly to the XanMod kernel.

I witnessed Clear’s native kernel 6.8.1 + BORE 5.0.1 + preemption. :blossom:

$ ls /lib/modules
6.1.69-1331.ltsprev  6.6.22-160.xmmain-preempt  6.8.1-160.xmclear-preempt

ClearMod project: I replaced HZ_750 with HZ_720, which resolves the compute regression seen with HZ_750. I also replaced HZ_600 with HZ_625, which improves hackbench results compared to HZ_600. So, no odd behavior with the new entries HZ_625, HZ_720, and HZ_800.

1 /  100 = 0.01
1 /  250 = 0.004
1 /  300 = 0.00333333333333333
1 /  500 = 0.002
1 /  625 = 0.0016
1 /  720 = 0.00138888888888889
1 /  800 = 0.00125
1 / 1000 = 0.001
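
The table above is the timer tick period in seconds for each Hz value. As a minimal sketch (my own illustration, not part of ClearMod), the same arithmetic can be done exactly with fractions and shown in microseconds. The helper names `tick_period_us` and `has_whole_microsecond_tick` are hypothetical, and the "whole microsecond" check is only one way to read why 1 / 800 and 1 / 625 look tidy.

```python
from fractions import Fraction

# Timer frequencies listed in the table above.
HZ_VALUES = [100, 250, 300, 500, 625, 720, 800, 1000]


def tick_period_us(hz: int) -> Fraction:
    """Return the tick period in microseconds as an exact fraction."""
    return Fraction(1_000_000, hz)


def has_whole_microsecond_tick(hz: int) -> bool:
    """True when the tick period is an integer number of microseconds."""
    return tick_period_us(hz).denominator == 1


if __name__ == "__main__":
    for hz in HZ_VALUES:
        period = tick_period_us(hz)
        whole = "yes" if has_whole_microsecond_tick(hz) else "no"
        print(f"HZ_{hz:<5} {float(period):10.3f} us  whole microseconds: {whole}")
```

With these inputs, HZ_300 and HZ_720 are the only entries whose period is not a whole number of microseconds.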

How did the Hz values come about? HZ_800 was inspired by computing 1 / 800 = 0.00125. That looks “graceful” and powerful. I searched the web to see if HZ_800 is used elsewhere. HZ_625 came from hamadmarri’s baby_linux project. That inspired me to decrease HZ_750 down to HZ_720.

Notice how max latency decreases from HZ_625 to HZ_800, with HZ_1000 slightly above HZ_800. Averages are low for all Hz values. This test ran 4 million pings (1 million per sender, concurrently).

HZ_625

Sender 1: Minimum = 3.000us, Maximum = 20484.250us, Average = 6.947us
Sender 2: Minimum = 3.000us, Maximum = 20484.000us, Average = 7.416us
Sender 3: Minimum = 3.250us, Maximum = 20493.500us, Average = 9.395us
Sender 4: Minimum = 3.250us, Maximum = 20495.750us, Average = 9.214us

HZ_720

Sender 1: Minimum = 3.000us, Maximum = 20365.250us, Average = 6.563us
Sender 2: Minimum = 3.000us, Maximum = 20364.500us, Average = 7.451us
Sender 3: Minimum = 3.250us, Maximum = 20372.250us, Average = 9.186us
Sender 4: Minimum = 3.000us, Maximum = 20371.250us, Average = 8.934us

HZ_800

Sender 1: Minimum = 2.750us, Maximum = 19873.000us, Average = 6.463us
Sender 2: Minimum = 3.000us, Maximum = 19879.500us, Average = 7.356us
Sender 3: Minimum = 3.000us, Maximum = 19882.250us, Average = 9.223us
Sender 4: Minimum = 3.250us, Maximum = 19881.750us, Average = 8.972us

HZ_1000

Sender 1: Minimum = 2.750us, Maximum = 19972.250us, Average = 7.050us
Sender 2: Minimum = 3.000us, Maximum = 19973.250us, Average = 7.440us
Sender 3: Minimum = 2.500us, Maximum = 19983.000us, Average = 9.332us
Sender 4: Minimum = 3.000us, Maximum = 19990.750us, Average = 9.120us
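
To compare the runs at a glance, the sender lines can be reduced to one worst-case maximum and one mean average per Hz value. This is a minimal sketch that assumes the exact line format shown above; `summarize` and `SENDER_RE` are hypothetical names, not part of the ping test itself.

```python
import re
from statistics import mean

# Sender results copied from the four runs above.
RESULTS = """\
HZ_625
Sender 1: Minimum = 3.000us, Maximum = 20484.250us, Average = 6.947us
Sender 2: Minimum = 3.000us, Maximum = 20484.000us, Average = 7.416us
Sender 3: Minimum = 3.250us, Maximum = 20493.500us, Average = 9.395us
Sender 4: Minimum = 3.250us, Maximum = 20495.750us, Average = 9.214us
HZ_720
Sender 1: Minimum = 3.000us, Maximum = 20365.250us, Average = 6.563us
Sender 2: Minimum = 3.000us, Maximum = 20364.500us, Average = 7.451us
Sender 3: Minimum = 3.250us, Maximum = 20372.250us, Average = 9.186us
Sender 4: Minimum = 3.000us, Maximum = 20371.250us, Average = 8.934us
HZ_800
Sender 1: Minimum = 2.750us, Maximum = 19873.000us, Average = 6.463us
Sender 2: Minimum = 3.000us, Maximum = 19879.500us, Average = 7.356us
Sender 3: Minimum = 3.000us, Maximum = 19882.250us, Average = 9.223us
Sender 4: Minimum = 3.250us, Maximum = 19881.750us, Average = 8.972us
HZ_1000
Sender 1: Minimum = 2.750us, Maximum = 19972.250us, Average = 7.050us
Sender 2: Minimum = 3.000us, Maximum = 19973.250us, Average = 7.440us
Sender 3: Minimum = 2.500us, Maximum = 19983.000us, Average = 9.332us
Sender 4: Minimum = 3.000us, Maximum = 19990.750us, Average = 9.120us
"""

SENDER_RE = re.compile(
    r"Sender \d+: Minimum = ([\d.]+)us, Maximum = ([\d.]+)us, Average = ([\d.]+)us"
)


def summarize(text: str) -> dict:
    """Group sender lines by their HZ_ heading and reduce each group."""
    per_hz = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("HZ_"):
            current = line
            per_hz[current] = []
            continue
        match = SENDER_RE.match(line)
        if match and current is not None:
            # Each row is (minimum, maximum, average) in microseconds.
            per_hz[current].append(tuple(float(g) for g in match.groups()))
    return {
        hz: {
            "worst_max_us": max(row[1] for row in rows),
            "mean_avg_us": mean(row[2] for row in rows),
        }
        for hz, rows in per_hz.items()
    }


if __name__ == "__main__":
    for hz, stats in summarize(RESULTS).items():
        print(f"{hz:<8} worst max {stats['worst_max_us']:10.3f} us"
              f"   mean avg {stats['mean_avg_us']:.3f} us")
```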

The ClearMod project defaults to HZ_800.

I removed XanMod LTS 6.1.y, Main 6.6.y, and RT 6.6.y variants. That leaves only XanMod Edge and Clear’s Native; both 6.8.y. This makes it more manageable, consuming less time to QA.

ClearMod release 165: Bump kernels to 6.8.2

  1. Update Clear native and XanMod edge kernels to 6.8.2.
  2. Enhance fetch script to acquire latest from kernel.org, if needed for Clear.
  3. Add kbuild generic x86_64 levels for Clear.
  4. Disable watermark boosting by default.
  5. Refactor update_curr(), entity_tick() in sched/fair.

Off-topic:

The GNOME 46.0 Xorg environment, particularly gnome-terminal, is not happy. There is a delay of at least 1 second (occurring randomly) before commands print their output; e.g. ls. The issue began with Clear 41280. Unfortunately, NVIDIA drivers have not yet reached reliability using Wayland, particularly Xwayland.

Clear 41270 is currently the last stable Xorg/GNOME environment for NVIDIA graphics. In general, it’s advised to move away from Xorg anyway. Maybe the next NVIDIA 555 driver will be better.

Someone recently (using Radeon 780M graphics, embedded in the APU) tried the Clear 41300 and 41270 live desktop images to no avail: black screen. What does one say? I mentioned a time when Clear Linux was reliable.

I built three variants of the Clear native kernel: (2) BORE without preemption, (3) BORE with preemption, and (4) BORE with preemption using HZ=800. Kernel (1) is the stock Clear native kernel.

1. 6.8.2-1420.native          HZ=1000
2. 6.8.2-166.xmclear-default  HZ=1000  BORE 5.0.3
3. 6.8.2-166.xmclear-preempt  HZ=1000  BORE 5.0.3
4. 6.8.2-166.xmclear-preempt  HZ=800   BORE 5.0.3

Compute only: Running with the idle scheduling attribute (chrt -i 0) reaches non-preempt performance.

                       Clear Native    With BORE  Preempt+BORE Preempt+BORE
                           HZ=1000      HZ=1000      HZ=1000      HZ=800

$ ./algorithm3.pl 1e12     14.857s      14.770s      15.283s      15.244s
$ chrt -i 0 \
  ./algorithm3.pl 1e12     14.691s      14.671s      14.633s      14.644s
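
To put numbers on the compute-only table, the time saved by running under chrt -i 0 can be expressed as a percentage of the default run. This is a small sketch using the timings above; `idle_gain_percent` is a hypothetical helper name.

```python
# (default run, chrt -i 0 run) in seconds, copied from the table above.
TIMES_S = {
    "Clear Native HZ=1000": (14.857, 14.691),
    "With BORE HZ=1000": (14.770, 14.671),
    "Preempt+BORE HZ=1000": (15.283, 14.633),
    "Preempt+BORE HZ=800": (15.244, 14.644),
}


def idle_gain_percent(default_s: float, idle_s: float) -> float:
    """Percentage of the default run time saved by the idle-policy run."""
    return (default_s - idle_s) / default_s * 100.0


if __name__ == "__main__":
    for kernel, (default_s, idle_s) in TIMES_S.items():
        gain = idle_gain_percent(default_s, idle_s)
        print(f"{kernel:<22} {default_s:.3f}s -> {idle_s:.3f}s  ({gain:.1f}% faster)")
```

The preempt kernels gain the most here, which is what brings them back to the non-preempt times.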

Next, four tasks running concurrently to capture latency results.

Xorg/GNOME: YouTube playback consumes less CPU on Xorg using NVIDIA.

Chromium Browser:  https://slowroads.io/
   Google Chrome:  https://www.youtube.com/watch?v=aqz-KE-bpKQ (1440p60 HD)

                       Clear Native    With BORE  Preempt+BORE Preempt+BORE
                           HZ=1000      HZ=1000      HZ=1000      HZ=800
$ chrt -i 0 \
  ./algorithm3.pl 2e12     37.579s      37.774s      38.539s      38.767s
$ ./schbench 
Latency percentiles (usec)
            50.0th:           37           30           32           32
            75.0th:          835          476          739          769
            90.0th:         1694         1102         2060         1806
            95.0th:         2372         1790         2676         2668
           *99.0th:         3884         2972         4264         3884
            99.5th:         4168         3340         4840         4424
            99.9th:         5368         4264         6024         5464
            min=0, max=     7120         6119         7607         7512
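
The schbench columns are easier to compare as a percentage change against the Clear native kernel. This is a minimal sketch using the values from the table above; `change_vs_baseline` is a hypothetical helper, and negative numbers mean lower latency than the baseline.

```python
# Column order matches the table above.
KERNELS = (
    "Clear Native HZ=1000",
    "With BORE HZ=1000",
    "Preempt+BORE HZ=1000",
    "Preempt+BORE HZ=800",
)

# schbench latency percentiles in microseconds, copied from the table above.
PERCENTILES_US = {
    "50.0th": (37, 30, 32, 32),
    "75.0th": (835, 476, 739, 769),
    "90.0th": (1694, 1102, 2060, 1806),
    "95.0th": (2372, 1790, 2676, 2668),
    "99.0th": (3884, 2972, 4264, 3884),
    "99.5th": (4168, 3340, 4840, 4424),
    "99.9th": (5368, 4264, 6024, 5464),
    "max": (7120, 6119, 7607, 7512),
}


def change_vs_baseline(values):
    """Percent change of each column relative to the first (baseline) column."""
    baseline = values[0]
    return [(value - baseline) / baseline * 100.0 for value in values]


if __name__ == "__main__":
    print("percentile  " + "  ".join(f"{name:>22}" for name in KERNELS))
    for label, values in PERCENTILES_US.items():
        changes = change_vs_baseline(values)
        print(f"{label:<10}  " + "  ".join(f"{c:>+21.1f}%" for c in changes))
```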

The apples-to-apples comparison to Clear Linux’s native kernel is BORE without preemption. Running background jobs? Either chrt -i 0 or a preempt kernel helps keep the Slow Roads demonstration smooth.

HZ=800 performs better on my system for preempted kernels. For background jobs, running with idle attribute reaches non-preempt performance.

Results captured on an AMD Ryzen Threadripper 3970X machine.

ClearMod Simplification; Release 168.

  1. Single rpmbuild folder where the SPEC files reside.
  2. Four kernels; Clear, Edge, BORE, and ECHO (new).
  3. Rename kernels to shorter names, without preempt suffix.
  4. Keep only essential sched-fair patches beneficial for BORE.
  5. Rename xm-list-kernels to xm-kernels.
  6. Change Hz default to 800. Remove HZ_720.
  7. Bump kernels to 6.8.4.
  8. Build kernels in tmp folder.

Already using ClearMod? Boot into a Clear OS installed kernel. Run ./xm-uninstall all. Afterwards, you can git pull or re-clone the repository. The xm-uninstall script will continue to support removal for older kernels, though deprecated.

clear - Clear Linux native kernel + preemption
bore  - XanMod Edge kernel + preemption + BORE
echo  - XanMod Edge kernel + preemption + ECHO
edge  - XanMod Edge kernel + preemption

The fetch-src script takes no arguments, because there is now a single rpmbuild folder. I renamed xm-list-kernels to xm-kernels.

./fetch-src
./xm-build bore | clear | echo | edge
./xm-install bore | clear | echo | edge [<release>]
./xm-uninstall bore | clear | echo | edge [<release>]
./xm-uninstall all
./xm-kernels

I built two kernels in little time, possible with LOCALMODCONFIG=1.

$ LOCALMODCONFIG=1 ./xm-build bore
$ LOCALMODCONFIG=2 ./xm-build echo

What does installation look like? I captured the output. The process is NVIDIA-aware and will build the NVIDIA drivers automatically via dkms.

$ ./xm-install bore
Installing linux-xmbore
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:linux-xmbore-license-6.8.4-168   ################################# [ 25%]
   2:linux-xmbore-6.8.4-168           ################################# [ 50%]
   3:linux-xmbore-extra-6.8.4-168     ################################# [ 75%]
   4:linux-xmbore-dev-6.8.4-168       ################################# [100%]
Building kernel drivers for NVIDIA graphics.
done.

$ ./xm-install echo
Installing linux-xmecho
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:linux-xmecho-license-6.8.4-168   ################################# [ 25%]
   2:linux-xmecho-6.8.4-168           ################################# [ 50%]
   3:linux-xmecho-extra-6.8.4-168     ################################# [ 75%]
   4:linux-xmecho-dev-6.8.4-168       ################################# [100%]
Building kernel drivers for NVIDIA graphics.
done.

The BORE and ECHO CPU schedulers are amazing. Reminder: do not install too many kernels, to avoid filling your boot partition. Starting fresh is no problem. Simply boot into a Clear OS installed kernel and run ./xm-uninstall all.
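
Before installing another kernel, it can help to check how much room is left on the boot partition. This is a minimal sketch, not part of ClearMod: the mount point `/boot` and the 100 MiB threshold are assumptions to adjust for your system, and `free_mib` and `has_room` are hypothetical helper names.

```python
import os
import shutil

BOOT_PATH = "/boot"   # assumption: where the boot partition is mounted
NEEDED_MIB = 100.0    # assumption: rough headroom wanted for one more kernel


def free_mib(path: str) -> float:
    """Return the free space, in MiB, of the filesystem containing path."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def has_room(path: str, needed_mib: float) -> bool:
    """True when at least needed_mib MiB are free on path's filesystem."""
    return free_mib(path) >= needed_mib


if __name__ == "__main__":
    if os.path.isdir(BOOT_PATH):
        print(f"{BOOT_PATH}: {free_mib(BOOT_PATH):.0f} MiB free")
        if not has_room(BOOT_PATH, NEEDED_MIB):
            print("Low on space; consider removing unused kernels first.")
    else:
        print(f"{BOOT_PATH} not found; adjust BOOT_PATH for your system.")
```

If the boot partition is not mounted at that path, the reported number is for whichever filesystem contains it, so check the mount first.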

Don’t we all go through phases? :smile: Thanks @Businux for your patience. I am running CL again because of ClearMod, and CL has an input-remapper bundle!

I installed CL just to try your ECHO kernel! Thank you for your hard work @marioroy. I do not know the nitty-gritty of kernels, but the GNOME System Monitor during ‘fine-tuning a diffusion model’ gives a good idea of the differences between the vanilla kernel and ECHO!

Vanilla

ECHO

ClearMod Release 172.

  1. Bump kernel config to Clear 6.8.6, without xz.
  2. Remove xz compression in the SPEC files.

NVIDIA on Clear Update

  1. Bump the CUDA SDK for 550 to 12.4.1.
  2. Bump the 550 driver to 550.76.
  3. Bump the Beta Vulkan driver to 550.40.59.

ClearMod Release 173

New variants: bore-rt, clear-rt, echo-rt, and edge-rt with the Linux 6.8 real-time patch set. Reminder: do not install too many kernels, to avoid filling the boot partition. I uninstalled the XanMod kernels I no longer needed.

The real-time variants have a greater responsibility for sustaining low latency. It shows here in “get properties”. That involves threads doing mutex locking for inserting/updating shared map containers. Testing was captured on a 32-core (64-threads) machine.
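
For readers unfamiliar with the pattern, here is a minimal sketch of threads taking a mutex to insert into or update shared map containers. It is my own illustration in Python, not the LLiL code (which lives in the gist below), and splitting the map into shards with one lock each is an assumption about how such contention is usually reduced. `ShardedCounter`, `count_parallel`, and `NUM_SHARDS` are hypothetical names.

```python
import threading

NUM_SHARDS = 8  # hypothetical shard count


class ShardedCounter:
    """Key -> running total, split across shards that each have a mutex."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self._shards = [dict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def add(self, key: str, count: int) -> None:
        index = hash(key) % len(self._shards)
        # Only threads touching the same shard wait on each other.
        with self._locks[index]:
            shard = self._shards[index]
            shard[key] = shard.get(key, 0) + count

    def merged(self) -> dict:
        result = {}
        for lock, shard in zip(self._locks, self._shards):
            with lock:
                result.update(shard)  # a key lives in exactly one shard
        return result


def count_parallel(pairs, num_threads: int = 4) -> dict:
    """Insert or update (key, count) pairs from several threads."""
    counter = ShardedCounter()

    def worker(chunk):
        for key, count in chunk:
            counter.add(key, count)

    chunks = [pairs[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.merged()


if __name__ == "__main__":
    print(count_parallel([("apple", 1), ("pear", 2), ("apple", 3)] * 1000))
```

Every insert or update goes through a lock, so lock hand-off latency shows up directly in a phase like "get properties".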

Edit: Here, the difference between non-RT and RT is due to approximately 183 million long string allocations taking longer on RT. I made a new version (GitHub Gist URL) allocating memory dynamically, in chunks. That resolved the issue on RT.

# XanMod 6.8.7 testing using Long List is Long (LLiL) benchmark.
# https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5

$ ./llil4map long* long* long* | cksum
llil4map start
use OpenMP            BORE      ECHO      EDGE     BORE-RT   ECHO-RT   EDGE-RT
use boost sort
get properties       11.980s   11.836s   12.041s   16.938s   16.744s   17.192s
map to vector         1.245s    1.292s    1.224s    1.341s    1.274s    1.360s
vector stable sort   10.306s   11.025s   10.334s   10.327s   11.665s   10.466s
write stdout          1.882s    1.364s    1.817s    2.651s    1.581s    2.455s
total time           25.416s   25.519s   25.417s   31.260s   31.265s   31.475s
    count lines    970195200
    count unique   295755152
29612263 5038456270

Note: One cannot use clang++ to build the LLiL demonstration. LLVM OpenMP support was dropped in CL 39970. The following are the minimum files missing for OpenMP support on the CPU.

Removing extra files under /usr
 -> Extra file: /usr/lib64/libompd.so -> deleted
 -> Extra file: /usr/lib64/libomp.so -> deleted
 -> Extra file: /usr/lib64/libiomp5.so -> deleted
 -> Extra file: /usr/lib64/libarcher.so -> deleted
 -> Extra file: /usr/lib64/cmake/openmp/FindOpenMPTarget.cmake -> deleted
 -> Extra file: /usr/lib64/cmake/openmp/ -> deleted
 -> Extra file: /usr/lib64/clang/17/include/ompt.h -> deleted
 -> Extra file: /usr/lib64/clang/17/include/ompt-multiplex.h -> deleted
 -> Extra file: /usr/lib64/clang/17/include/omp.h -> deleted
 -> Extra file: /usr/lib64/clang/17/include/omp-tools.h -> deleted

For the LLiL examples, binaries built with clang++ or NVIDIA HPC nvc++ run faster than those built with g++. Locally, I built LLVM 17, installed it to /opt/llvm-17, and restored the missing files.

Thank you, @marioroy. The echo kernel is perfect, but echo-rt crashes on KDE 6 Wayland when I open the Discover software centre. Echo-rt has the most intriguing lines on the GNOME System Monitor!!

I prefer GNOME over KDE, but GNOME won’t start in Wayland. I read that GNOME 46.1 has a fix for NVIDIA graphics owners, but it apparently needs a newer NVIDIA driver.

https://www.phoronix.com/news/GNOME-Mutter-46.1-Released

Thanks for the bug report, @marioroy. I have no knowledge of finding crash logs, but I will learn when I have a bit of free time. Yes, I am on the proprietary driver. Just the echo kernel without RT hits the sweet spot for me for now.
