How does Clear Linux use AutoFDO?

I’m wondering if anyone can enlighten me on the specifics of how Clear Linux applies AutoFDO (as mentioned in places like here, as well as some news blogs, but since I’m a new poster I’m unable to post many links).

I haven’t been able to find anything so far with these kinds of details, but if anyone can point me to something online, that would be fantastic.

In particular:

  • When the distribution is built, when is the profile data collected? The big benefit of AutoFDO is that you don’t need a separate instrumentation build step, but I’m not sure I understand how you could collect profile data before the package is built, in the general case. The about page for Clear Linux suggests that compilation is done twice, so could similar optimization be done using non-Auto FDO (i.e. -fprofile-generate, -fprofile-use)?
  • Which packages get AutoFDO treatment? Is it done for all packages? If not, how are the packages chosen?
  • Does Clear Linux use the same steps documented in the GCC AutoFDO Tutorial (AutoFDO/Tutorial - GCC Wiki)?
  • Those tutorial steps say to pass the executable name to create_gcov, to produce the file needed for gcc’s -fauto-profile option. How does the build process know what executables need to have gcov files created, for each package being compiled? In particular, how does library code get AutoFDO’ed?
  • How is the workload chosen when collecting profile data?

Thanks for any clarification! If any of this information is actually already public, I’d appreciate a pointer to it.

In short, it doesn’t, for packages (as far as I’m aware). Plain FDO (-fprofile-generate, -fprofile-use) is significantly easier and saner when creating the profile data and builds during packaging (and in theory more accurate).
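For reference, the classic two-pass FDO flow boils down to roughly this (a minimal sketch with placeholder file names and workload; real packaging wraps the same flags around the package’s own build system):

```sh
# Pass 1: build with instrumentation; counters are written out
# as .gcda files when the program exits.
gcc -O3 -fprofile-generate -o myprog myprog.c

# Run a representative workload to populate the counters.
./myprog representative-input.dat

# Pass 2: rebuild, letting GCC consume the collected counters.
gcc -O3 -fprofile-use -o myprog myprog.c
```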

Profiling is generally done when there is a representative workload available to generate the profile (and when doing so actually yields a performance improvement). Here’s a non-comprehensive list of packages I’ve seen with full profiling.

bzip2
gcc
libjpeg-turbo
libjpeg-turbo-soname8
libxml2
lua
opencv
openssl
p7zip
php
pixman
python
python3
R
zlib
zstd

The more favored optimization is building an avx2/avx512-optimized package alongside a non-avx2 package, with the right one used based on what your CPU supports. This is easy to do, as it doesn’t require a workload for profiling!
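A minimal sketch of that approach (placeholder source and names; the real packaging does this through the spec file rather than by hand):

```sh
# Baseline build that runs on any x86_64 CPU.
gcc -O3 -o foo foo.c

# The same source rebuilt for AVX2-capable CPUs (Haswell and newer).
gcc -O3 -march=haswell -o foo.avx2 foo.c
```

No profile workload is needed; the only input is the -march level each variant targets.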

In terms of AutoFDO: if you can use FDO instead, do that. Otherwise, the GCC tutorial looks correct. GitHub - google/autofdo is where you’ll find create_gcov. It will profile all parts of the program that are run, including libraries. It all needs to boil down to one file passed to GCC, so the profile workload needs to run all the binaries (I believe gcov files can be created separately and merged, but that adds extra complexity). Only the code that GCC is compiling when -fauto-profile is passed, and that matches up with use in the workload, will be optimized.
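For reference, the tutorial’s flow looks roughly like this (a sketch with placeholder names; perf record -b needs a CPU with last-branch-record support):

```sh
# Sample the uninstrumented, optimized binary while it runs
# a representative workload.
perf record -b ./myprog representative-input.dat

# Convert the perf samples into a gcov file GCC can read.
create_gcov --binary=./myprog --profile=perf.data \
    --gcov=myprog.gcov -gcov_version=1

# Rebuild using the sampled profile.
gcc -O3 -fauto-profile=myprog.gcov -o myprog myprog.c
```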

Usage differs a lot between PGO for a distribution and PGO for an individual use case. PGO for a distribution needs to be generic, with broad coverage (particularly with FDO). Parts that aren’t run in the profile workload can be built to minimize size at the expense of speed, reducing the program size and cache misses. So if your program has A, B, and C workloads, profiling only A should make A faster at the expense of B and C. That’s bad for a distribution. But if users only care about A (or 90% A), then the tradeoff is a big win.

For a singular use case, your workload is the reason you’re considering AutoFDO in the first place: you want to make something specific faster. From what I’ve seen, AutoFDO is used more for larger programs, and where the profile is collected in a live (or mock-live) environment.

Yes, I understand how PGO works in general, and that it can be difficult for a distribution to provide representative profile data. My questions are specifically about the mechanics of how Clear Linux applies PGO/FDO/AutoFDO.

Aside from the article I linked above, other materials provided by Clear Linux suggest that some form of PGO is used widely in the distribution.

Perhaps that impression is wrong, and PGO is only used sparingly across the distribution.

But even in that case I still want to know how the build system actually produces optimized binaries using PGO. For example, the CMake/autoconf/whatever build instructions for the upstream packages certainly aren’t trying to optimize with PGO. So that must be something that Clear Linux is specifically doing when it goes to build these packages.

That Clear-specific process is what I’m interested in. I could see how it could be done manually, but that sounds unrealistically painful even if only a modest part of the distribution is built with PGO.

Most of the magic happens in autospec. This is where we automatically build, run a payload, and then rebuild with the output profile, for instance. See GitHub - clearlinux/autospec (RPM packaging automation tool).

Per the above with autospec: if you go to GitHub - clearlinux-pkgs/zlib you can see the input files for an example package that uses PGO. Once you determine a profile_payload for the package, the process is the same (add a couple of {C,CXX,F}FLAGS for the first build, run the payload, then add the FLAGS that use the profile data for the second build) and is independent of the build system used (meson/autotools/cmake).

The final build spec is generated automatically by autospec, using that payload information to produce the profile.
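To make that concrete, the generated steps boil down to something like this (a hand-written sketch, not autospec’s actual output; the configure/make invocations and the payload command are placeholders):

```sh
# Build 1: instrumented. autospec injects the extra flags into
# the package's {C,CXX,F}FLAGS.
CFLAGS="-O3 -fprofile-generate" ./configure
make

# Run the profile_payload chosen by the packager (placeholder:
# the package's own test suite). This writes .gcda counter files
# into the build tree.
make check

# Build 2: recompile against the collected counters.
make clean    # removes the objects; the .gcda files survive
CFLAGS="-O3 -fprofile-use" ./configure
make
```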

I see, I think that fills in the blanks.

It sounds like vanilla PGO is typically used in Clear Linux then, not AutoFDO specifically.

So I probably misunderstood when I originally came to believe that AutoFDO specifically is used to a large degree in Clear Linux.

Thanks!

I would say that (at least in its current form) AutoFDO isn’t used at all. The AutoFDO package even looks to have been deprecated a couple of years ago. I tried using it myself once, but vanilla PGO seemed to work much better for package builds.

If you’re into the performance stuff, there’s also Function Multi-Versioning (FMV), which is talked about on old blogs (and was even mentioned here quite recently). Again, from what I know, it hasn’t been used for building packages for some time (though you can still use the technique to make FMV patches on CL), since it’s much easier (and more performant) to build and install fully separate libraries with different optimizations (installed in /usr/lib64, /usr/lib64/haswell, /usr/lib64/haswell/avx512_1) than to muck around with patches to each package.
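The library-path approach leans on the dynamic loader, which (on CL) prefers those CPU-specific subdirectories when the hardware supports them. A minimal sketch of the layout (library name and source are placeholders):

```sh
# Generic build, usable on any x86_64 CPU.
gcc -O3 -shared -fPIC -o /usr/lib64/libfoo.so.1 foo.c

# Same source rebuilt with newer instruction sets; ld.so picks
# the most specific directory the running CPU can use.
gcc -O3 -march=haswell -shared -fPIC \
    -o /usr/lib64/haswell/libfoo.so.1 foo.c
gcc -O3 -march=skylake-avx512 -shared -fPIC \
    -o /usr/lib64/haswell/avx512_1/libfoo.so.1 foo.c
```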