Optimized Binaries for the AMD64 / x86_64 Architecture (v2)
Summary
Individual packages can already provide optimized libraries via the glibc-hwcaps mechanism. This approach will be extended to executables. The package provides an optimized variant of a binary in a different directory. The binary in /usr/bin is replaced by a symlink to a small helper program. At runtime, this program finds the most appropriate variant and executes it.
The decision about which packages provide optimized code, and at which level, will be made by individual package maintainers based on benchmark results. A few programs/packages will be updated by the Change Owners to show how the mechanism works.
Owner
- Name: Zbigniew Jędrzejewski-Szmek
- Name: Michel Lind
- Name: José Relvas
- Emails: zbyszek@in.waw.pl, salimma@fedoraproject.org
Current status
- Targeted release: Fedora Linux 42
- Last updated: 2024-12-21
- [<link to devel-announce post will be added by Wrangler> Announced]
- [<will be assigned by the Wrangler> Discussion thread]
- FESCo issue: <will be assigned by the Wrangler>
- Tracker bug: <will be assigned by the Wrangler>
- Release notes tracker: <will be assigned by the Wrangler>
Detailed Description
This is an updated version of Changes/Optimized_Binaries_for_the_AMD64_Architecture.
Fedora binaries for the AMD64 / x86_64 architecture are compiled with code-generation flags that support almost all CPU variants. But newer generations of processors gained additional instructions that may be used to generate faster code. A vendor-independent x86-64 psABI supplement defines four "microarchitecture levels": x86-64-v1 (the baseline, our code targets this), x86-64-v2 (+SSE3, CentOS targets this), x86-64-v3 (+AVX), x86-64-v4 (+AVX512) [1]. Code compiled for a higher microarchitecture level will crash (with SIGILL, "illegal instruction") on CPUs which do not support that level. Benchmark results show small differences in performance: usually in the range from -5% to 10%, with no discernible difference for most code, but some applications benefit, with gains of 120% in some benchmarks [e.g. 2, 4].
Over the years, various people have expressed interest in raising the required microarchitecture levels. But we have been very conservative in making changes, because support is missing in many older CPUs that are still in use, and in fact, even in some CPUs produced and sold today. By raising the required level we would make Fedora completely unusable on many machines. It also seems that recompiling all packages with the changed options would largely be a waste of resources, because for most code it makes no difference. But for some of the numerical or cryptographic code there are noticeable gains and it seems to be worth the effort to provide optimized code. This also makes Fedora more attractive to people interested in optimization.
The dynamic linker already has the glibc-hwcaps mechanism to load optimized implementations of shared objects [3]. This means that packages can provide optimized libraries and the linker will automatically load them from separate directories if appropriate. (For AMD64, this is /usr/lib64/glibc-hwcaps/x86-64-v{2,3,4}/.)
This Change is about extending the glibc-hwcaps mechanism to executables. A small helper binary is provided. A program in /usr/bin (or another path) is symlinked to this helper. When executed, the helper checks the capabilities of the CPU and searches for the most appropriate variant of the target program in a separate directory hierarchy. It then launches one of the optimized binaries or the "generic" one compiled for the baseline.
This means that individual packages "opt in", by moving their binary to the alternative directory hierarchy and replacing it by a symlink, and also providing one or more optimized variants.
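The selection step described above can be sketched as follows. This is an illustration only, not the actual helper: the directory layout, the VARIANT_ROOT path, the "generic" directory name, and all function names are hypothetical, and the CPU level is passed in rather than detected.

```python
import os
import sys

# Hypothetical layout, for illustration only:
#   <root>/x86-64-v4/<name>, <root>/x86-64-v3/<name>, <root>/x86-64-v2/<name>
#   <root>/generic/<name>   (baseline build)
VARIANT_ROOT = "/usr/lib/hwcaps-example"  # hypothetical path, not a real Fedora directory


def candidate_paths(root, name, cpu_level):
    """List variant paths from the best level the CPU supports down to the baseline."""
    paths = [os.path.join(root, f"x86-64-v{n}", name) for n in range(cpu_level, 1, -1)]
    paths.append(os.path.join(root, "generic", name))
    return paths


def pick_variant(root, name, cpu_level):
    """Return the first candidate that exists and is executable, or None."""
    for path in candidate_paths(root, name, cpu_level):
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def main(argv, cpu_level):
    # The helper is reached through a symlink, so argv[0] names the target program.
    name = os.path.basename(argv[0])
    target = pick_variant(VARIANT_ROOT, name, cpu_level)
    if target is None:
        sys.exit(f"{name}: no usable variant found under {VARIANT_ROOT}")
    # Replace the helper process with the chosen binary, keeping argv intact.
    os.execv(target, argv)
```

Searching from the highest supported level downwards means a package may ship only some levels (for example just x86-64-v3) and every CPU still ends up with a binary it can run.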
Note: the ELF format provides the IFUNC mechanism to dynamically select a variant of a function (symbol) when an executable is loaded [5]. This is in particular used to load code using specific CPU instructions when those are supported. This mechanism is more general (because it allows arbitrary selection criteria), more fine-grained (because there can be other variants than just a few fixed microarchitecture levels), and more efficient (because only the parts of the code that benefit from this need to be provided in multiple variants). In particular, glibc already makes extensive use of this to provide optimized code, which is then widely used by other libraries and programs. This means that even though we compile code so that the lowest baseline is supported, modern CPU instructions are already widely used. This is one of the reasons why compiling for a higher baseline often doesn't make any difference in benchmarks. The IFUNC mechanism or an equivalent mechanism should generally be preferred. Nevertheless, that needs to be implemented in the program or library itself, which is not trivial. The mechanism in this Proposal is intended for code which does not use IFUNCs or some other similar mechanism.
[1] https://hackweek.opensuse.org/all/projects/support-glibc-hwcaps-and-micro-architecture-package-generation
[2] https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0002-march.rst
[3] https://sourceware.org/pipermail/libc-alpha/2021-February/122207.html
[4] https://blog.centos.org/2023/08/centos-isa-sig-performance-investigation/
[5] https://jasoncc.github.io/gnu_gcc_glibc/gnu-ifunc.html
Glibc-hwcaps together with the new helper provide a generic mechanism. It will be up to individual packages to actually provide code which makes use of it. Individual package maintainers are encouraged to benchmark their packages after recompilation, and provide the optimized variants if useful. (I.e. the code in question is measurably faster and the program is run often enough for this to make a difference.)
The Change Owners will implement the packaging changes for a few packages while developing the general mechanism and will submit those as pull requests. Other maintainers are asked to do the same for their packages if desired.
Optimized variants of programs and libraries MAY be packaged in a separate subpackage. The general packaging rules should be applied, i.e. a separate package or packages SHOULD be created if the files are large enough.
Available benchmark results [2,4] are narrow and not very convincing. We should plan an evaluation of results after one release. If it turns out that the real gains are too small, we can scrap the effort. On the other hand, we should also consider other architectures, for example microarchitecture levels z{14,15} for s390x or power{9,10} for ppc64le. Other architectures are not included in this Change Proposal to reduce its scope.
Feedback
Benefit to Fedora
The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.
Note that other distributions use higher microarchitecture levels. For example, RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 uses x86-64-v3, and other distros provide optimized variants (openSUSE, Arch Linux, Ubuntu). We implement the same change in Fedora in a way that is scoped more narrowly, and thus vastly cheaper in terms of development effort, code compilation time, storage and distribution overhead, but should provide the same performance and energy benefits.
Scope
- Proposal owners:
- Package hwcaps-loader.
- Find some example packages to convert (the code must do "number crunching" or string processing, and must not already use IFUNCs or glibc-hwcaps or some other mechanism).
- Convert a few packages and submit the changes as pull requests.
- Submit a draft change to Packaging Guidelines
- Do benchmarks.
- Other developers:
- Consider converting some additional packages.
- Review and merge the Packaging Guidelines change
- Release engineering: #Releng issue number
- Policies and guidelines: N/A (not needed for this Change)
- Trademark approval: N/A (not needed for this Change)
- Alignment with the Fedora Strategy:
Upgrade/compatibility impact
Early Testing (Optional)
Do you require 'QA Blueprint' support? N
How To Test
- Install one of the converted packages
- Run the program. If the hardware supports the optimized variant, verify that it was run. If the hardware does not support any of the optimized variants, verify that the baseline version was executed.
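One possible way to do the verification step on Linux is to look at /proc/<pid>/exe while the program is running. The sketch below rests on two assumptions that this proposal does not state: that the helper replaces itself with the chosen variant via exec (so the link points at the variant rather than the helper), and that the variant's path contains a directory named after its level. Both function names are illustrative.

```python
import os


def running_binary(pid):
    """Resolve the executable a live process is currently running (Linux /proc only)."""
    return os.readlink(f"/proc/{pid}/exe")


def variant_label(path):
    """Guess the variant from a path: the first 'x86-64-vN' component, else 'baseline'."""
    prefix = "x86-64-v"
    for part in path.split(os.sep):
        if part.startswith(prefix) and part[len(prefix):].isdigit():
            return part
    return "baseline"
```

For a long-running converted program with process ID `pid`, `variant_label(running_binary(pid))` would then report which build is in use. Short-lived programs may exit before the link can be read, so this check suits daemons and interactive tools better than one-shot commands.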
User Experience
The change should be invisible to users, except that some programs may execute more quickly.
Dependencies
Contingency Plan
- Contingency mechanism: Revert changes in individual packages. This can be either by the maintainers of those packages or by the Change Owners using provenpackager privileges.
- Contingency deadline: any time really. The changes are independent between packages, so we can trivially convert and unconvert individual programs even after release.
- Blocks release? No
Documentation
N/A (not a System Wide Change)