Add -fno-omit-frame-pointer to default compilation flags
Summary
Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.
Owner
- Name: Daan De Meyer, Davide Cavalca, Andrii Nakryiko
- Email: daandemeyer@fb.com, dcavalca@fb.com, andriin@fb.com
Current status
- Targeted release: Fedora Linux 37
- Last updated: 2022-06-09
- FESCo issue: <will be assigned by the Wrangler>
- Tracker bug: <will be assigned by the Wrangler>
- Release notes tracker: <will be assigned by the Wrangler>
Detailed Description
Credits to Mirek Klimos, whose internal note on stacktrace unwinding formed the basis for this change proposal (myreggg@gmail.com).
Any performance or efficiency work relies on accurate profiling data. Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
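As a rough illustration of how sampled stacks turn into cost estimates, the sketch below counts how often each function appears across a set of samples. The sample data and the helper name `inclusive_cost` are hypothetical and do not come from any real profiler.

```python
from collections import Counter


def inclusive_cost(samples):
    """Return the fraction of samples in which each function is on the stack."""
    hits = Counter()
    for stack in samples:
        # Count a function once per sample, even if it appears recursively.
        hits.update(set(stack))
    total = len(samples)
    return {func: count / total for func, count in hits.items()}


# Hypothetical samples, each listed from the outermost caller inwards.
samples = [
    ("_start", "main", "parse", "malloc"),
    ("_start", "main", "parse"),
    ("_start", "main", "render", "malloc"),
    ("_start", "main", "render", "memcpy"),
]
costs = inclusive_cost(samples)
print(costs["malloc"])  # share of samples that had malloc somewhere on the stack
```

The estimate is only as good as the stacks: if a sample is truncated, every caller above the truncation point is undercounted.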
A stack trace captures what's running on a thread. It should start with clone (if the thread was created via the clone syscall) or with _start (if it's the main thread of the process). The last function in the stack trace is the code that the CPU is currently executing. If a stack starts with [unknown] or any other symbol, it is incomplete.
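The completeness rule above can be written as a small check. This is a minimal sketch that assumes stacks are given as lists of symbol names, outermost frame first, and that only the two root symbols named in the text count as valid starts. Real tools may report variants of these names, so `ROOT_SYMBOLS` is an assumption.

```python
# Root symbols taken from the text above; real profiles may show variants.
ROOT_SYMBOLS = {"clone", "_start"}


def is_complete(stack):
    """Return True if the stack starts at a thread or process entry point."""
    return bool(stack) and stack[0] in ROOT_SYMBOLS


print(is_complete(["_start", "main", "parse"]))   # main thread, full stack
print(is_complete(["[unknown]", "parse"]))        # unwinding stopped early
```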
Unwinding
How does the profiler get the list of function names? There are two parts to it:
- Unwinding the stack - getting a list of virtual addresses pointing to the executable code
- Symbolization - translating virtual addresses into human-readable information, like function name, inlined functions at the address, or file name and line number.
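To make the symbolization step concrete, here is a minimal sketch that maps an address to the nearest preceding symbol with a binary search. The symbol table and its addresses are invented for illustration. Real symbolizers also use symbol sizes and debug information, which this sketch ignores.

```python
import bisect

# Hypothetical symbol table: (start address, name), sorted by address.
SYMBOLS = [
    (0x1000, "_start"),
    (0x1040, "main"),
    (0x1100, "parse"),
    (0x1200, "render"),
]
STARTS = [start for start, _ in SYMBOLS]


def symbolize(addr):
    """Translate a code address into 'name+offset', or '[unknown]'."""
    index = bisect.bisect_right(STARTS, addr) - 1
    if index < 0:
        return "[unknown]"  # address lies below the first known symbol
    start, name = SYMBOLS[index]
    return f"{name}+{addr - start:#x}"


print(symbolize(0x1044))  # an address a few bytes into main
```

Because symbol sizes are not modelled, any address above the last symbol is attributed to that symbol. A real implementation would report it as unknown instead.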
Unwinding is what we're interested in for the purpose of this proposal. The important things are:
- Data on the stack is split into frames, each frame belonging to one function.
- Right before each function call, the return address is put on the stack. This is the instruction address in the caller to which we will eventually return — and that's what we care about.
- One register, called the "frame pointer" or "base pointer" register (RBP), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.
The “frame pointer” part is achieved by adding push %rbp, mov %rsp,%rbp to the beginning of every function and by adding pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:
https://i.imgur.com/P6pFdPD.png
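The linked-list traversal can be sketched over a simulated stack. The example assumes the x86-64 layout described above: the word at RBP holds the caller's saved RBP and the word at RBP+8 holds the return address. The memory contents, the addresses, and the convention that a saved RBP of 0 ends the chain are assumptions made for this illustration.

```python
WORD = 8  # bytes per stack slot on x86-64


def unwind(memory, rip, rbp, max_depth=127):
    """Follow the saved-RBP chain and return code addresses, innermost first."""
    addresses = [rip]
    while rbp != 0 and len(addresses) < max_depth:
        return_address = memory[rbp + WORD]  # pushed by the call instruction
        saved_rbp = memory[rbp]              # pushed by the callee's prologue
        addresses.append(return_address)
        rbp = saved_rbp                      # step to the caller's frame
    return addresses


# Simulated stack for the hypothetical call chain _start -> main -> parse -> malloc.
memory = {
    0x7000: 0x0,    0x7008: 0x1010,  # main's frame: chain ends, returns into _start
    0x6FD0: 0x7000, 0x6FD8: 0x1050,  # parse's frame: returns into main
    0x6FA0: 0x6FD0, 0x6FA8: 0x1120,  # malloc's frame: returns into parse
}
trace = unwind(memory, rip=0x1310, rbp=0x6FA0)
print([hex(address) for address in trace])
```

If any function in the chain uses RBP as a general-purpose register, `memory[rbp]` no longer points at a caller frame and the walk produces garbage or stops early, which is why every function in the chain needs to maintain the frame pointer.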
Where’s the catch?
The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:
- We don’t need to back up the value of the register onto the stack, which saves 3 instructions per function.
- We can treat the RBP as a general-purpose register and use it for something else.
Whether the compiler sets up the frame pointer is controlled by the -fomit-frame-pointer flag. The default is "omit", meaning we can’t rely on this method of stack unwinding by default.
To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables.
Feedback
Potential performance impact
- Meta builds all its libraries and executables with -fno-omit-frame-pointer by default. Internal benchmarks on two of our most performance-intensive applications did not show a significant performance difference when omitting the frame pointer.
- Firefox recently landed a change to preserve the frame pointer in all jitted code (https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant decrease in performance was observed.
- Kernel 4.8 frame pointer benchmarks by SUSE showed 5%-10% regressions in some benchmarks (https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).
Should individual libraries or executables notice a significant performance degradation caused by including the frame pointer everywhere, these packages can opt out on an individual basis as described in https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.
Alternatives to frame pointers
There are a few alternative ways to unwind stacks instead of using the frame pointer:
- DWARF data - The compiler can emit extra information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that we need to unwind the stack in kernel space, and the kernel does not implement a DWARF unwinder. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but this means that it copies the full stack on every event and unwinds in user space. This has very high overhead.
- ORC (undwarf) - Problems with DWARF unwinding in the kernel led to the creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found in the kernel's ORC unwinder documentation.
- LBR - New Intel CPUs have a feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on a Fedora system by default.
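The LBR limitation can be illustrated with a bounded buffer. This is a simplified model rather than a description of the hardware: it assumes calls push an entry, returns pop one, and the oldest entry is silently dropped once the buffer holds `depth` entries. The function name and event format are hypothetical.

```python
from collections import deque


def lbr_call_stack(events, depth=16):
    """Replay call/return events into a fixed-size, stack-like buffer."""
    lbr = deque(maxlen=depth)
    for kind, name in events:
        if kind == "call":
            lbr.append(name)  # when full, the oldest (outermost) entry is lost
        elif kind == "ret" and lbr:
            lbr.pop()
    return list(lbr)


# A hypothetical call chain that is 40 frames deep: f0 calls f1, and so on.
events = [("call", f"f{i}") for i in range(40)]
stack = lbr_call_stack(events)
print(len(stack), stack[0], stack[-1])
```

In this model the 24 outermost callers are gone, so the resulting trace cannot be attributed to its entry point.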
To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.
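A back-of-the-envelope comparison shows where the overhead difference comes from. Every number below is an assumption chosen for illustration: an 8192-byte stack copy per sample for DWARF unwinding in user space, one 8-byte address per frame for frame-pointer unwinding, a 30-frame stack, and 1000 samples per second. Actual values depend on the tool configuration and the workload.

```python
POINTER_SIZE = 8          # bytes per recorded address on x86-64
DWARF_STACK_DUMP = 8192   # assumed bytes of raw stack copied per sample
SAMPLE_HZ = 1000          # hypothetical sampling frequency
STACK_DEPTH = 30          # hypothetical average stack depth


def bytes_per_second(bytes_per_sample, sample_hz=SAMPLE_HZ):
    """Data volume produced by one sampled thread per second."""
    return bytes_per_sample * sample_hz


fp_rate = bytes_per_second(STACK_DEPTH * POINTER_SIZE)
dwarf_rate = bytes_per_second(DWARF_STACK_DUMP)
print(fp_rate, dwarf_rate, round(dwarf_rate / fp_rate, 1))
```

Under these assumptions the DWARF approach moves over thirty times more data per sample, before counting the cost of unwinding that data in user space.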
Benefit to Fedora
Implementing this change will provide profiling tools with easy access to stacktraces of installed libraries and executables which will lead to more accurate profiling data in general. This in turn can be used to implement optimizations to core libraries and executables which will improve the overall performance of Fedora itself and the wider Linux ecosystem.
Scope
- Proposal owners: Put up a PR to change the rpm macros to build packages with -fno-omit-frame-pointer by default.
- Other developers: Review and merge the PR implementing the Change.
- Release engineering: #Releng issue number. A mass rebuild is required.
- Policies and guidelines: N/A (not needed for this Change)
- Trademark approval: N/A (not needed for this Change)
- Alignment with Objectives: N/A
Upgrade/compatibility impact
This should not impact upgrades in any way.
How To Test
- Build the package with the updated rpm macros
- Profile the binary with
perf record -g <binary>
- Inspect the perf data with
perf report -g 'graph,0.5,caller'
- When expanding hot functions in the perf report, perf should show the full call graph of the hot function (at least for all functions that are part of the binary compiled with -fno-omit-frame-pointer)
User Experience
Fedora users will be more likely to have a streamlined experience when trying to debug/profile system executables/libraries. Tools such as perf will work out of the box instead of requiring users to provide extra options (e.g. --call-graph=dwarf or --call-graph=lbr) or to recompile all relevant packages with -fno-omit-frame-pointer.
Dependencies
The rpm macros for Fedora need to be adjusted to include -fno-omit-frame-pointer in the default C/C++ compilation flags.
Contingency Plan
- Contingency mechanism: The new version can be released without every package being rebuilt with -fno-omit-frame-pointer. Profiling will only work perfectly once all packages have been rebuilt, but there will be no regression in behavior if not all packages have been rebuilt by the time of the release. If the Change is found to introduce unacceptable regressions, the PR implementing it can be reverted and affected packages can be rebuilt.
- Contingency deadline: Final freeze
- Blocks release? No
Documentation
- Original proposal for in-kernel DWARF unwinder (rejected): https://lkml.org/lkml/2017/5/5/571
Release Notes
Packages are now compiled with frame pointers included by default. This will enable a variety of profiling and debugging tools to show more information out of the box.