From Fedora Project Wiki

Revision as of 14:14, 30 June 2022 by Daandemeyer (Reference https://hal.inria.fr/hal-02297690/document in relation to DWARF unwinding)

Add -fno-omit-frame-pointer to default compilation flags

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

Summary

Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.

Owner

Current status

  • devel thread
  • FESCo issue: #2817
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed Description

Credit goes to Mirek Klimos (myreggg@gmail.com), whose internal note on stack trace unwinding formed the basis for this change proposal.

Any performance or efficiency work relies on accurate profiling data. Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
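As a rough illustration of how sampled stack traces turn into cost estimates, the sketch below counts how often each function appears in a set of samples. The sample data and the function name inclusive_cost are hypothetical and not taken from any real profiler.

```python
from collections import Counter

def inclusive_cost(samples):
    """Return, per function, the fraction of samples whose stack contains it."""
    counts = Counter()
    for stack in samples:
        # Count a function once per sample, even if it appears several times
        # on the same stack (recursion).
        counts.update(set(stack))
    total = len(samples)
    return {name: hits / total for name, hits in counts.items()}

# Hypothetical samples, outermost frame first.
samples = [
    ("_start", "main", "parse"),
    ("_start", "main", "compress", "memcpy"),
    ("_start", "main", "compress"),
    ("_start", "main", "compress", "memcpy"),
]
cost = inclusive_cost(samples)
```

With these four samples, compress is on the stack in three of them, so its estimated inclusive cost is 0.75. The estimate is only as good as the stacks: a truncated stack silently drops cost from the callers that were cut off.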

A stack trace captures what is running on a thread. It should start with clone, if the thread was created via the clone syscall, or with _start, if it is the main thread of the process. The last function in the stack trace is the code that the CPU is currently executing. If a stack starts with [unknown] or any other symbol, the stack trace is incomplete.
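The completeness rule above can be expressed as a small check. This is a minimal sketch that assumes stacks are given as lists of symbol names, outermost frame first; the set of root symbols is taken directly from the text and may need extending on a real system.

```python
# Root symbols named in the text; treat this set as an assumption.
ROOT_SYMBOLS = {"_start", "clone"}

def is_complete(stack):
    """Return True if the outermost frame is a known thread or process entry point."""
    return bool(stack) and stack[0] in ROOT_SYMBOLS
```

For example, is_complete(["_start", "main", "compress"]) is true, while a stack beginning with "[unknown]" is reported as incomplete.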

Unwinding

How does the profiler get the list of function names? There are two parts to it:

  1. Unwinding the stack - getting a list of virtual addresses pointing to the executable code
  2. Symbolization - translating virtual addresses into human-readable information, like function name, inlined functions at the address, or file name and line number.
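To make the second step concrete, the sketch below resolves an address to a function name by binary search over a sorted symbol table. The table contents and addresses are hypothetical. Real symbolizers also use symbol sizes and debug information to report inlined functions and line numbers, which this sketch ignores.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "_start"), (0x1040, "main"), (0x10C0, "compress")]
STARTS = [start for start, _ in SYMBOLS]

def symbolize(addr):
    """Map an address to 'function+offset' using the nearest preceding symbol."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return "[unknown]"  # address lies below every known symbol
    start, name = SYMBOLS[i]
    return f"{name}+{addr - start:#x}"
```

Here symbolize(0x1044) yields "main+0x4". Because the table has no symbol sizes, any address above the last symbol is attributed to that symbol, which is a simplification.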

Unwinding is what we're interested in for the purpose of this proposal. The important things are:

  • Data on stack is split into frames, each frame belonging to one function.
  • Right before each function call, the return address is put on the stack. This is the instruction address in the caller to which we will eventually return — and that's what we care about.
  • One register, called the "frame pointer" or "base pointer" register (RBP), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.

The “frame pointer” part is achieved by adding push %rbp and mov %rsp,%rbp to the beginning of every function and pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:

https://i.imgur.com/P6pFdPD.png
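The linked-list traversal can be sketched in a few lines. The example below simulates stack memory as a dictionary from address to 8-byte value and assumes the usual x86-64 layout: the saved RBP of the caller sits at the address held in RBP, and the return address sits 8 bytes above it. All addresses are made up, and a saved RBP of zero is assumed to mark the outermost frame.

```python
def unwind(memory, rip, rbp, max_frames=64):
    """Collect code addresses by following the chain of saved frame pointers."""
    addrs = [rip]  # innermost frame: the instruction currently executing
    while rbp != 0 and len(addrs) < max_frames:
        if rbp not in memory or rbp + 8 not in memory:
            break  # unreadable frame: stop instead of guessing
        addrs.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]              # follow the link to the caller's frame
    return addrs

# Simulated stack for _start -> main -> compress (hypothetical addresses).
memory = {
    0x7FF0: 0x0,     # main's frame: saved RBP, zero marks the outermost frame
    0x7FF8: 0x1010,  # return address into _start
    0x7FD0: 0x7FF0,  # compress's frame: saved RBP pointing at main's frame
    0x7FD8: 0x1060,  # return address into main
}
trace = unwind(memory, rip=0x10D0, rbp=0x7FD0)
```

The resulting trace lists addresses from the innermost frame outwards: 0x10D0 (in compress), 0x1060 (in main), 0x1010 (in _start). Each step costs two memory reads, which is why this method is cheap enough to run on every sample.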

Where’s the catch?

The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:

  • We don’t need to back up the value of the register onto the stack, which saves 3 instructions per function.
  • We can treat the RBP as a general-purpose register and use it for something else.

Whether the compiler sets up the frame pointer is controlled by the -fomit-frame-pointer flag. With optimizations enabled the default is "omit", meaning we can’t rely on this method of stack unwinding by default.

To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables.

Feedback

Potential performance impact

  • Meta builds all its libraries and executables with -fno-omit-frame-pointer by default. In internal benchmarks of two of our most performance-intensive applications, omitting the frame pointer made no significant difference to performance.
  • From https://hal.inria.fr/hal-02297690/document, a paper on DWARF unwinding, we find that Google also compiles all its internal critical software with frame pointers to ensure fast and reliable backtraces.
  • Firefox recently landed a change to preserve the frame pointer in all jitted code (https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant decrease in performance was observed.
  • Given that the kernel on Fedora already uses the ORC debuginfo format and this works well, we'll keep compiling the kernel without frame pointers, since there are no profiling or debugging benefits to be gained by compiling the kernel with frame pointers. This prevents any regressions in kernel performance such as those reported by https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u.

Should individual libraries or executables show a significant performance degradation caused by including the frame pointer everywhere, these packages can opt out on an individual basis as described in https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.

Alternatives to frame pointers

There are a few alternative ways to unwind stacks instead of using the frame pointer:

  • DWARF data - The compiler can emit extra information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that the stack has to be unwound in kernel space, and the kernel does not implement a DWARF unwinder. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but it then copies a large portion of the stack on every event and unwinds in user space, which has very high overhead. For more details on why DWARF unwinding is slow, please see https://hal.inria.fr/hal-02297690/document which contains detailed information on the problems with DWARF unwinding.
  • ORC (undwarf) - Problems with DWARF unwinding in the kernel led to the creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found here.
  • LBR - New Intel CPUs have a feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on a Fedora system by default.
  • CTF Frame - An in-progress RFC will add support to binutils to attach a new ctf_frame section to ELF binaries containing unwinding information. This new unwinding format claims to be more compact than eh_frame, faster to unwind, and simpler to write an unwinder for. Should this format be accepted into binutils and should the kernel merge a CTF unwinder in the future, we could start building applications with CTF frame unwind information which could then be used in the kernel for unwinding userspace stacks instead of frame pointers. Unfortunately, CTF Frame is still a work-in-progress and won't be available for some time (if at all).
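The truncation problem described for LBR can be illustrated with a bounded buffer. The sketch below is a simplified software model, not a description of how the hardware is programmed: calls push an entry, returns pop one, and once the buffer is full the oldest entry is dropped. The event format and the function name lbr_call_stack are hypothetical.

```python
from collections import deque

def lbr_call_stack(events, depth=16):
    """Model a fixed-depth call-stack record fed by call and return events."""
    lbr = deque(maxlen=depth)
    for kind, name in events:
        if kind == "call":
            lbr.append(name)  # when full, the oldest (outermost) entry is dropped
        elif kind == "ret" and lbr:
            lbr.pop()
    return list(lbr)

# 20 nested calls f0 -> f1 -> ... -> f19 with room for only 16 entries.
events = [("call", f"f{i}") for i in range(20)]
recorded = lbr_call_stack(events, depth=16)
```

In this model the recorded stack starts at f4: the four outermost callers are lost, so the trace no longer reaches the entry point of the thread.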

To summarize, if we want complete stacks with reasonably low overhead (which we do, since there is no other way to get accurate profiling data from running services), frame pointers are currently the best option.

Benefit to Fedora

Implementing this change will provide profiling tools with easy access to stacktraces of installed libraries and executables which will lead to more accurate profiling data in general. This in turn can be used to implement optimizations to core libraries and executables which will improve the overall performance of Fedora itself and the wider Linux ecosystem.

Various debugging tools can also make use of the frame pointer to access the current stacktrace, although tools like gdb can already do this to some degree via embedded DWARF debugging info.

Scope

  • Proposal owners: Put up a PR to change the rpm macros to build packages with -fno-omit-frame-pointer by default.
  • Other developers: Review and merge the PR implementing the Change.
  • Policies and guidelines: N/A (not needed for this Change)
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with Objectives: N/A

Upgrade/compatibility impact

This should not impact upgrades in any way.

How To Test

  1. Build the package with the updated rpm macros
  2. Profile the binary with perf record -g <binary>
  3. Inspect the perf data with perf report -g 'graph,0.5,caller'
  4. When expanding hot functions in the perf report, perf should show the full call graph of the hot function (at least for all functions that are part of the binary compiled with -fno-omit-frame-pointer)

User Experience

Fedora users will be more likely to have a streamlined experience when trying to debug or profile system executables and libraries. Tools such as perf will work out of the box instead of requiring users to provide extra options (e.g. --call-graph=dwarf or --call-graph=lbr) or to recompile all relevant packages with -fno-omit-frame-pointer.

Dependencies

The rpm macros for Fedora need to be adjusted to include -fno-omit-frame-pointer in the default C/C++ compilation flags, and exclusions need to be added for performance-sensitive packages that don't benefit from being compiled with frame pointers.

The current list of packages that need to be excluded from this proposal is:

  • Any kernel packages

Contingency Plan

  • Contingency mechanism: The new version can be released without every package being rebuilt with -fno-omit-frame-pointer. Profiling will only work perfectly once all packages have been rebuilt, but there will be no regression in behavior if not all packages have been rebuilt by the time of the release. If the Change is found to introduce unacceptable regressions, the PR implementing it can be reverted and affected packages can be rebuilt.
  • Contingency deadline: Final freeze
  • Blocks release? No

Documentation

Release Notes

Packages are now compiled with frame pointers included by default. This will enable a variety of profiling and debugging tools to show more information out of the box.