From Fedora Project Wiki
Revision as of 20:24, 5 January 2023

Add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to default compilation flags

Summary

Fedora will add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.

This Change will be implemented for Fedora Linux 38, and the Change authors and FESCo will evaluate whether to retain it by Fedora Linux 40. It will be implemented via a %_include_frame_pointers macro so that packages can trivially opt out of retaining frame pointers during compilation if needed. The Change owners kindly request that packagers track opt-outs in Bugzilla by filing bugs that block our tracking bug, so that regressions can be appropriately investigated (and hopefully resolved).

Owner

Current status

Detailed Description

Why perform full system profiling in production?

Credits to Mirek Klimos, whose internal note on stacktrace unwinding formed the basis for this description (myreggg@gmail.com).

Generally, when implementing optimizations after receiving a report on a performance issue, there are three hurdles a developer must overcome:

  • They have to recompile their program with sufficient debugging information to enable accurate and reliable profiling. Frame pointers are an example of such information.
  • They have to reproduce the scenario under which the software performed poorly.
  • They have to gather the necessary profiling data by running the recompiled program in the reproduced scenario.

After gathering the profiling data, the developer can use that data to guide possible optimizations. Usually, this ends up being an iterative process, where a possible optimization is implemented, and the scenario is rerun with the recompiled program to measure the effects on performance.

When dealing with a single program without dependencies, recompiling the software, reproducing the scenario and gathering the profiling data might not be terribly hard to achieve. However, when dealing with a large program with many dependencies, either in the form of shared libraries or via IPC, recompiling all of these dependencies with debugging information, reproducing the exact scenario under which the performance issue occurs, and gathering all the profiling data from all the dependencies becomes a complicated exercise.

An interesting approach to avoid the above hurdles is to make sure we can profile the entire system directly in production. With this approach we don't have to recompile our software, we don't need to reproduce the scenario under which the software performs poorly, and we get a single unified way to gather profiling data for all the applications we're interested in. Naturally, this approach depends on being able to profile the entire system efficiently so that there's no noticeable impact on any running services.

Another requirement (unrelated to this proposal, but interesting nonetheless) is that we need logic to only enable profiling when it's interesting to do so. There are a few different options:

  • On demand profiling: Only start profiling when we receive an explicit request to do so.
  • Interval based continuous profiling: Profile for a specific amount of time every X seconds/minutes/hours/...
  • Trigger based profiling: Start profiling based on some predefined conditions, such as high CPU or memory usage.

If we agree that being able to do full system profiling in production is useful, the next section explains why we need frame pointers in all software running on the system to be able to do effective full system profiling.

How to do full system profiling

Probably the most prominent way to do full system profiling on Linux with low overhead is by using the perf sampling profiler. Sampling profilers like perf operate by "statistical profiling" (sampling). They take a sample every N events e.g. "cpu-cycles" to understand the statistical breakdown of time spent in functions or function callstacks executing on the CPU. perf has an accompanying Linux subsystem that allows it to take samples every N events with very low overhead. A perf sample can include all kinds of information, but for profiling, what we're typically interested in is the call stack of the programs that are currently executing.
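To make the "statistical breakdown" idea concrete, here is a minimal Python sketch (not perf's actual implementation) that turns a set of sampled call stacks into per-function sample counts. The sample data, the function names, and the summarize_samples() helper are all made up for illustration.

```python
from collections import Counter

def summarize_samples(samples):
    """Return (self_counts, total_counts) for a list of sampled call stacks.

    Each sample is a list of function names ordered from the outermost
    caller to the function that was executing when the sample was taken.
    """
    self_counts = Counter()   # samples where the function itself was on-CPU
    total_counts = Counter()  # samples where the function was anywhere on the stack
    for stack in samples:
        if not stack:
            continue
        self_counts[stack[-1]] += 1
        for func in set(stack):  # count a recursive function once per sample
            total_counts[func] += 1
    return self_counts, total_counts

# Hypothetical samples, e.g. one taken every N cpu-cycles events.
samples = [
    ["main", "parse", "read_token"],
    ["main", "parse", "read_token"],
    ["main", "parse"],
    ["main", "render"],
]
self_counts, total_counts = summarize_samples(samples)
for func, count in total_counts.most_common():
    print(f"{func}: {100 * count / len(samples):.0f}% total, "
          f"{100 * self_counts[func] / len(samples):.0f}% self")
```

The more samples are collected, the closer these percentages get to the real distribution of CPU time, which is why a cheap way to capture each call stack matters.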

To record samples for specific hardware/software events using the perf subsystem on Linux, developers can use the perf_event_open() system call. To have the recorded samples include the call stack, the PERF_SAMPLE_CALLCHAIN flag can be set in the sample_type field of the perf_event_attr struct passed to perf_event_open(). For userspace stacks, the call stack can only be sampled in kernelspace if the userspace program and its dependencies are built with frame pointers. If frame pointers are not available, PERF_SAMPLE_STACK_USER can be used to sample the entire stack instead, allowing for unwinding in userspace using e.g. DWARF debugging info.

The perf subsystem also has support for attaching BPF programs to a perf event fd. The program will be called every time an event is sampled and is provided the perf event data. This can be used to attach arbitrary logic to perf sampling, and makes it possible to implement custom logic on top of perf's sampling without having to leave kernelspace. To get access to the userspace stack from BPF, bpf provides the bpf_get_stackid() helper function. Similar to PERF_SAMPLE_CALLCHAIN, this function depends on the userspace program and its dependencies to have been compiled with frame pointers for it to be able to traverse the call stack.

To get accurate profiling results, we want to be able to sample events at a relatively high sampling rate. This means that we want to do the minimal amount of work every time we sample an event to avoid overhead. Traversing a stack using frame pointers is cheap, since we only have to traverse the frame pointers until we reach the top of the stack. In comparison, to unwind using DWARF, we first have to copy the full stack from kernelspace to userspace, and then unwind the stack using DWARF debugging info, which is relatively slow (see https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf). Because of this, to make full system profiling using perf work effectively, it's imperative that all software running on the system is compiled with frame pointers so that the call stack can be unwound in kernelspace for minimum overall overhead.
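A back-of-the-envelope calculation shows how different the two approaches are in the amount of data recorded per sample. The Python sketch below is only an estimate under stated assumptions: the sampling rate, CPU count, and stack depth are invented, and the 8192-byte figure is the default stack dump size documented for perf record --call-graph dwarf. It counts data volume only and ignores the CPU time spent on DWARF unwinding afterwards.

```python
FRAME_ENTRY_BYTES = 8          # one 64-bit return address per frame
DWARF_STACK_DUMP_BYTES = 8192  # perf's documented default dump size for --call-graph dwarf

def sample_bytes_per_second(sample_hz, cpus, bytes_per_sample):
    """Payload produced per second when every CPU is sampled at sample_hz."""
    return sample_hz * cpus * bytes_per_sample

# Assumed workload: 1000 samples/s on each of 8 CPUs, call stacks 32 frames deep.
fp_rate = sample_bytes_per_second(1000, 8, 32 * FRAME_ENTRY_BYTES)
dwarf_rate = sample_bytes_per_second(1000, 8, DWARF_STACK_DUMP_BYTES)
print(f"frame pointers: {fp_rate / 1e6:.1f} MB/s")
print(f"stack copies:   {dwarf_rate / 1e6:.1f} MB/s ({dwarf_rate // fp_rate}x)")
```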

Frame pointers for debugging and tracing with BPF

The above profiling example was just one use case where we benefit from having access to the frame pointer in BPF. Since frame pointers enable BPF to unwind every userspace stack, we can get an accurate callstack from every BPF program we can think of. This makes certain kinds of debugging much easier, especially tools where we want to investigate who is calling a specific function.

A good example is the ustack() function in bpftrace. bpftrace is a high level tracing language for BPF that can easily hook into system calls, function calls, kernel tracepoints, and more. And since frame pointers guarantee that bpftrace's ustack() helper function works all the time, we're able to log the full callstack from every bpftrace script.

As another example, the bcc tools directory has around ten BPF tools with a -U option to print the userspace stack when tracing some event, such as tracing calls to cap_capable() for security capability checks, or just tracing slow function calls in general with the funcslower script. All these tools can only work reliably when all software running on the system is built with frame pointers.

All the above tooling enables in-depth debugging of applications without needing to modify the source code of the applications themselves. BPF can be used to attach to function calls, kernel tracepoints, kernel functions, and system calls, and all of this is presented in an easy to use fashion via bcc and bpftrace.

To summarize, BPF tooling that works with or benefits from stack trace information will work much more reliably when all software is built with frame pointers. As a result, implementing this change proposal will make the BPF tracing ecosystem of tools much more useful on Fedora in general, whereas many of the current BPF tools are hamstrung by the lack of frame pointers.

Unwinding

How does the profiler get the list of function names? There are two parts to it:

  1. Unwinding the stack - getting a list of virtual addresses pointing to the executable code
  2. Symbolization - translating virtual addresses into human-readable information, like function name, inlined functions at the address, or file name and line number.
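As a rough illustration of step 2, the following Python sketch resolves an address against a sorted symbol table using a binary search. The table, its addresses, and the symbolize() helper are hypothetical; real symbolizers read this information from ELF symbol tables and DWARF debug info, and also handle inlined functions and line numbers.

```python
import bisect

# Hypothetical symbol table: (start address, size, name), sorted by start address.
SYMBOLS = [
    (0x401000, 0x40, "main"),
    (0x401040, 0x80, "parse"),
    (0x4010C0, 0x30, "read_token"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def symbolize(addr):
    """Map a virtual address to 'name+offset', or to plain hex if no symbol covers it."""
    # Find the last symbol whose start address is <= addr.
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if addr < start + size:
            return f"{name}+{addr - start:#x}"
    return f"{addr:#x}"

print(symbolize(0x401052))
```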

Unwinding is what we're interested in for the purpose of this proposal. The important things are:

  • Data on stack is split into frames, each frame belonging to one function.
  • Right before each function call, the return address is put on the stack. This is the instruction address in the caller to which we will eventually return — and that's what we care about.
  • One register, called the "frame pointer" or "base pointer" register (RBP), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.

The “frame pointer” part is achieved by adding push %rbp, mov %rsp,%rbp to the beginning of every function and by adding pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:

https://i.imgur.com/P6pFdPD.png
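The traversal can be sketched in a few lines of Python. This is a simplified model, not the kernel's unwinder: the stack is represented as a dictionary from address to stored value, all addresses are invented, and it assumes the outermost frame's saved RBP is zero to mark the end of the chain.

```python
def unwind(memory, rip, rbp, max_frames=128):
    """Collect code addresses by following the chain of saved RBP values.

    memory maps a stack address to the 8-byte value stored there.
    With frame pointers, [rbp] holds the caller's saved RBP and
    [rbp + 8] holds the return address into the caller.
    """
    addresses = [rip]  # the sampled instruction pointer is the innermost frame
    while rbp != 0 and len(addresses) < max_frames:
        return_address = memory.get(rbp + 8)
        saved_rbp = memory.get(rbp)
        if return_address is None or saved_rbp is None:
            break  # unreadable memory: stop instead of guessing
        addresses.append(return_address)
        if saved_rbp != 0 and saved_rbp <= rbp:
            break  # the stack grows down, so a caller's frame must sit higher
        rbp = saved_rbp
    return addresses

# Hypothetical stack for main -> parse -> read_token.
memory = {
    0x7FFD00001000: 0x7FFD00001040,  # read_token frame: saved RBP of parse
    0x7FFD00001008: 0x401052,        # return address into parse
    0x7FFD00001040: 0x7FFD00001080,  # parse frame: saved RBP of main
    0x7FFD00001048: 0x401010,        # return address into main
    0x7FFD00001080: 0,               # main frame: end of chain (assumption)
    0x7FFD00001088: 0x400F00,        # return address into startup code
}
stack = unwind(memory, rip=0x4010D0, rbp=0x7FFD00001000)
print([hex(addr) for addr in stack])
```

Each step costs two memory reads, which is why this kind of unwinding is cheap enough to run on every sample.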

Where’s the catch?

The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:

  • We don’t need to back up the value of the register onto the stack, which saves 3 instructions per function.
  • We can treat the RBP as a general-purpose register and use it for something else.

Whether the compiler sets up the frame pointer or not is controlled by the -fomit-frame-pointer flag, and the default is "omit", meaning we can’t use this method of stack unwinding by default.

To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables. It'll also make stacktraces from all BPF tooling more reliable as they'll be guaranteed to have access to a reliable stacktrace via frame pointers.

Feedback

Potential performance impact

  • Meta builds all its libraries and executables with -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer by default. Internal benchmarks of two of our most performance-intensive applications did not show a significant performance difference when the frame pointer was omitted.
  • From https://hal.inria.fr/hal-02297690/document, a paper on DWARF unwinding, we find that Google also compiles all its internal critical software with frame pointers to ensure fast and reliable backtraces.
  • Given that the kernel on Fedora already uses the ORC debuginfo format and this works well, we'll keep compiling the kernel without frame pointers, since there are no benefits for profiling or debugging to be gained by compiling the kernel with frame pointers. This prevents any regressions in kernel performance such as those reported by https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u.
  • Brendan Gregg from Netflix advocates making -fno-omit-frame-pointer the default in GCC (https://www.brendangregg.com/Slides/SCALE2015_Linux_perf_profiling.pdf)

Should individual libraries or executables notice a significant performance degradation caused by including the frame pointer everywhere, these packages can opt-out on an individual basis as described in https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.

Benchmarking of the performance impact

To verify the performance impact of the proposed change, we compared a number of benchmarks on a Fedora 37 system where every package is built with frame pointers against the same benchmarks on a regular Fedora 37 system. The source code for the benchmarks can be found in the fpbench repository on GitHub. We used copr to build the packages required to run the benchmarks (and their dependencies) with frame pointers. Then, using mkosi, we built one Fedora 37 container with frame pointers and one Fedora 37 container without frame pointers and ran various benchmarks inside these containers.

The results of the benchmarks can be found in the readme of the fpbench repository as we ran into formatting issues trying to add the results to the wiki.

Summarizing the results:

  • Compiling the kernel with GCC is 2.4% slower with frame pointers
  • Running Blender to render a frame is 2% slower on our specific testcase
  • openssl/botan/zstd do not seem to be affected significantly when built with frame pointers
  • The impact on CPython benchmarks can be anywhere from 1-10% depending on the specific benchmark
  • Redis benchmarks do not seem to be significantly impacted when built with frame pointers

Aside from the pyperformance benchmarks, the impact of building with frame pointers is limited on the benchmarks we performed. Our findings on the impact on the Python benchmarks when CPython is built with frame pointers are discussed in this comment (https://pagure.io/fesco/issue/2817#comment-826636).

Alternatives to frame pointers

There are a few alternative ways to unwind stacks instead of using the frame pointer:

  • DWARF data - The compiler can emit extra information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that we need to unwind the stack in kernel space, which isn't implemented in the kernel. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but this means that it copies the full stack on every event and unwinds in user space. This has very high overhead. For more details on why DWARF unwinding is slow, please see https://hal.inria.fr/hal-02297690/document, which contains detailed information on the problems with DWARF unwinding.
  • ORC (undwarf) - Problems with unwinding in the kernel led to the creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found here.
  • LBR - New Intel CPUs have a feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on a Fedora system by default.
  • CTF Frame - An in progress RFC will add support to binutils to attach a new ctf_frame section to ELF binaries containing unwinding information. This new unwinding format claims to be more compact than eh_frame, faster to unwind, and simpler to implement an unwinder with. Should this format be accepted into binutils and should the kernel merge a CTF unwinder in the future, we could start building applications with CTF frame unwind information which could then be used in the kernel for unwinding userspace stacks instead of frame pointers. Unfortunately, CTF Frame is still a work-in-progress and won't be available for some time (if at all).
  • Shadow Stacks - Shadow stacks are a hardware feature found on new Intel and AMD CPUs that improve security by copying return address information to a separate read-only shadow stack so that it's possible to verify the return address on the original stack wasn't modified by, for example, a buffer overflow. This information could potentially be used to unwind the stack. However, it's very early days for shadow stacks: they're only supported on very new CPU models, there's no kernel support yet, and it's not completely certain that we'll be able to use this information for unwinding. As such, it's not a viable option for unwinding at this time, but it might become one at some point in the future.
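To illustrate why the LBR option in the list above yields incomplete stacks, here is a simplified Python model of a fixed-depth call-stack record. It is a sketch of the limitation only, not of the actual hardware behavior; the lbr_call_stack() helper and the event format are invented for this example.

```python
from collections import deque

def lbr_call_stack(call_events, depth=16):
    """Simulate a call-stack record that keeps at most `depth` entries.

    call_events is a sequence of ("call", name) and ("ret", name) tuples.
    """
    lbr = deque(maxlen=depth)
    for kind, func in call_events:
        if kind == "call":
            lbr.append(func)  # the oldest entry is silently dropped when full
        elif kind == "ret" and lbr:
            lbr.pop()
    return list(lbr)

# 20 nested calls: the 4 outermost callers no longer fit in a 16-entry record.
events = [("call", f"f{i}") for i in range(20)]
visible = lbr_call_stack(events)
print(len(visible), visible[0], visible[-1])
```

With frame pointers, the same 20-deep stack would be recovered in full, because the chain of saved frame pointers lives on the stack itself rather than in a fixed-size hardware buffer.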

To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.

Benefit to Fedora

Implementing this change will provide profiling tools with easy access to stacktraces of installed libraries and executables which will lead to more accurate profiling data in general. This in turn can be used to implement optimizations to core libraries and executables which will improve the overall performance of Fedora itself and the wider Linux ecosystem.

Various debugging tools can also make use of the frame pointer to access the current stacktrace, although tools like gdb can already do this to some degree via embedded DWARF debugging info.

Scope

  • Proposal owners: Put up a PR to change the rpm macros to build packages with -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer by default.
  • Other developers: Review and merge the PR implementing the Change.
  • Policies and guidelines: N/A (not needed for this Change)
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with Objectives: N/A

Upgrade/compatibility impact

This should not impact upgrades in any way.

How To Test

  1. Build the package with the updated rpm macros
  2. Profile the binary with perf record -g <binary>
  3. Inspect the perf data with perf report -g 'graph,0.5,caller'
  4. When expanding hot functions in the perf report, perf should show the full call graph of the hot function (at least for all functions that are part of the binary compiled with -fno-omit-frame-pointer)

User Experience

Fedora users will be more likely to have a streamlined experience when trying to debug/profile system executables/libraries. Tools such as perf will work out of the box instead of requiring users to provide extra options (e.g. --call-graph=dwarf/LBR) or to recompile all relevant packages with frame pointers.

Dependencies

The rpm macros for Fedora need to be adjusted to include -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer in the default C/C++ compilation flags, and exclusions need to be added for performance-sensitive packages that don't benefit from being compiled with frame pointers.

The current list of packages that need to be excluded from this proposal is:

  • Any kernel packages

Contingency Plan

  • Contingency mechanism: The new version can be released without every package being rebuilt with -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer. Profiling will only work perfectly once all packages have been rebuilt, but there will be no regression in behavior if not all packages have been rebuilt by the time of the release. If the Change is found to introduce unacceptable regressions, the PR implementing it can be reverted and affected packages can be rebuilt.
  • Contingency deadline: Final freeze
  • Blocks release? No

Documentation

Release Notes

Packages are now compiled with frame pointers included by default. This will enable a variety of profiling and debugging tools to show more information out of the box.