Enable systemd-oomd by default for all variants
Summary
Provide a better experience for Fedora users in out-of-memory (OOM) situations by enabling systemd-oomd by default. Actions taken by systemd-oomd operate at the cgroup level, which aligns well with the life cycle of systemd units. In addition to swap-based actions, systemd-oomd uses Linux pressure stall information (PSI) to make decisions based on productivity lost to resource shortages.
Owners
- Name: Anita Zhang, Davide Cavalca, Michel Salim
- Email: the.anitazha@gmail.com, dcavalca@fb.com, michel@michel-slm.name
- Products: All editions, spins, labs
Current status
- Targeted release: Fedora 34
- Last updated: 2020-12-17
- FESCo issue: tbd
- Tracker bug: tbd
- Release notes tracker: tbd
Detailed description
For swap-based actions, systemd-oomd monitors system-wide swap space and acts when available swap falls below the configured threshold, working through cgroups in order from the highest swap usage to the lowest. Keeping some amount of swap (if enabled) available prevents the kernel OOM killer from killing processes unpredictably and spending an unbounded amount of time afterwards.
For swap configuration, this will be SwapUsedLimitPercent=90% in oomd.conf and ManagedOOMSwap=kill on -.slice (root cgroup slice) to have systemd-oomd send SIGKILLs to all processes under a cgroup when swap used exceeds 90%.
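As a rough illustration of the swap configuration described above, the following sketch writes the two settings as drop-in files into a staging directory instead of the live system. The option names and values come from this proposal; the drop-in file names (50-swap.conf, 50-oomd-swap.conf) and the STAGE_ROOT variable are assumptions made for the example, and option names may differ between systemd versions, so check man oomd.conf on the target system.

```shell
#!/bin/sh
# Sketch: stage the two swap-related drop-ins described in the text.
# STAGE_ROOT is a hypothetical staging directory so that nothing under
# /etc is touched; copying the result into place (as root) is a manual step.
set -eu

STAGE_ROOT="${STAGE_ROOT:-$(mktemp -d)}"

# Daemon-wide swap threshold, as a drop-in for oomd.conf.
mkdir -p "$STAGE_ROOT/etc/systemd/oomd.conf.d"
cat > "$STAGE_ROOT/etc/systemd/oomd.conf.d/50-swap.conf" <<'EOF'
[OOM]
SwapUsedLimitPercent=90%
EOF

# Opt the root slice in to swap-based kills.
mkdir -p "$STAGE_ROOT/etc/systemd/system/-.slice.d"
cat > "$STAGE_ROOT/etc/systemd/system/-.slice.d/50-oomd-swap.conf" <<'EOF'
[Slice]
ManagedOOMSwap=kill
EOF

echo "Staged drop-ins under $STAGE_ROOT:"
find "$STAGE_ROOT" -type f | sort
```

Staging first makes it easy to review the files before installing them. After copying them under /etc, systemd would need a daemon-reload and systemd-oomd a restart for the settings to be picked up.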
The primary mechanism used by systemd-oomd for detecting when the system is out of memory is memory pressure. Memory pressure measures the percentage of time a cgroup has “wasted” due to lack of memory. This includes time spent reclaiming free memory, faulting in recently resident pages, and loading in anonymous pages from swap. When a monitored cgroup’s memory pressure exceeds the specified thresholds, systemd-oomd will perform action(s) on the targeted cgroup’s descendants, starting from the cgroups with the most reclaim scans. Reclaim activity is used here, rather than the largest consumer, as it reflects values set in the cgroup memory controller for memory protection (such as memory.low).
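To make the notion of memory pressure more concrete, the sketch below reads a PSI-formatted file and extracts the 10-second average. The kernel exposes the system-wide figure in /proc/pressure/memory, and cgroup v2 exposes the same format per cgroup in memory.pressure. The helper name psi_avg10 is hypothetical, and which line ("some" or "full") and which averaging window systemd-oomd consults is not stated in this proposal, so treat the choice of "full" here as an assumption.

```shell
#!/bin/sh
# psi_avg10 KIND: read PSI-formatted text on stdin and print the avg10
# value from the line that starts with KIND ("some" or "full").
# psi_avg10 is a hypothetical helper name, not part of systemd.
psi_avg10() {
    awk -v kind="$1" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^avg10=/) {
                    sub(/^avg10=/, "", $i)
                    print $i
                    exit
                }
            }
        }
    '
}

# Print the system-wide figure if this kernel exposes PSI.
if [ -r /proc/pressure/memory ]; then
    echo "full avg10: $(psi_avg10 full < /proc/pressure/memory)%"
fi
```

For example, feeding the helper the line "full avg10=0.75 avg60=0.40 avg300=0.10 total=6789" with KIND set to full prints 0.75, meaning tasks were fully stalled on memory for 0.75% of the last 10 seconds.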
For memory pressure configuration, this will be ManagedOOMMemoryPressure=kill and ManagedOOMMemoryPressureLimit=4% on user@.service to have systemd-oomd send SIGKILLs to all processes under a selected cgroup when total memory pressure on all tasks exceeds 4% for 10 seconds.
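The "exceeds 4% for 10 seconds" condition can be pictured as a check over consecutive samples. The sketch below is only an illustration of that idea, not systemd-oomd's actual implementation: the helper name sustained_over is hypothetical, and it assumes one pressure reading per second, one per line, on standard input.

```shell
#!/bin/sh
# sustained_over LIMIT SECONDS: read one pressure percentage per line on
# stdin (assumed to be sampled once per second) and succeed only if the
# most recent SECONDS samples were all above LIMIT.
# Hypothetical helper for illustration; not systemd-oomd's actual code.
sustained_over() {
    awk -v limit="$1" -v need="$2" '
        # Extend the current run of over-limit samples, or reset it.
        { run = ($1 + 0 > limit + 0) ? run + 1 : 0 }
        END { exit (run >= need ? 0 : 1) }
    '
}

# Ten consecutive one-second samples above 4%: the limit counts as exceeded.
if printf '%s\n' 5.1 4.8 6.0 7.2 5.5 4.9 6.3 8.0 5.2 4.4 | sustained_over 4 10; then
    echo "limit exceeded"
fi
```

A single sample at or below the limit resets the run, so a brief spike in pressure does not trigger a kill in this model; only pressure that stays above the limit for the whole window does.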
Feedback
(pending initial discussion)
Benefit to Fedora
- To address the request for better user feedback in https://pagure.io/fedora-workstation/issue/202, systemd-oomd currently logs to the journal when a pressure or swap action is about to occur. There are also debug logs for each process that is sent a SIGKILL, which can be raised in priority. Further notification mechanisms (e.g. over D-Bus) can also be implemented depending on feedback.
- While systemd-oomd is simpler to configure than the oomd used at Facebook, the algorithm is largely the same. As such, the following case study can be used as an example of how PSI and cgroup killing can release memory in situations not normally resolved by process killing, and lead to better utilization: https://facebookincubator.github.io/oomd/docs/oomd-casestudy.html
- OOM killing in userspace, before the kernel OOM killer kicks in, has been shown to be effective at keeping a system functional. An OOM kill in the kernel is slow, possibly leading to an unbounded amount of time spent swapping pages in and out and evicting the page cache.
- PSI-based actions, as opposed to actions based on raw memory consumption numbers, better reflect memory protection policies set through cgroup resource control limits (e.g. memory.low).
Scope
- Proposal owners:
  - Implement and land additional refinements to systemd-oomd
    - Remove swap as a hard requirement to running systemd-oomd
    - Expand ManagedOOM*= properties to user units (currently only usable on system units)
    - Configurable memory pressure time window knob
  - Enable oomd by default with sensible configuration
  - Test days
  - Aid with documentation
- Other developers:
  - systemd: review PRs as needed
- Release engineering: tbd
- Policies and guidelines: N/A
- Trademark approval: N/A
Upgrade/compatibility impact
Existing systems running earlyoom will not be modified. One can transition to systemd-oomd via:
sudo systemctl disable --now earlyoom
sudo systemctl enable --now systemd-oomd
Systems that were previously not running earlyoom will have systemd-oomd enabled by default.
How to test
The systemd 247 build for Fedora includes all the artifacts for systemd-oomd. It is disabled by default but can be enabled and started with:
sudo systemctl enable --now systemd-oomd
At this point you can decide which units to set properties on. For example, to enable swap-based killing on all units below the root slice:
sudo systemctl edit --force -- -.slice
[Slice]
ManagedOOMSwap=kill
# save and exit
To try a contrived memory pressure test, you can create a service that will throttle while running a bloat script like the one in https://github.com/systemd/systemd/blob/master/test/units/testsuite-56-testbloat.service running https://github.com/systemd/systemd/blob/master/test/units/testsuite-56-slowgrowth.sh. Then set ManagedOOMMemoryPressure=kill on the ancestor unit (in the case of the service linked, it’s testsuite-56-workload.slice).
User experience
This should be a fully transparent change for users.
Dependencies
None. If changes to oomd are required to address feedback to this proposal, they will need to be merged in systemd.
Contingency plan
- Contingency mechanism: For workstation, owner will revert all changes and we’ll go back to using earlyoom instead
- Contingency deadline: Final freeze
- Blocks release? No
- Blocks product? No
Documentation
https://www.freedesktop.org/software/systemd/man/systemd-oomd.service.html
https://www.freedesktop.org/software/systemd/man/oomctl.html
https://www.freedesktop.org/software/systemd/man/oomd.conf.html
Release Notes
systemd-oomd is enabled by default. Depending on which systemd units have ManagedOOMSwap=kill or ManagedOOMMemoryPressure=kill, systemd-oomd will SIGKILL all the processes under the appropriate descendant cgroups when the configured limits are exceeded.
To revert back to earlyoom, run:
sudo systemctl disable --now systemd-oomd
sudo systemctl enable --now earlyoom
See man oomd.conf for configuration options.