From Fedora Project Wiki

Revision as of 20:30, 30 December 2020

Enable systemd-oomd by default for all variants

Summary

Provide a better experience for Fedora users in out-of-memory (OOM) situations by enabling systemd-oomd by default. Actions taken by systemd-oomd operate on a per-cgroup level, aligning well with the life cycle of systemd units. systemd-oomd primarily uses Linux pressure stall information (PSI) to make decisions based on productivity wasted due to resource shortages; it also supports swap-based actions.

Owners

Current status

  • Targeted release: Fedora 34
  • Last updated: 2020-12-30
  • FESCo issue: #2535
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed description

The primary mechanism used by systemd-oomd for detecting when the system is out of memory is memory pressure. Memory pressure measures the percentage of time a cgroup has “wasted” due to lack of memory. This includes time spent reclaiming free memory, faulting in recently resident pages, and loading in anonymous pages from swap. When a monitored cgroup’s memory pressure exceeds the specified thresholds, systemd-oomd will perform action(s) on the targeted cgroup’s descendants, starting from the cgroups with the most reclaim scans. Reclaim activity is used here, rather than the largest consumer, as it reflects values set in the cgroup memory controller for memory protection (such as memory.low).
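
As a rough illustration of the pressure figure described above: the kernel exposes PSI as text in /proc/pressure/memory and in each cgroup's memory.pressure file. The sketch below extracts the 10-second "full" average from such a file. The function name psi_full_avg10 is made up for this example, and picking the "full" line and the avg10 field is an assumption about what best matches the description here, not a statement of what systemd-oomd reads internally.

```shell
# Print the "full" avg10 value (a percentage) from a PSI file.
# $1: path to a PSI file, e.g. /proc/pressure/memory or a cgroup's
#     memory.pressure file. Each line looks like:
#     full avg10=0.75 avg60=0.40 avg300=0.10 total=6789
psi_full_avg10() {
    awk '$1 == "full" { sub(/^avg10=/, "", $2); print $2 }' "$1"
}

# Example (the cgroup path is illustrative):
#   psi_full_avg10 /sys/fs/cgroup/user.slice/memory.pressure
```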

The proposed memory pressure configuration sets ManagedOOMMemoryPressure=kill and ManagedOOMMemoryPressureLimit=4% on user@.service, so that systemd-oomd sends SIGKILL to all processes under a selected cgroup when total memory pressure on all tasks exceeds 4% for 10 seconds.
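
The 4% limit can be read as a plain numeric comparison against a pressure value. Below is a minimal sketch, assuming the pressure is already available as a decimal percentage (for example, the avg10 field of a cgroup's memory.pressure file). pressure_exceeds is a hypothetical helper; systemd-oomd's own check also takes into account how long the limit has been exceeded.

```shell
# Succeed (exit 0) if pressure $1 is strictly greater than limit $2.
# Both arguments are percentages without the % sign, e.g. 5.25 and 4.
pressure_exceeds() {
    awk -v p="$1" -v l="$2" 'BEGIN { exit !((p + 0) > (l + 0)) }'
}

# Example:
#   pressure_exceeds 5.25 4 && echo "over the limit"
```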

For swap-based actions, systemd-oomd will monitor the system-wide swap space and act when available swap falls below the configured threshold, starting with the cgroups with the highest swap usage and working down to those with the least. Keeping some amount of swap (if enabled) available will prevent the kernel OOM killer from killing processes unpredictably and spending an unbounded amount of time afterwards.
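
To get a feel for the system-wide figure involved, swap usage can be estimated from the SwapTotal and SwapFree fields of /proc/meminfo. This is only a sketch: swap_used_percent is a hypothetical helper, and systemd-oomd's own accounting may differ from this calculation.

```shell
# Print the percentage of swap in use, rounded down to an integer.
# $1: meminfo-style file (defaults to /proc/meminfo).
# Prints 0 when no swap is configured.
swap_used_percent() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { swfree = $2 }
         END {
             if (total > 0) printf "%d\n", (total - swfree) * 100 / total
             else print 0
         }' "${1:-/proc/meminfo}"
}

# Example:
#   echo "swap used: $(swap_used_percent)%"
```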

The proposed swap configuration sets SwapUsedLimitPercent=90% in oomd.conf and ManagedOOMSwap=kill on -.slice (the root cgroup slice), so that systemd-oomd sends SIGKILL to all processes under a cgroup when swap usage exceeds 90%.
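
The oomd.conf half of this configuration could be written as a drop-in file, sketched below. The [OOM] section name, the drop-in directory /etc/systemd/oomd.conf.d and the file name 10-swap-limit.conf are assumptions for illustration; check man oomd.conf on the target system before relying on them.

```shell
# Write an oomd.conf drop-in that sets the swap limit from this proposal.
# $1: drop-in directory (assumed default: /etc/systemd/oomd.conf.d).
# The file name 10-swap-limit.conf is hypothetical.
write_swap_limit_dropin() {
    dropin_dir=${1:-/etc/systemd/oomd.conf.d}
    mkdir -p "$dropin_dir" || return 1
    printf '[OOM]\nSwapUsedLimitPercent=90%%\n' > "$dropin_dir/10-swap-limit.conf"
}

# Example (needs root when writing under /etc):
#   write_swap_limit_dropin
```

systemd-oomd would presumably need to be restarted to pick up a changed limit.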

Feedback

(pending initial discussion)

Benefit to Fedora

  • To address the request for better user feedback in https://pagure.io/fedora-workstation/issue/202, systemd-oomd currently logs to the journal when a pressure or swap action is about to occur. There are also debug logs for each process that is sent a SIGKILL, and these can be bumped up in priority. Further notification mechanisms (e.g. over D-Bus) can also be implemented depending on feedback.
  • While systemd-oomd is simpler to configure than the oomd used at Facebook, the algorithm is largely the same. As such, the following case study can be used as an example of how PSI and cgroup killing can release memory not normally resolved with process killing and lead to better utilization: https://facebookincubator.github.io/oomd/docs/oomd-casestudy.html
  • OOM killing in userspace, before the kernel OOM killer kicks in, has been shown to be effective at keeping a system functional. An OOM kill in the kernel is slow, possibly leading to an unbounded amount of time swapping pages in and out and evicting the page cache.
  • PSI-based actions, versus looking at raw memory consumption numbers, better reflect memory protection policies set for cgroup resource control limits (e.g. memory.low).

Scope

  • Proposal owners:
    • Implement and land additional refinements to systemd-oomd
      • Remove swap as a hard requirement to running systemd-oomd
      • Expand ManagedOOM*= properties to user units (currently only usable on system units)
      • Configurable memory pressure time window knob
    • Enable oomd by default with sensible configuration
    • Test days
    • Aid with documentation
  • Other developers:
    • systemd: review PRs as needed
  • Release engineering: https://pagure.io/releng/issue/9913
  • Policies and guidelines: N/A
  • Trademark approval: N/A

Upgrade/compatibility impact

Existing systems running earlyoom will not be modified. One can transition to systemd-oomd via:

sudo systemctl disable --now earlyoom
sudo systemctl enable --now systemd-oomd

Systems that were previously not running earlyoom will have systemd-oomd enabled by default.

How to test

The systemd 247 build for Fedora includes all the artifacts for systemd-oomd. It is disabled by default but can be enabled and started with:

sudo systemctl enable --now systemd-oomd

At this point you can decide which units to set properties on. For example, to enable swap-based killing on all units below the root slice:

sudo systemctl edit --force -- -.slice
[Slice]
ManagedOOMSwap=kill
# save and exit
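
For scripted testing, the same drop-in can be created without an interactive editor. The sketch below assumes that systemctl edit stores its result in a -.slice.d/override.conf file under /etc/systemd/system; write_root_slice_dropin is a hypothetical helper name.

```shell
# Create the drop-in that "systemctl edit --force -- -.slice" would produce.
# $1: unit directory (assumed default: /etc/systemd/system).
write_root_slice_dropin() {
    unit_dir=${1:-/etc/systemd/system}
    mkdir -p "$unit_dir/-.slice.d" || return 1
    printf '[Slice]\nManagedOOMSwap=kill\n' > "$unit_dir/-.slice.d/override.conf"
}

# Example (needs root when writing under /etc):
#   write_root_slice_dropin && systemctl daemon-reload
```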

Note that the following memory pressure example requires the changes listed in “Scope” to work as expected, as systemd-oomd shipped with systemd v247 does not support changing the time window for memory pressure. This example was run on a system with swap:

sudo systemctl edit user@.service
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=4%
# save and exit

systemd-run --user tail /dev/zero # will lead to a lot of reclaim and then OOM if not killed

User experience

This should be a fully transparent change for users.

Dependencies

None. If changes to oomd are required to address feedback to this proposal, they will need to be merged in systemd.

Contingency plan

  • Contingency mechanism: For Workstation, the owner will revert all changes and we’ll go back to using earlyoom
  • Contingency deadline: Final freeze
  • Blocks release? No
  • Blocks product? No

Documentation

https://www.freedesktop.org/software/systemd/man/systemd-oomd.html
https://www.freedesktop.org/software/systemd/man/oomctl.html
https://www.freedesktop.org/software/systemd/man/oomd.conf.html

Release Notes

systemd-oomd is enabled by default. Depending on which systemd units have ManagedOOMSwap=kill or ManagedOOMMemoryPressure=kill, systemd-oomd will SIGKILL all the processes under the appropriate descendant cgroups when the configured limits are exceeded.

To revert to earlyoom, run:

sudo systemctl disable --now systemd-oomd
sudo systemctl enable --now earlyoom

See man oomd.conf for configuration options.