Latest revision as of 19:54, 30 March 2021
Enable systemd-oomd by default for all variants
Summary
Provide a better experience for Fedora users in out-of-memory (OOM) situations by enabling systemd-oomd by default. Actions taken by systemd-oomd operate on a per-cgroup level, aligning well with the life cycle of systemd units. systemd-oomd primarily uses Linux pressure stall information (PSI) to make decisions based on wasted productivity due to resource shortages; in addition to that, it also supports swap based actions.
Owners
- Name: Anita Zhang, Davide Cavalca, Michel Salim, Tejun Heo, Rik van Riel
- Email: the.anitazha@gmail.com, dcavalca@fb.com, michel@michel-slm.name, htejun@fb.com, riel@fb.com
Current status
- Targeted release: Fedora 34
- Last updated: 2021-03-30
- FESCo issue: #2535
- Tracker bug: #1913794
- Release notes tracker: #627
Detailed description
The primary mechanism used by systemd-oomd for detecting when the system is out of memory is memory pressure. Memory pressure measures the percentage of time a cgroup has “wasted” due to lack of memory. This includes time spent reclaiming free memory, faulting in recently resident pages, and loading in anonymous pages from swap. When a monitored cgroup’s memory pressure exceeds the specified thresholds, systemd-oomd will perform action(s) on the targeted cgroup’s descendants, starting from the cgroups with the most reclaim scans. Reclaim activity is used here, rather than the largest consumer, as it reflects values set in the cgroup memory controller for memory protection (such as memory.low).
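As a rough illustration of the pressure figure described above, the kernel exposes PSI as text, both system-wide in /proc/pressure/memory and per cgroup in a memory.pressure file. The sketch below pulls out the 10-second average from the "full" line; that this is the figure closest to what systemd-oomd compares against its limit is an assumption here, and the helper name psi_full_avg10 is made up for this example.

```shell
# Print the avg10 value of the "full" line from PSI-formatted text on stdin.
# PSI files contain lines such as:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_full_avg10() {
    awk '$1 == "full" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^avg10=/) {
                sub(/^avg10=/, "", $i)   # strip the key, keep the number
                print $i
            }
        }
    }'
}

# Usage on a live system (paths assume PSI is enabled and cgroup v2 is
# mounted at /sys/fs/cgroup):
#   psi_full_avg10 < /proc/pressure/memory
#   psi_full_avg10 < /sys/fs/cgroup/user.slice/memory.pressure
```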
For memory pressure configuration, this will be ManagedOOMMemoryPressure=kill and ManagedOOMMemoryPressureLimit=50% on user@.service to have systemd-oomd send SIGKILLs to all processes under a selected cgroup when total memory pressure on all tasks exceeds 50% for 20 seconds.
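A minimal sketch of how those two settings could be written as a drop-in for user@.service. The function name and the file name oomd-pressure.conf are hypothetical, and the target directory is passed in so the snippet can be tried in a scratch directory first. The 20-second window is configured separately and is not shown here.

```shell
# Write a drop-in containing the memory pressure settings named above.
# $1 is the directory to write into. On a real system this would be a
# drop-in directory such as /etc/systemd/system/user@.service.d (needs root),
# followed by "systemctl daemon-reload".
write_pressure_dropin() {
    mkdir -p "$1" &&
    cat > "$1/oomd-pressure.conf" <<'EOF'
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
EOF
}

# Usage (scratch directory):
#   write_pressure_dropin /tmp/user@.service.d
```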
For swap based actions, systemd-oomd will monitor the system-wide swap space and act when available swap falls below the configured threshold, starting with the cgroups with the highest swap usage and working down to the lowest. Keeping some amount of swap (if enabled) available will prevent the kernel OOM killer from killing processes unpredictably and spending an unbounded amount of time doing so.
For swap configuration, this will be SwapUsedLimitPercent=90% in oomd.conf and ManagedOOMSwap=kill on -.slice (root cgroup slice) to have systemd-oomd send SIGKILLs to all processes under a cgroup when swap used exceeds 90%.
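To see roughly where a system stands relative to the 90% threshold, the swap-used percentage can be derived from the SwapTotal and SwapFree fields of /proc/meminfo. This is a sketch of that arithmetic only; whether systemd-oomd computes the figure in exactly this way is an assumption, and the function name is invented for the example.

```shell
# Print swap used as an integer percentage, reading /proc/meminfo-formatted
# text on stdin. Prints "no swap" when SwapTotal is zero or missing.
swap_used_percent() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { swfree = $2 }
         END {
             if (total > 0)
                 printf "%d\n", (total - swfree) * 100 / total
             else
                 print "no swap"
         }'
}

# Usage on a live system:
#   swap_used_percent < /proc/meminfo
```

With SwapUsedLimitPercent=90%, a result above 90 is the situation in which swap-based kills would be expected on units that have ManagedOOMSwap=kill.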
Feedback
Can we integrate this with GIO's GMemoryMonitor API?
Likely yes, though it is not planned by the maintainers for the near term.
Can we exclude certain units from being killed?
Setting ManagedOOMPreference=avoid or ManagedOOMPreference=omit on systemd units that are leaf cgroup nodes, or cgroups with memory.oom.group set to 1, can prevent them from being targeted by systemd-oomd. avoid de-prioritizes the unit as a kill candidate, while omit makes systemd-oomd ignore it entirely. Since these settings are meant to be used sparingly (e.g. for critical services), their use is limited to root-owned cgroups.
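As a sketch of what such an override might look like, the helper below prints a drop-in body for one of the two values discussed above. The function name is hypothetical, and example.service in the usage comment stands in for whatever root-owned unit is being protected.

```shell
# Print a drop-in body that sets ManagedOOMPreference to "avoid" or "omit".
# Any other value is rejected, since only these two are discussed above.
oom_preference_dropin() {
    case "$1" in
        avoid|omit)
            printf '[Service]\nManagedOOMPreference=%s\n' "$1"
            ;;
        *)
            echo "preference must be avoid or omit" >&2
            return 1
            ;;
    esac
}

# Usage (example.service is a placeholder unit name):
#   sudo mkdir -p /etc/systemd/system/example.service.d
#   oom_preference_dropin avoid | sudo tee /etc/systemd/system/example.service.d/oomd.conf
#   sudo systemctl daemon-reload
```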
How will this work if everything is in the same cgroup?
It will not work as systemd-oomd acts on a per-cgroup level. Applications will need to spawn processes into separate cgroups (e.g. with systemd-run) or use a desktop environment (e.g. GNOME, KDE) that does this for them.
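One way to do this by hand is a small wrapper around systemd-run that starts a command in its own transient scope under the user manager, so that it gets a cgroup of its own. This is only a sketch: the wrapper name is made up, and it assumes a running systemd user session.

```shell
# Run a command in its own transient scope unit (and therefore its own
# cgroup) under the user's systemd instance. --scope keeps the command
# attached to the calling terminal instead of running it as a service.
run_in_own_cgroup() {
    systemd-run --user --scope -- "$@"
}

# Usage:
#   run_in_own_cgroup firefox
```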
Should spins that don't put processes in separate cgroups be excluded from this change?
That will be left up to the maintainers of those spins. Based on feedback, the current plan is to enable systemd-oomd with the specified configuration by default to minimize fragmentation on the Fedora install base (the Upgrade/compatibility impact section has been updated to reflect this). A separate subpackage, "systemd-oomd-defaults", controls the policy for systemd-oomd; excluding or removing it (and then running systemctl daemon-reload) will prevent systemd-oomd from killing anything, because without a policy systemd-oomd doesn't act.
Benefit to Fedora
- To address the request for better user feedback in https://pagure.io/fedora-workstation/issue/202, systemd-oomd currently logs to the journal when a pressure or swap action is about to occur. There are also debug logs for each process that is sent a SIGKILL, which can be raised in priority. Further notification mechanisms (e.g. over D-Bus) can also be implemented depending on feedback.
- While systemd-oomd is simpler to configure than the oomd used at Facebook, the algorithm is largely the same. As such, the following case study can be used as an example of how PSI and cgroup killing can release memory in situations that per-process killing does not normally resolve, and lead to better utilization: https://facebookincubator.github.io/oomd/docs/oomd-casestudy.html
- OOM killing in userspace, before the kernel OOM killer kicks in, has been shown to be effective at keeping a system functional. An OOM kill in the kernel is slow, possibly leading to an unbounded amount of time spent swapping pages in and out and evicting the page cache.
- PSI based actions, versus looking at raw memory consumption numbers, better reflect memory protection policies set for cgroup resource control limits (e.g. memory.low).
Scope
- Proposal owners:
  - Implement and land additional refinements to systemd-oomd
    - Remove swap as a hard requirement for running systemd-oomd
    - Configurable memory pressure time window knob
    - Per-unit knob to exclude units from being killed
  - Enable oomd by default with sensible configuration that can be easily opted out
  - Test days
  - Aid with documentation
- Other developers:
  - systemd: review PRs as needed
- Release engineering: https://pagure.io/releng/issue/9913
- Policies and guidelines: N/A
- Trademark approval: N/A
Upgrade/compatibility impact
Systemd-oomd will be enabled by default, including on upgrade and new installs. Systems that were previously running earlyoom will be transitioned in a process similar to running these commands:
sudo systemctl disable --now earlyoom
sudo systemctl enable --now systemd-oomd
How to test
The systemd 247 build for Fedora includes all the artifacts for systemd-oomd. It is disabled by default but can be started with:
sudo systemctl enable --now systemd-oomd
At this point you can decide which units to set properties on. For example, to enable swap-based killing on all units below the root slice:
sudo systemctl edit --force -- -.slice
[Slice]
ManagedOOMSwap=kill
# save and exit
Note that the following memory pressure example requires the changes listed in “Scope” to work as expected, as systemd-oomd shipped with systemd v247 does not support changing the time window for memory pressure. This example was run on a system with swap:
systemctl edit user@.service
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=10%
# save and exit
systemd-run --user tail /dev/zero
# will lead to a lot of reclaim and then OOM if not killed
User experience
This should be a fully transparent change for users.
Dependencies
None. If changes to oomd are required to address feedback to this proposal, they will need to be merged in systemd.
Contingency plan
- Contingency mechanism: For Workstation, the owners will revert all changes and go back to using earlyoom instead
- Contingency deadline: Final freeze
- Blocks release? No
- Blocks product? No
Documentation
https://www.freedesktop.org/software/systemd/man/systemd-oomd.html
https://www.freedesktop.org/software/systemd/man/oomctl.html
https://www.freedesktop.org/software/systemd/man/oomd.conf.html
Release Notes
systemd-oomd is enabled by default. Depending on which systemd units have ManagedOOMSwap=kill or ManagedOOMMemoryPressure=kill, systemd-oomd will SIGKILL all the processes under the appropriate descendant cgroups when the configured limits are exceeded.
To revert to earlyoom, run:
sudo systemctl disable --now systemd-oomd
sudo systemctl enable --now earlyoom
See man oomd.conf for configuration options.