Change approved by FESCo: https://pagure.io/fesco/issue/2290#comment-614554
Latest revision as of 13:38, 9 June 2021
Build Python with -fno-semantic-interposition for better performance
Summary
We add the -fno-semantic-interposition compiler flag when building Python interpreters, as it provides a significant performance improvement, up to 27% depending on the workload. Users will no longer be able to use LD_PRELOAD to override a symbol from libpython, which we consider a good trade-off for the speedup.
Owner
- Name: Charalampos Stratakis, Victor Stinner, Miro Hrončok
- Email: python-maint@redhat.com
- Shout-out: Jan Kratochvíl for first suggesting this instead of the original proposal, followed by Kevin Kofler. Florian Weimer for providing answers to our questions. David Gray for originally suggesting to link Python statically to gain performance.
Current status
- Targeted release: Fedora 32
- Last updated: 2021-06-09
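- Tracker bug: #1779341 (https://bugzilla.redhat.com/show_bug.cgi?id=1779341)
- Release notes tracker: #421 (https://pagure.io/fedora-docs/release-notes/issue/421)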
Detailed Description
When we build the Python interpreter with the -fno-semantic-interposition compiler/linker flag, we can achieve a performance gain of 5% to 27% depending on the workload. Link-time optimizations and profile-guided optimizations also have a greater impact when python3 is built this way.
For a vague-linkage function definition, a call site in the same translation unit may inline the callee. Whether -fno-semantic-interposition is enabled has no effect.
For a non-vague-linkage function definition, by default (-fsemantic-interposition) the -fpic mode does not allow a call site in the same translation unit to inline the callee or perform other interprocedural optimizations. -fno-semantic-interposition re-enables interprocedural optimizations.
If a caller inlines a callee, using LD_PRELOAD to interpose the callee will not affect the caller. Many other uses of LD_PRELOAD still work. We consider this small LD_PRELOAD limitation a good trade-off for the speedup.
Interposition is enabled by default in compilers like GCC: function calls to a library go through a "Procedure Linkage Table" (PLT). This indirection is required to allow a library loaded via the LD_PRELOAD environment variable to override a function. The indirection puts more pressure on the CPU level 1 cache (instruction cache). In terms of performance, the main drawback is that function calls from a library to the same library cannot be inlined, in order to respect the interposition semantics. Inlining is usually a big win in terms of performance.
Disabling interposition for libpython removes the overhead on function calls by avoiding the PLT indirection, and allows more function calls to be inlined. We are describing function calls from libpython to libpython, which are very common in Python: almost all function calls are calls from libpython to libpython.
If Fedora users need to use LD_PRELOAD to override symbols in libpython, the recommended way is to build a custom Python without -fno-semantic-interposition.
It is still possible to use LD_PRELOAD to override symbols in other libraries (for example in glibc).
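The flag concerns calls made from inside libpython to libpython; it does not hide exported symbols from outside callers. As a small illustration (a sketch only, not part of the change itself), the snippet below looks up the exported C function Py_GetVersion through the dynamic symbol table using ctypes and calls it. The helper name libpython_version is made up for this example.

```python
import ctypes
import sys


def libpython_version() -> str:
    """Call the exported C function Py_GetVersion via dynamic symbol lookup."""
    get_version = ctypes.pythonapi.Py_GetVersion
    get_version.restype = ctypes.c_char_p  # C signature: const char *Py_GetVersion(void)
    get_version.argtypes = []
    return get_version().decode("ascii")


if __name__ == "__main__":
    # The string comes from the C API and starts with the version number
    # that sys.version reports.
    print(libpython_version())
    print(sys.version.split()[0])
```

Code outside libpython (extension modules, embedding applications, ctypes) keeps resolving such symbols dynamically, with or without the flag.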
Affected Pythons
Primarily, we will change the interpreter in the python3 package, that is Python 3.8 in Fedora 32 and any later version of Python in future Fedora releases.
Impact on other Python packages (and generally software using Python) is not anticipated (other than the possible speedup).
We will also change the alternate Python interpreters where possible and useful, primarily the upstream supported versions of CPython, such as python39 (if already packaged), python37 and python36.
Affected Fedora releases
This is a Fedora 32 change and it will be implemented in Rawhide (Fedora 32) only. Future versions of Fedora will inherit the change unless it is reverted for some reason.
If it turns out that there are absolutely no issues, we might consider backporting the speedup to already released Fedora versions (for example Fedora 31). Such action would be separately coordinated with FESCo.
Benefit to Fedora
Python's performance will increase significantly depending on the workload. Since many core components of the OS also depend on Python, this could lead to an increase in their performance as well. However, individual benchmarks will need to be conducted to verify the performance gain for those components.
pyperformance results, ignoring differences smaller than 5%:
+-------------------------+------------------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 (original)  | python38-3.8.0-2 (changed)   |
+=========================+==============================+==============================+
| scimark_lu              | 294 ms                       | 213 ms: 1.38x faster (-27%)  |
| scimark_sparse_mat_mult | 8.61 ms                      | 6.39 ms: 1.35x faster (-26%) |
| nbody                   | 236 ms                       | 179 ms: 1.32x faster (-24%)  |
| django_template         | 203 ms                       | 158 ms: 1.29x faster (-22%)  |
| raytrace                | 910 ms                       | 709 ms: 1.28x faster (-22%)  |
| logging_format          | 17.7 us                      | 13.8 us: 1.28x faster (-22%) |
| richards                | 124 ms                       | 97.2 ms: 1.27x faster (-21%) |
| unpickle                | 23.9 us                      | 18.8 us: 1.27x faster (-21%) |
| chaos                   | 200 ms                       | 158 ms: 1.26x faster (-21%)  |
| hexiom                  | 17.6 ms                      | 14.0 ms: 1.26x faster (-21%) |
| logging_simple          | 15.8 us                      | 12.5 us: 1.26x faster (-21%) |
| nqueens                 | 179 ms                       | 142 ms: 1.26x faster (-20%)  |
| logging_silent          | 340 ns                       | 273 ns: 1.25x faster (-20%)  |
| crypto_pyaes            | 201 ms                       | 162 ms: 1.24x faster (-19%)  |
| scimark_fft             | 653 ms                       | 527 ms: 1.24x faster (-19%)  |
| scimark_monte_carlo     | 190 ms                       | 154 ms: 1.24x faster (-19%)  |
| pickle_pure_python      | 795 us                       | 646 us: 1.23x faster (-19%)  |
| go                      | 443 ms                       | 361 ms: 1.23x faster (-18%)  |
| deltablue               | 12.6 ms                      | 10.4 ms: 1.22x faster (-18%) |
| spectral_norm           | 245 ms                       | 201 ms: 1.22x faster (-18%)  |
| float                   | 203 ms                       | 167 ms: 1.21x faster (-18%)  |
| mako                    | 27.0 ms                      | 22.2 ms: 1.21x faster (-18%) |
| scimark_sor             | 347 ms                       | 286 ms: 1.21x faster (-17%)  |
| unpickle_pure_python    | 575 us                       | 475 us: 1.21x faster (-17%)  |
| fannkuch                | 803 ms                       | 667 ms: 1.20x faster (-17%)  |
| pathlib                 | 35.3 ms                      | 29.5 ms: 1.20x faster (-17%) |
| pyflate                 | 1.15 sec                     | 959 ms: 1.19x faster (-16%)  |
| sympy_expand            | 707 ms                       | 600 ms: 1.18x faster (-15%)  |
| regex_compile           | 303 ms                       | 258 ms: 1.18x faster (-15%)  |
| chameleon               | 15.7 ms                      | 13.3 ms: 1.18x faster (-15%) |
| sympy_str               | 461 ms                       | 394 ms: 1.17x faster (-15%)  |
| genshi_xml              | 104 ms                       | 88.4 ms: 1.17x faster (-15%) |
| dulwich_log             | 116 ms                       | 100 ms: 1.16x faster (-14%)  |
| sympy_integrate         | 34.4 ms                      | 29.9 ms: 1.15x faster (-13%) |
| genshi_text             | 49.1 ms                      | 42.9 ms: 1.15x faster (-13%) |
| 2to3                    | 535 ms                       | 471 ms: 1.14x faster (-12%)  |
| json_dumps              | 20.4 ms                      | 18.0 ms: 1.13x faster (-12%) |
| sympy_sum               | 285 ms                       | 252 ms: 1.13x faster (-12%)  |
| xml_etree_process       | 128 ms                       | 114 ms: 1.12x faster (-11%)  |
| sqlite_synth            | 4.75 us                      | 4.24 us: 1.12x faster (-11%) |
| telco                   | 10.1 ms                      | 8.98 ms: 1.12x faster (-11%) |
| meteor_contest          | 168 ms                       | 150 ms: 1.12x faster (-11%)  |
| sqlalchemy_imperative   | 53.3 ms                      | 47.7 ms: 1.12x faster (-11%) |
| tornado_http            | 425 ms                       | 382 ms: 1.11x faster (-10%)  |
| xml_etree_generate      | 159 ms                       | 144 ms: 1.10x faster (-9%)   |
| sqlalchemy_declarative  | 271 ms                       | 251 ms: 1.08x faster (-7%)   |
| json_loads              | 43.5 us                      | 40.4 us: 1.08x faster (-7%)  |
| python_startup          | 13.9 ms                      | 13.1 ms: 1.06x faster (-6%)  |
| unpickle_list           | 6.68 us                      | 6.29 us: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
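Each result gives two figures for the same measurement, for example "1.38x faster (-27%)" for scimark_lu. The sketch below shows how the two relate, assuming the "x faster" figure is the old time divided by the new time and the percentage is the relative change in run time. That is our reading of the table, and the function names are made up for this example. Because the published timings are rounded, recomputing from them can differ from the table by about a percentage point.

```python
def speedup(old: float, new: float) -> float:
    """How many times faster the new build is (old time / new time)."""
    return old / new


def percent_change(old: float, new: float) -> float:
    """Relative change in run time, in percent (negative means faster)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # scimark_lu timings in milliseconds, as published (rounded) in the table.
    old_ms, new_ms = 294.0, 213.0
    print(f"{speedup(old_ms, new_ms):.2f}x faster")
    print(f"{percent_change(old_ms, new_ms):.1f}% run time")
```

This is why the headline figure is "up to 27%" even though the best benchmark is reported as 1.38x faster: the 27% is the reduction in run time, not the ratio of the two timings.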
Scope
- Proposal owners:
- Review and merge the pull request with the implementation.
- Monitor Koschei for significant problems.
- Backport the change to alternate Python versions.
- Attempt to upstream the change: https://bugs.python.org/issue38980
- Other developers are encouraged to check if their package works as expected
- Release engineering: N/A (not needed for this Change) -- this change does not require a mass rebuild nor any other special releng work
- Policies and guidelines: N/A (not needed for this Change)
- Trademark approval: N/A (not needed for this Change)
Upgrade/compatibility impact
Python package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.
How To Test
Test that everything Python related in Fedora works as usual.
Test whether the flag was applied
You can test whether the -fno-semantic-interposition flag was applied for your Python build:
>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
True
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST'))
True
Before the change, you would see False, False.
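The same check can be wrapped in a small script. This is only a sketch: the helper name flag_in_config and its get_var parameter are made up for this example. Unlike the interactive session, it tolerates build variables that are undefined (sysconfig.get_config_var returns None for those) and matches the flag as a whole word rather than as a substring.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"


def flag_in_config(flag, var_names, get_var=sysconfig.get_config_var):
    """Return True if `flag` appears in any of the named build variables."""
    for name in var_names:
        value = get_var(name)  # None when the variable is not defined
        if isinstance(value, str) and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    print("compiler flags:", flag_in_config(FLAG, ("PY_CFLAGS", "PY_CFLAGS_NODIST")))
    print("linker flags:  ", flag_in_config(FLAG, ("PY_LDFLAGS", "PY_LDFLAGS_NODIST")))
```

On a build with the change applied, both lines should report True.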
Performance test
The performance speedup can be measured using the official Python benchmark suite pyperformance: see Run benchmarks.
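For a quick sanity check before running the full suite, a rough comparison can be made with the standard library's timeit module. The sketch below is not a substitute for pyperformance: the workload and the function names are made up for this example, and a single micro-benchmark is noisy. Run the same script with the original and the changed interpreter and compare the printed times.

```python
import timeit


def call_heavy_workload(n: int = 10_000) -> int:
    """A toy loop dominated by calls into the interpreter's own C functions."""
    total = 0
    for i in range(n):
        total += len(str(i))
    return total


def best_time(repeat: int = 5, number: int = 20) -> float:
    """Best observed time in seconds for one run of the workload."""
    timings = timeit.repeat(call_heavy_workload, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    print(f"best of 5: {best_time() * 1000:.3f} ms per run")
```

Taking the minimum over several repeats reduces the influence of background noise on the result.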
User Experience
Python-based workloads should see a performance gain of up to 27%.
Dependencies
This change is not dependent on anything else.
Contingency Plan
- Contingency mechanism: If issues appear that cannot be fixed in a timely manner, the change can be easily reverted and will be considered again for the next Fedora release.
- Contingency deadline: Before the beta freeze of Fedora 32 (2020-02-25)
- Blocks release? Yes
- Blocks product? None
Documentation
This change proposal has all the documentation.
See the previous change proposal and the thread about it on the devel mailing list for more relevant information about what we are not doing.