Latest revision as of 13:38, 9 June 2021

Build Python with -fno-semantic-interposition for better performance

Simplified version of another change proposal
This change was originally proposed for Fedora 32 as Changes/PythonStaticSpeedup; however, based on community feedback, its scope has been significantly reduced.

Summary

We add the -fno-semantic-interposition compiler flag when building Python interpreters, as it provides a significant performance improvement of up to 27%, depending on the workload. Users will no longer be able to use LD_PRELOAD to override a symbol from libpython, which we consider a good trade-off for the speedup.

Owner

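Current status

  • Tracker bug: #1779341 (https://bugzilla.redhat.com/show_bug.cgi?id=1779341)
  • Release notes tracker: #421 (https://pagure.io/fedora-docs/release-notes/issue/421)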

Detailed Description

When we build the Python interpreter with the -fno-semantic-interposition compiler/linker flag, we can achieve a performance gain of 5% to 27%, depending on the workload. Link-time optimizations and profile-guided optimizations also have a greater impact when python3 is built this way.

For a vague-linkage function definition, a call site in the same translation unit may inline the callee. Whether -fno-semantic-interposition is enabled has no effect.

For a non-vague-linkage function definition, by default (-fsemantic-interposition) the -fpic mode does not allow a call site in the same translation unit to inline the callee or perform other interprocedural optimizations. -fno-semantic-interposition re-enables interprocedural optimizations.

If a caller inlines a callee, using LD_PRELOAD to interpose the callee will not affect that caller. Many other uses of LD_PRELOAD still work. We consider this small LD_PRELOAD limitation a good trade-off for the speedup.

Interposition is enabled by default in compilers like GCC: function calls to a library go through a "Procedure Linkage Table" (PLT). This indirection is required to allow a library loaded through the LD_PRELOAD environment variable to override a function. The indirection puts more pressure on the CPU level 1 cache (instruction cache). In terms of performance, the main drawback is that function calls from a library to the same library cannot be inlined, because inlining would break the interposition semantics. Inlining is usually a big win in terms of performance.

Disabling interposition for libpython removes the overhead on function calls by avoiding the PLT indirection, and allows more function calls to be inlined. This applies to function calls from libpython to libpython, which are very common in Python: almost all function calls are calls from libpython to libpython.

If Fedora users need to use LD_PRELOAD to override symbols in libpython, the recommended way is to build a custom Python without -fno-semantic-interposition.

It is still possible to use LD_PRELOAD to override symbols in other libraries (for example in glibc).

Affected Pythons

Primarily, we will change the interpreter in the python3 package, that is, Python 3.8 in Fedora 32 and any later version of Python in future Fedora releases.

Impact on other Python packages (and generally software using Python) is not anticipated (other than the possible speedup).

We will also change the alternate Python interpreters where possible and useful, primarily the upstream supported versions of CPython, such as python39 (if already packaged), python37 and python36.

Affected Fedora releases

This is a Fedora 32 change and it will be implemented in Rawhide (Fedora 32) only. Any future versions of Fedora will inherit the change until it is reverted for some reason.

If it turns out that there are absolutely no issues, we might consider backporting the speedup to already released Fedora versions (for example Fedora 31). Such action would be separately coordinated with FESCo.

Benefit to Fedora

Python's performance will increase significantly, depending on the workload. Since many core components of the OS also depend on Python, this could lead to an increase in their performance as well. However, individual benchmarks will need to be conducted to verify the performance gain for those components.

pyperformance results, ignoring differences smaller than 5%:

+-------------------------+------------------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 (original)  | python38-3.8.0-2 (changed)   |
+=========================+==============================+==============================+
| scimark_lu              | 294 ms                       | 213 ms: 1.38x faster (-27%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_sparse_mat_mult | 8.61 ms                      | 6.39 ms: 1.35x faster (-26%) |
+-------------------------+------------------------------+------------------------------+
| nbody                   | 236 ms                       | 179 ms: 1.32x faster (-24%)  |
+-------------------------+------------------------------+------------------------------+
| django_template         | 203 ms                       | 158 ms: 1.29x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| raytrace                | 910 ms                       | 709 ms: 1.28x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| logging_format          | 17.7 us                      | 13.8 us: 1.28x faster (-22%) |
+-------------------------+------------------------------+------------------------------+
| richards                | 124 ms                       | 97.2 ms: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| unpickle                | 23.9 us                      | 18.8 us: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| chaos                   | 200 ms                       | 158 ms: 1.26x faster (-21%)  |
+-------------------------+------------------------------+------------------------------+
| hexiom                  | 17.6 ms                      | 14.0 ms: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| logging_simple          | 15.8 us                      | 12.5 us: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| nqueens                 | 179 ms                       | 142 ms: 1.26x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| logging_silent          | 340 ns                       | 273 ns: 1.25x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| crypto_pyaes            | 201 ms                       | 162 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_fft             | 653 ms                       | 527 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_monte_carlo     | 190 ms                       | 154 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| pickle_pure_python      | 795 us                       | 646 us: 1.23x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| go                      | 443 ms                       | 361 ms: 1.23x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| deltablue               | 12.6 ms                      | 10.4 ms: 1.22x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| spectral_norm           | 245 ms                       | 201 ms: 1.22x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| float                   | 203 ms                       | 167 ms: 1.21x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| mako                    | 27.0 ms                      | 22.2 ms: 1.21x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| scimark_sor             | 347 ms                       | 286 ms: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_pure_python    | 575 us                       | 475 us: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| fannkuch                | 803 ms                       | 667 ms: 1.20x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| pathlib                 | 35.3 ms                      | 29.5 ms: 1.20x faster (-17%) |
+-------------------------+------------------------------+------------------------------+
| pyflate                 | 1.15 sec                     | 959 ms: 1.19x faster (-16%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_expand            | 707 ms                       | 600 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| regex_compile           | 303 ms                       | 258 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| chameleon               | 15.7 ms                      | 13.3 ms: 1.18x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| sympy_str               | 461 ms                       | 394 ms: 1.17x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| genshi_xml              | 104 ms                       | 88.4 ms: 1.17x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| dulwich_log             | 116 ms                       | 100 ms: 1.16x faster (-14%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_integrate         | 34.4 ms                      | 29.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| genshi_text             | 49.1 ms                      | 42.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| 2to3                    | 535 ms                       | 471 ms: 1.14x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| json_dumps              | 20.4 ms                      | 18.0 ms: 1.13x faster (-12%) |
+-------------------------+------------------------------+------------------------------+
| sympy_sum               | 285 ms                       | 252 ms: 1.13x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_process       | 128 ms                       | 114 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlite_synth            | 4.75 us                      | 4.24 us: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| telco                   | 10.1 ms                      | 8.98 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| meteor_contest          | 168 ms                       | 150 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_imperative   | 53.3 ms                      | 47.7 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| tornado_http            | 425 ms                       | 382 ms: 1.11x faster (-10%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_generate      | 159 ms                       | 144 ms: 1.10x faster (-9%)   |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_declarative  | 271 ms                       | 251 ms: 1.08x faster (-7%)   |
+-------------------------+------------------------------+------------------------------+
| json_loads              | 43.5 us                      | 40.4 us: 1.08x faster (-7%)  |
+-------------------------+------------------------------+------------------------------+
| python_startup          | 13.9 ms                      | 13.1 ms: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_list           | 6.68 us                      | 6.29 us: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+

Scope

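  • Proposal owners: tasks include monitoring Koschei for significant problems, backporting the change to alternate Python versions, and attempting to upstream the change (https://bugs.python.org/issue38980)
  • Other developers are encouraged to check that their packages work as expected
  • Release engineering: N/A (not needed for this Change) -- this change does not require a mass rebuild or any other special releng work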
  • Policies and guidelines: N/A (not needed for this Change)
  • Trademark approval: N/A (not needed for this Change)

Upgrade/compatibility impact

Python package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.

How To Test

Test that everything Python related in Fedora works as usual.

Test whether the flag was applied

You can test whether the -fno-semantic-interposition flag was applied for your Python build:

>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
True
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST'))
True

Before the change, you would see False, False.
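
The interactive check above raises a TypeError if one of the variables is not defined for a given build, because sysconfig.get_config_var() returns None in that case. A minimal sketch of the same check as a script follows, assuming that missing variables should simply be treated as empty. The helper name flag_applied is hypothetical and not part of the change.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"


def flag_applied(flag, *var_names):
    """Return True if `flag` appears in any of the named build variables."""
    # get_config_var() returns None for variables the build does not define,
    # so fall back to an empty string before searching.
    values = (sysconfig.get_config_var(name) or "" for name in var_names)
    return any(flag in str(value) for value in values)


print("compiler flags:", flag_applied(FLAG, "PY_CFLAGS", "PY_CFLAGS_NODIST"))
print("linker flags:  ", flag_applied(FLAG, "PY_LDFLAGS", "PY_LDFLAGS_NODIST"))
```

Unlike the interactive session, this checks each variable separately rather than concatenating them, but it answers the same question for the compiler and linker flags.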

Performance test

The performance speedup can be measured using the official Python benchmark suite pyperformance: see Run benchmarks.
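
For a quick sanity check before running the full suite, a rough micro-benchmark of a call-heavy workload can be run under both interpreter builds and compared. This is only a sketch using the standard library's timeit module. The workload function is hypothetical and is not part of pyperformance, whose results remain the reference.

```python
import timeit


def workload(n=200):
    """Call-heavy pure-Python loop; each iteration calls several builtins."""
    total = 0
    for i in range(n):
        total += len(str(i)) + abs(-i)
    return total


def best_time(repeat=5, number=200):
    """Return the fastest of `repeat` timings, each running workload() `number` times."""
    return min(timeit.repeat(workload, repeat=repeat, number=number))


if __name__ == "__main__":
    print(f"best of 5: {best_time() * 1e3:.2f} ms")
```

Run the same script with each interpreter and compare the printed times. A benchmark this small is noisy, so small differences should not be read as meaningful.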

User Experience

Python-based workloads should see a performance gain of up to 27%.

Dependencies

This change is not dependent on anything else.

Contingency Plan

  • Contingency mechanism: If issues appear that cannot be fixed in a timely manner, the change can be easily reverted and will be considered again for the next Fedora release.
  • Contingency deadline: Before the beta freeze of Fedora 32 (2020-02-25)
  • Blocks release? Yes
  • Blocks product? None

Documentation

This change proposal has all the documentation.

See the previous change proposal and the thread about it on the devel mailing list for more information about what we are not doing.

Release Notes

TBD. Be sure to mention PEP 445 and 454.