From Fedora Project Wiki

Revision as of 13:38, 9 June 2021 by Cstratak

Build Python with -fno-semantic-interposition for better performance

Simplified version of another change proposal
This change was originally proposed for Fedora 32 as Changes/PythonStaticSpeedup. Based on community feedback, its scope has been significantly reduced.

Summary

We add the -fno-semantic-interposition compiler flag when building Python interpreters, as it provides a significant performance improvement of up to 27%, depending on the workload. Users will no longer be able to use LD_PRELOAD to override a symbol from libpython, which we consider a good trade-off for the speedup.

Owner

Current status

Detailed Description

When we build the Python interpreter with the -fno-semantic-interposition compiler/linker flag, we can achieve a performance gain of 5% to 27%, depending on the workload. Link-time optimization and profile-guided optimization also have a greater impact when python3 is built this way.

For a vague-linkage function definition, a call site in the same translation unit may inline the callee. Whether -fno-semantic-interposition is enabled has no effect.

For a non-vague-linkage function definition, by default (-fsemantic-interposition) the -fpic mode does not allow a call site in the same translation unit to inline the callee or perform other interprocedural optimizations. -fno-semantic-interposition re-enables interprocedural optimizations.

If a caller inlines a callee, using LD_PRELOAD to interpose the callee will not affect that caller. Many other uses of LD_PRELOAD still work. We consider this small LD_PRELOAD limitation a good trade-off for the speedup.

Interposition is enabled by default in compilers like GCC: function calls to a library go through a "Procedure Linkage Table" (PLT). This indirection is required to allow a library loaded through the LD_PRELOAD environment variable to override a function. The indirection puts more pressure on the CPU level 1 cache (instruction cache). In terms of performance, the main drawback is that function calls from a library to the same library cannot be inlined, because inlining would violate the interposition semantics. Inlining is usually a big win in terms of performance.

Disabling interposition for libpython removes the overhead on function calls by avoiding the PLT indirection, and allows more function calls to be inlined. This applies to function calls from libpython to libpython, which are very common in Python: almost all function calls are calls from libpython to libpython.
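The difference between a call that is resolved through a table and a call that is bound directly can be illustrated with a loose analogy in Python. This is only an analogy, not how the dynamic linker works: the `symbols` dictionary, the function names, and `demo` are all hypothetical and stand in for the PLT, the library functions, and an LD_PRELOAD override respectively.

```python
# Analogy only: a dict lookup stands in for symbol resolution through the PLT.

def original():
    return "original"

def replacement():
    return "interposed"

# Hypothetical "symbol table": name -> current definition.
symbols = {"callee": original}

def caller_via_table():
    # Resolves "callee" at call time, so a later replacement is honoured
    # (comparable to a call that goes through the PLT).
    return symbols["callee"]()

def caller_direct():
    # Calls the original definition without consulting the table, so
    # replacing the table entry has no effect (comparable to an inlined
    # or direct call).
    return original()

def demo():
    saved = symbols["callee"]
    symbols["callee"] = replacement  # plays the role of LD_PRELOAD
    try:
        return caller_via_table(), caller_direct()
    finally:
        symbols["callee"] = saved  # restore the original entry

print(demo())
```

The table-based caller sees the replacement while the direct caller does not, which mirrors why an inlined libpython function can no longer be overridden from outside.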

If Fedora users need to use LD_PRELOAD to override symbols in libpython, the recommended way is to build a custom Python without -fno-semantic-interposition.

It is still possible to use LD_PRELOAD to override symbols in other libraries (for example in glibc).

Affected Pythons

Primarily, we will change the interpreter in the python3 package, that is Python 3.8 in Fedora 32 and any later version of Python in future Fedora releases.

Impact on other Python packages (and generally software using Python) is not anticipated (other than the possible speedup).

We will also change the alternate Python interpreters where possible and useful, primarily the upstream supported versions of CPython, such as python39 (if already packaged), python37 and python36.

Affected Fedora releases

This is a Fedora 32 change and it will be implemented in Rawhide (Fedora 32) only. Future versions of Fedora will inherit the change unless it is reverted.

If it turns out that there are absolutely no issues, we might consider backporting the speedup to already released Fedora versions (for example Fedora 31). Such an action would be coordinated separately with FESCo.

Benefit to Fedora

Python's performance will increase significantly, depending on the workload. Since many core components of the OS also depend on Python, this could improve their performance as well. However, individual benchmarks will need to be conducted to verify the performance gain for those components.

pyperformance results, ignoring differences smaller than 5%:

+-------------------------+------------------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 (original)  | python38-3.8.0-2 (changed)   |
+=========================+==============================+==============================+
| scimark_lu              | 294 ms                       | 213 ms: 1.38x faster (-27%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_sparse_mat_mult | 8.61 ms                      | 6.39 ms: 1.35x faster (-26%) |
+-------------------------+------------------------------+------------------------------+
| nbody                   | 236 ms                       | 179 ms: 1.32x faster (-24%)  |
+-------------------------+------------------------------+------------------------------+
| django_template         | 203 ms                       | 158 ms: 1.29x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| raytrace                | 910 ms                       | 709 ms: 1.28x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| logging_format          | 17.7 us                      | 13.8 us: 1.28x faster (-22%) |
+-------------------------+------------------------------+------------------------------+
| richards                | 124 ms                       | 97.2 ms: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| unpickle                | 23.9 us                      | 18.8 us: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| chaos                   | 200 ms                       | 158 ms: 1.26x faster (-21%)  |
+-------------------------+------------------------------+------------------------------+
| hexiom                  | 17.6 ms                      | 14.0 ms: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| logging_simple          | 15.8 us                      | 12.5 us: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| nqueens                 | 179 ms                       | 142 ms: 1.26x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| logging_silent          | 340 ns                       | 273 ns: 1.25x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| crypto_pyaes            | 201 ms                       | 162 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_fft             | 653 ms                       | 527 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_monte_carlo     | 190 ms                       | 154 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| pickle_pure_python      | 795 us                       | 646 us: 1.23x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| go                      | 443 ms                       | 361 ms: 1.23x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| deltablue               | 12.6 ms                      | 10.4 ms: 1.22x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| spectral_norm           | 245 ms                       | 201 ms: 1.22x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| float                   | 203 ms                       | 167 ms: 1.21x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| mako                    | 27.0 ms                      | 22.2 ms: 1.21x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| scimark_sor             | 347 ms                       | 286 ms: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_pure_python    | 575 us                       | 475 us: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| fannkuch                | 803 ms                       | 667 ms: 1.20x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| pathlib                 | 35.3 ms                      | 29.5 ms: 1.20x faster (-17%) |
+-------------------------+------------------------------+------------------------------+
| pyflate                 | 1.15 sec                     | 959 ms: 1.19x faster (-16%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_expand            | 707 ms                       | 600 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| regex_compile           | 303 ms                       | 258 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| chameleon               | 15.7 ms                      | 13.3 ms: 1.18x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| sympy_str               | 461 ms                       | 394 ms: 1.17x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| genshi_xml              | 104 ms                       | 88.4 ms: 1.17x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| dulwich_log             | 116 ms                       | 100 ms: 1.16x faster (-14%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_integrate         | 34.4 ms                      | 29.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| genshi_text             | 49.1 ms                      | 42.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| 2to3                    | 535 ms                       | 471 ms: 1.14x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| json_dumps              | 20.4 ms                      | 18.0 ms: 1.13x faster (-12%) |
+-------------------------+------------------------------+------------------------------+
| sympy_sum               | 285 ms                       | 252 ms: 1.13x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_process       | 128 ms                       | 114 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlite_synth            | 4.75 us                      | 4.24 us: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| telco                   | 10.1 ms                      | 8.98 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| meteor_contest          | 168 ms                       | 150 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_imperative   | 53.3 ms                      | 47.7 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| tornado_http            | 425 ms                       | 382 ms: 1.11x faster (-10%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_generate      | 159 ms                       | 144 ms: 1.10x faster (-9%)   |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_declarative  | 271 ms                       | 251 ms: 1.08x faster (-7%)   |
+-------------------------+------------------------------+------------------------------+
| json_loads              | 43.5 us                      | 40.4 us: 1.08x faster (-7%)  |
+-------------------------+------------------------------+------------------------------+
| python_startup          | 13.9 ms                      | 13.1 ms: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_list           | 6.68 us                      | 6.29 us: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
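As a reading aid for the right-hand column, the following minimal sketch shows how a "N.NNx faster" factor and a percentage change in run time relate to a pair of timings. The function name `speedup_and_change` is hypothetical, and the formulas are the usual definitions rather than a confirmed description of how the benchmark tool computes its output. The table was presumably produced from unrounded measurements, so recomputing from the rounded values shown can differ in the last digit.

```python
def speedup_and_change(old, new):
    """Return (speedup factor, percent change in run time).

    Both timings must be in the same unit.
    """
    speedup = old / new                         # > 1 means the new build is faster
    percent_change = (new - old) / old * 100.0  # negative means less time
    return speedup, percent_change

# scimark_lu row from the table: 294 ms before, 213 ms after.
factor, change = speedup_and_change(294.0, 213.0)
print(f"{factor:.2f}x faster ({change:.1f}%)")
```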

Scope

  • Other developers are encouraged to check if their package works as expected
  • Release engineering: N/A (not needed for this Change) -- this change does not require a mass rebuild nor any other special releng work
  • Policies and guidelines: N/A (not needed for this Change)
  • Trademark approval: N/A (not needed for this Change)

Upgrade/compatibility impact

Python package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.

How To Test

Test that everything Python related in Fedora works as usual.

Test whether the flag was applied

You can test whether the -fno-semantic-interposition flag was applied for your Python build:

>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
True
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST'))
True

Before the change, you would see False, False.
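The same check can be written as a small script that also tolerates builds where one of the variables is not defined, in which case `sysconfig.get_config_var` returns None and the string concatenation above would fail. This is a minimal sketch; the helper name `flag_in_config_vars` and its `get_var` parameter are hypothetical and exist only to make the check reusable.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"

def flag_in_config_vars(names, flag=FLAG, get_var=sysconfig.get_config_var):
    """Return True if `flag` appears in any of the named build variables."""
    for name in names:
        # get_config_var returns None for variables the build did not
        # define, so fall back to an empty string before splitting.
        value = get_var(name) or ""
        if flag in str(value).split():
            return True
    return False

compile_ok = flag_in_config_vars(["PY_CFLAGS", "PY_CFLAGS_NODIST"])
link_ok = flag_in_config_vars(["PY_LDFLAGS", "PY_LDFLAGS_NODIST"])
print("compiler flags:", compile_ok, "linker flags:", link_ok)
```

On a build with the flag applied, both values are expected to be True.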

Performance test

The performance speedup can be measured using the official Python benchmark suite pyperformance: see Run benchmarks.
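For a quick sanity check before a full pyperformance run, a rough timing can be taken with the standard library and compared between the two interpreter builds. This is only a sketch and not a substitute for pyperformance; the helper name `best_time` and the sample statement are assumptions chosen for illustration.

```python
import timeit

def best_time(stmt, number=1000, repeat=5):
    """Return the lowest per-run time in seconds over several repeats."""
    # timeit.repeat returns one total time per repeat, each covering
    # `number` executions of the statement.
    timings = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(timings) / number

per_run = best_time("sum(range(1000))")
print(f"{per_run * 1e6:.2f} us per run")
```

Running the same script under the original and the changed interpreter gives two numbers to compare; results from such a micro-benchmark are noisy and should be treated as indicative only.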

User Experience

Python-based workloads should see a performance gain of up to 27%.

Dependencies

This change is not dependent on anything else.

Contingency Plan

  • Contingency mechanism: If issues appear that cannot be fixed in a timely manner, the change can be easily reverted and will be considered again for the next Fedora release.
  • Contingency deadline: Before the beta freeze of Fedora 32 (2020-02-25)
  • Blocks release? Yes
  • Blocks product? None

Documentation

This change proposal has all the documentation.

See the previous change proposal and the thread about it on the devel mailing list for more information about what we are not doing.

Release Notes

TBD. Be sure to mention PEP 445 and 454.