= Some speed optimization ideas for yum =
* use Cython to compile one or more of the .py files to .c code and compile them into DSOs
* use PyPy; would require building out a full PyPy stack: an alternative implementation of Python. Last time I looked at the generated .c code, I wasn't comfortable debugging the result (I didn't feel that debugging a crash in the result would be feasible at 3am)
* use Unladen Swallow for Python 2: would require porting the US 2.6 stack to 2.7, and a separate Python stack
* use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3); port yum to python 3

Using Cython seems to be the least invasive approach.
I've looked at the generated code and it seems debuggable; I'd be able to debug issues arising.
= Notes on Cython =
From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."
The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.

This would ensure that yum can continue to be usable without Cython: no non-standard language features.

It might perhaps be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.
Going further, we could perhaps introduce pragma-style comments to yum sources, e.g.
<pre># cython: isinstance(foo, dict)</pre>
which Cython could take as an optimization hint to generate specialized code for handling the case when foo is a PyDictObject (might be more productive to support exact type matches here, though).
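As a concrete sketch of the idea (the function name and the fast-path behavior described in the comments are hypothetical, not existing Cython behavior):
<pre>
# Hypothetical example of the proposed hint; this is plain Python and
# stays runnable without Cython, since the hint is just a comment.
def count_provides(pkg_map):
    # cython: isinstance(pkg_map, dict)
    # A hint-aware Cython could emit a fast path using PyDict_Next()
    # when pkg_map is exactly a dict, falling back to the generic
    # iteration protocol otherwise.
    total = 0
    for name in pkg_map:
        total += len(pkg_map[name])
    return total
</pre>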
== Example of generated code ==
See [http://dmalcolm.fedorapeople.org/python-packaging/depsolve.html depsolve.html]. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that this was generated using a development copy of Cython, which has support for some lambdas.
Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to: this is important for my own comfort level, in feeling that this is supportable.
== Optimization potential ==
Why would it be faster to simply compile with Cython?
* it avoids the bytecode dispatch loop: the module's operations are "unrolled" directly into libpython API calls, which the C compiler can optimize individually: error handling can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction. The result should also be more directly amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending time in the .py code.
* it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced (see the sketch after this list)
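To make the dispatch-loop and stack-traffic points concrete, here is a small, self-contained illustration (the function is invented for demonstration):
<pre>
import dis

def add_len(a, b):
    return len(a) + len(b)

# Each LOAD_GLOBAL/LOAD_FAST/CALL_FUNCTION/BINARY_ADD line printed below
# is one pass through the interpreter's dispatch loop, with intermediate
# values pushed onto and popped off the value stack; the Cython-generated
# .c code performs the equivalent libpython calls directly, holding
# temporaries in C locals.
dis.dis(add_len)
</pre>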
Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than double-checking each time what the value of __builtins__.len is, and calling it. So this is a semantic difference from regular Python, and some monkey-patching is ruled out, but I think it's a reasonable optimization.
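A minimal sketch of the semantic difference (the function and values are invented for illustration): under CPython, rebinding the builtin changes the behavior of already-defined functions, whereas a Cython-compiled middle() would keep calling PyObject_Length and still return 2:
<pre>
import __builtin__  # Python 2; the module is named "builtins" on Python 3

def middle(seq):
    return seq[len(seq) // 2]

print(middle([1, 2, 3]))        # 2

# Monkey-patch the builtin: interpreted code sees the new binding...
original_len = __builtin__.len
__builtin__.len = lambda seq: 0
print(middle([1, 2, 3]))        # now 1 under CPython
__builtin__.len = original_len  # ...restore the real len
</pre>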
== Pessimization risks ==
* looking at _sort_reqs(), the line
<pre>mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,</pre> uses a dictionary which is never modified. The Cython-generated C code rebuilds this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:
<pre>
>>> dis(depsolve.Depsolve._sort_reqs)
811           0 BUILD_MAP                6
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 ('EQ')
              9 STORE_MAP
             10 LOAD_CONST               3 (2)
             13 LOAD_CONST               4 ('LT')
             16 STORE_MAP
             17 LOAD_CONST               5 (3)
             20 LOAD_CONST               6 ('LE')
             23 STORE_MAP
             24 LOAD_CONST               7 (4)
             27 LOAD_CONST               8 ('GT')
             30 STORE_MAP
             31 LOAD_CONST               9 (5)
             34 LOAD_CONST              10 ('GE')
             37 STORE_MAP
</pre>
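One possible source-level fix, sketched below with invented names (the real _sort_reqs() has a different signature), is to hoist the constant dict to module scope, so it is built once at import time rather than on every call, under either CPython or Cython:
<pre>
# Build the mapping once at import time instead of per call.
_FLAG_ORDER = {'EQ': 1, 'LT': 2, 'LE': 3, 'GT': 4, 'GE': 5}

def flag_rank(flag):
    # Dictionary lookup only; no BUILD_MAP on each invocation.
    return _FLAG_ORDER.get(flag, 10)
</pre>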
TODO:
* measure the impact of using a Cython .c build of depsolve.py
** try building with Cython
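For the "try building with Cython" item, a minimal build sketch might look like the following; the module and file names follow the depsolve.py example above, and this assumes the Cython.Distutils build_ext machinery, which can compile plain .py sources:
<pre>
# setup.py: compile depsolve.py as-is into an extension module (DSO).
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension('depsolve', ['depsolve.py'])],
)
</pre>
Building with "python setup.py build_ext --inplace" should then produce a depsolve.so that shadows depsolve.py on sys.path, allowing timing comparisons against the pure-Python module.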