Some speed optimization ideas for yum
- use Cython to compile one or more of the .py files to .c code and compile them into DSOs
- use PyPy; would require building out a full PyPy stack: an alternative implementation of Python. Last time I looked a the generated .c code, I wasn't comfortable debugging the result (I didn't feel that debugging a crash in the result would be feasible at 3am)
- use Unladen Swallow for Python 2: would require porting the US 2.6 stack to 2.7, and a separate Python stack
- use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3); port yum to python 3
Using Cython seems to be the least invasive approach.
I've looked at the generated code and it seems debuggable; I'd be able to debug issues arising.
Notes on Cython
From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."
The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.
This would ensure that yum can continue to be usable without Cython: no non-standard language features.
It might perhaps be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.
Going further, we could perhaps introduce pragma-style comments to yum sources e.g
# cython: isinstance(foo, dict)
which Cython could take as an optimization hint to generate specialized code for handling the case when foo is a PyDictObject (might be more productive to support exact type matches here, though).
Example of generated code
See depsolve.html. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that this was generated using a development copy of Cython, which has the support for some lambdas.
Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to: this is important for my own comfort level, in feeling that this is supportable.
Optimization potential
Why would it be faster to simply compile with Cython?
- it avoids the bytecode dispatch loop: operations by the module are "unrolled" directly into libpython API calls, and this can be individually optimized by the C compiler: error handling can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction; the result should also be more directly amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending in the .py code.
- it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced
Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than doublechecking each time what the value of __builtins__.len is, and calling it. So this is a semantic difference from regular Python, and some monkey-patching is ruled out, but I think it's a reasonable optimization.
Pessimization risks
- looking at _sort_reqs(), the line
mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,
uses a dictionary which is never modified. The Cython-generated C code is rebuilding this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:
>>> dis(depsolve.Depsolve._sort_reqs) 811 0 BUILD_MAP 6 3 LOAD_CONST 1 (1) 6 LOAD_CONST 2 ('EQ') 9 STORE_MAP 10 LOAD_CONST 3 (2) 13 LOAD_CONST 4 ('LT') 16 STORE_MAP 17 LOAD_CONST 5 (3) 20 LOAD_CONST 6 ('LE') 23 STORE_MAP 24 LOAD_CONST 7 (4) 27 LOAD_CONST 8 ('GT') 30 STORE_MAP 31 LOAD_CONST 9 (5) 34 LOAD_CONST 10 ('GE') 37 STORE_MAP
TODO:
- measure the impact of using a Cython .c build of depsolve.py
- try building with Cython