Some speed optimization ideas for yum
- use Cython to compile one or more of the .py files to .c code and compile them into DSOs
- use PyPy: would require building out a full PyPy stack, an alternative implementation of Python. The last time I looked at the generated .c code, I wasn't comfortable with the result (I didn't feel that debugging a crash in it would be feasible at 3am)
- use Unladen Swallow for Python 2: would require porting the US 2.6 stack to 2.7, and maintaining a separate Python stack
- use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3), and port yum to Python 3
Using Cython seems to be the least invasive approach.
I've looked at the generated code and it seems debuggable; I'd be able to debug issues as they arise.
Notes on Cython
From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."
The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.
This would ensure that yum can continue to be usable without Cython: no non-standard language features.
It might perhaps be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.
Going further, we could perhaps introduce pragma-style comments in the yum sources, e.g.
# cython: isinstance(foo, dict)
which Cython could take as an optimization hint to generate specialized code for handling the case when foo is a PyDictObject (might be more productive to support exact type matches here, though).
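As a sketch of what such a hint could buy (the pragma spelling and the helper below are hypothetical, not an existing Cython directive): the generated code could test for an exact dict and branch to a fast path. In Python both branches behave identically; in the generated C the first would use the PyDict_* API directly.

```python
def get_dep(foo, key):
    # Hypothetical illustration: with a hint like
    #     # cython: isinstance(foo, dict)
    # above this function, a compiler could emit an exact-type fast path.
    # Both branches do the same thing at the Python level; in C the first
    # would be a direct dict lookup, the second the generic object protocol.
    if type(foo) is dict:           # exact type match, as suggested above
        return foo.get(key)         # fast path: direct dict lookup
    return foo.get(key)             # generic fallback path
```

Using an exact type match (rather than isinstance) keeps the fast path safe: a dict subclass overriding __getitem__ would still take the generic path.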
Example of generated code
See depsolve.html. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that this was generated using a development copy of Cython, which has some support for lambdas.
Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to: this is important for my own comfort level, in feeling that this is supportable.
Optimization potential
Why would it be faster to simply compile with Cython?
- it avoids the bytecode dispatch loop: the module's operations are "unrolled" directly into libpython API calls, which the C compiler can optimize individually: error-handling paths can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction. The result should also be more directly amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending time in the .py code.
- it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced
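To see the dispatch overhead being described, the dis module shows the opcodes the eval loop must dispatch one at a time; a Cython build would emit the corresponding C API calls (PyNumber_Add and its error check, here) directly. The toy function is illustrative, not from yum, and exact opcode names vary between CPython versions.

```python
import dis

def double(x):
    return x + x

# Each instruction below costs a trip through CPython's bytecode dispatch
# loop; Cython-generated C makes the libpython calls directly instead.
ops = [ins.opname for ins in dis.get_instructions(double)]
```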
Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than checking on each call what the value of __builtins__.len is and then calling it. This is a semantic difference from regular Python, and it rules out some monkey-patching, but I think it's a reasonable optimization.
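A small illustration of that semantic difference (hypothetical helper; under Python 2 the module is __builtin__ rather than builtins). CPython looks len up on every call and so sees the patch; a Cython-compiled strlen would keep returning 3:

```python
import builtins

def strlen(s):
    # In a Cython build this len() call is baked in as PyObject_Length;
    # under CPython it is looked up via __builtins__ on every call.
    return len(s)

orig = builtins.len
builtins.len = lambda obj: 42      # monkey-patch the builtin
try:
    patched_result = strlen("abc") # CPython sees the patch: 42
finally:
    builtins.len = orig            # restore the real len
```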
Pessimization risks
- looking at _sort_reqs(), the line
mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,
uses a dictionary which is never modified. The Cython-generated C code rebuilds this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:
>>> dis(depsolve.Depsolve._sort_reqs)
811           0 BUILD_MAP                6
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 ('EQ')
              9 STORE_MAP
             10 LOAD_CONST               3 (2)
             13 LOAD_CONST               4 ('LT')
             16 STORE_MAP
             17 LOAD_CONST               5 (3)
             20 LOAD_CONST               6 ('LE')
             23 STORE_MAP
             24 LOAD_CONST               7 (4)
             27 LOAD_CONST               8 ('GT')
             30 STORE_MAP
             31 LOAD_CONST               9 (5)
             34 LOAD_CONST              10 ('GE')
             37 STORE_MAP
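One source-level workaround, sketched with the entries visible above (the function names are illustrative; the real site is _sort_reqs() in depsolve.py, whose dict has more entries), is to hoist the literal to module level so it is built once at import time:

```python
# Built on every call, matching the BUILD_MAP/STORE_MAP sequence above:
def flag_rank_slow(flag):
    mapper = {'EQ': 1, 'LT': 2, 'LE': 3, 'GT': 4, 'GE': 5}
    return mapper[flag]

# Built once, at import time; lookups then skip the rebuild entirely:
_FLAG_MAPPER = {'EQ': 1, 'LT': 2, 'LE': 3, 'GT': 4, 'GE': 5}

def flag_rank_fast(flag):
    return _FLAG_MAPPER[flag]
```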
TODO:
- measure the impact of using a Cython .c build of depsolve.py
- try building with Cython
Various optimization notes
Profile! Profile! Profile!
- lots of ideas here; approach: profile yum with a Cython build and find the slow spots, and add optimizations to Cython to generate better .c code
- foo.has_key() generates a full method call each time; we could perhaps test for dictionaries and have a specialized implementation that goes directly to the PyDict_ API (though arguably this is a semantic change)
- const string literal "" % (args): it's doing a PyNumber_Remainder; with typechecking it could be converted in place to a PyString_Format call, avoiding various indirections. (Much harder: compile this down to specialized C code based on the exact format strings.) Profile!!! This is a special case of specializing binops where we know the types: would this be a new optimization for Cython? See BinopNode.
- min() builtin is being invoked on a pair of ints, when it could just inline directly: (in _common_prefix_len(): num = min(len(x), len(y)) ); profile, profile, profile!!!
- there is quite a lot of complexity in parsing the arguments as well. Perhaps we could directly inline functions, with a fast path that detects when a call site is invoking the known function, avoiding the thunking at both ends? Profile, profile, profile!
- cython-0.12.1 (in F13) didn't do some const-propagation/folding, but this seems to be fixed in cython-devel: transactioninfo.py has:
return txmember.ts_state in ('u', 'i')
and not isinstance(txmember.po, (YumInstalledPackage, YumAvailablePackageSqlite))
and the C generated by 0.12.1 was constructing the tuples ('u', 'i') and (YumInstalledPackage, YumAvailablePackageSqlite) every time. The former can be a constant, and the latest devel version of Cython now makes it one, reducing the test to a pair of equality comparisons against const "u" and const "i"; see Cython/Compiler/Optimize.py. (We could profile yum and find further optimizations like this.)
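The tuple folding just described amounts to the following source-level transformation (function names are illustrative; the real site is in transactioninfo.py):

```python
# What cython-0.12.1 generated code for: the tuple ('u', 'i') was
# constructed on every call before the membership test.
def is_installing_slow(ts_state):
    return ts_state in ('u', 'i')

# What the folded version reduces to: two equality comparisons against
# the constant strings "u" and "i", with no tuple construction.
def is_installing_fast(ts_state):
    return ts_state == 'u' or ts_state == 'i'
```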
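The has_key() note above could be handled with an exact-type test plus a generic fallback, sketched here (hypothetical helper; the fallback spelling is Python 2's):

```python
def dict_has_key(obj, key):
    # Specialization sketch for foo.has_key(key): exact dicts take a
    # direct path (PyDict_Contains in the generated C); anything else
    # falls back to the generic method call. The semantic-change caveat
    # from the notes applies: a dict subclass overriding __contains__
    # would be bypassed by the fast path.
    if type(obj) is dict:
        return key in obj
    return obj.has_key(key)   # Python 2 spelling; generic fallback
```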
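The min() note above, using a reconstruction of _common_prefix_len() (the loop body is my guess at the shape of the yum code, not a copy of it):

```python
def common_prefix_len(x, y):
    # As in the note: min() is a full builtin call made on two ints.
    num = min(len(x), len(y))
    for i in range(num):
        if x[i] != y[i]:
            return i
    return num

def common_prefix_len_inlined(x, y):
    # What inlining min() amounts to: one comparison, no call overhead.
    lx, ly = len(x), len(y)
    num = lx if lx < ly else ly
    for i in range(num):
        if x[i] != y[i]:
            return i
    return num
```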