From Fedora Project Wiki
Revision as of 18:12, 24 August 2010

Some speed optimization ideas for yum

  • use Cython to compile one or more of the .py files to .c code and compile them into DSOs
  • use PyPy; would require building out a full PyPy stack (an alternative implementation of Python). Last time I looked at the generated .c code, I wasn't comfortable debugging the result (I didn't feel that debugging a crash in it would be feasible at 3am)
  • use Unladen Swallow for Python 2: would require porting the US 2.6 stack to 2.7, and a separate Python stack
  • use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3); port yum to python 3

Using Cython seems to be the least invasive approach.

I've looked at the generated code and it seems debuggable; I'd be able to debug any issues that arise.

Notes on Cython

From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."

The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.

This would ensure that yum can continue to be usable without Cython: no non-standard language features.

It might perhaps be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.

Going further, we could perhaps introduce pragma-style comments into the yum sources, e.g.

  # cython: isinstance(foo, dict)

which Cython could take as an optimization hint to generate specialized code for handling the case when foo is a PyDictObject (might be more productive to support exact type matches here, though).
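As an illustrative sketch (the hint and the function name here are hypothetical, not existing Cython features), the specialization would look like this at the Python level, where both branches are deliberately identical; the difference would only appear in the generated C:

```python
def has_key_fast(mapping, key):
    # Hypothetical fast path: with an exact-type hint, the generated C
    # could branch straight to the PyDict_ API (e.g. PyDict_Contains).
    if type(mapping) is dict:
        return key in mapping
    # Generic fallback, preserving semantics for dict subclasses that
    # override __contains__, and for other mapping types.
    return key in mapping
```

An exact-type check (type(mapping) is dict rather than isinstance) is what makes the direct PyDict_ call safe, since a subclass could override __contains__.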

Example of generated code

See depsolve.html. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that it was generated with a development copy of Cython, which has support for lambdas.

Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to: this is important for my own comfort level in feeling that this is supportable.

Optimization potential

Why would it be faster to simply compile with Cython?

  • it avoids the bytecode dispatch loop: operations in the module are "unrolled" directly into libpython API calls, which the C compiler can optimize individually: error handling can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction. The result should also be more amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending time in the .py code.
  • it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced
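To see what the dispatch loop is working through, here is a quick look at the bytecode for a trivial function, using the stdlib dis module (the exact opcode names vary across CPython versions):

```python
import dis

def add(a, b):
    return a + b

# Each instruction printed below costs one trip around CPython's eval
# loop; a Cython build instead emits straight-line C calling the
# libpython API, with no per-instruction dispatch.
for ins in dis.Bytecode(add):
    print(ins.opname)
```

The locals a and b each need a LOAD before the add, plus a RETURN afterwards: four or so dispatched instructions for one addition.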

Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than double-checking on each call what the value of __builtins__.len is and then calling it. This is a semantic difference from regular Python, and it rules out some monkey-patching, but I think it's a reasonable optimization.
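The semantic difference can be demonstrated in plain CPython, where builtins are looked up at call time (a sketch; the function names are mine):

```python
import builtins

def strlen(s):
    # In interpreted Python, "len" is resolved through the builtins
    # namespace on every call; Cython would bake in PyObject_Length.
    return len(s)

assert strlen("abc") == 3
orig_len = builtins.len
builtins.len = lambda obj: 999   # monkey-patch the builtin
try:
    patched = strlen("abc")      # interpreted code sees the patch
finally:
    builtins.len = orig_len      # always restore the real builtin
assert patched == 999
# A Cython-compiled strlen would still have returned 3 here:
# that is exactly the monkey-patching that gets ruled out.
```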

Pessimization risks

  • looking at _sort_reqs(), the line
mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,

uses a dictionary which is never modified. The Cython-generated C code is rebuilding this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:

>>> dis(depsolve.Depsolve._sort_reqs)
811           0 BUILD_MAP                6
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 ('EQ')
              9 STORE_MAP           
             10 LOAD_CONST               3 (2)
             13 LOAD_CONST               4 ('LT')
             16 STORE_MAP           
             17 LOAD_CONST               5 (3)
             20 LOAD_CONST               6 ('LE')
             23 STORE_MAP           
             24 LOAD_CONST               7 (4)
             27 LOAD_CONST               8 ('GT')
             30 STORE_MAP           
             31 LOAD_CONST               9 (5)
             34 LOAD_CONST              10 ('GE')
             37 STORE_MAP           

TODO:

  • measure the impact of using a Cython .c build of depsolve.py
    • try building with Cython

Various optimization notes

Profile! Profile! Profile!

  • lots of ideas here; approach: profile yum with a Cython build and find the slow spots, and add optimizations to Cython to generate better .c code
  • foo.has_key() is generating a function call each time; might perhaps test for dictionaries and have a specialized implementation that goes directly to the PyDict_ API (though arguably this is a semantic change)
  • const string literal "" % (args): it's doing a PyNumber_Remainder; it could use typechecking and convert it in-place to a PyString_Format call, avoiding various indirections (much harder: compile this down to specialized C code, based on the exact format strings). Profile! This is a special case of specializing binops where we know the types: would this be a new optimization for Cython? BinopNode.
  • the min() builtin is being invoked on a pair of ints when it could be inlined directly (in _common_prefix_len(): num = min(len(x), len(y))); profile, profile, profile!
  • quite a lot of complexity in parsing the arguments as well. Perhaps we could inline functions directly, with a fast path that detects when a call site is invoking a known function, avoiding the thunking at both ends; profile, profile, profile!
  • cython-0.12.1 (in F13) didn't do some const-propagation/folding, but this seems to be fixed in cython-devel: transactioninfo.py has:
 return txmember.ts_state in ('u', 'i')
 and not isinstance(txmember.po, (YumInstalledPackage, YumAvailablePackageSqlite))

and the 0.12.1-generated C was constructing the tuples ('u', 'i') and (YumInstalledPackage, YumAvailablePackageSqlite) on every call; the former can be a constant. The latest devel version of Cython does fold it, reducing the membership test to a pair of equality comparisons against the constants "u" and "i"; see Cython/Compiler/Optimize.py. (We could profile yum and find further optimizations.)
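The folding in question is equivalent to this source-level rewrite (a sketch with hypothetical function names; newer Cython does this transformation internally):

```python
def ts_pending(ts_state):
    # Membership test against a constant tuple: cython-0.12.1 rebuilt
    # the tuple ('u', 'i') on every call.
    return ts_state in ('u', 'i')

def ts_pending_folded(ts_state):
    # What cython-devel reduces it to: two comparisons against
    # constant strings, with no tuple construction at all.
    return ts_state == 'u' or ts_state == 'i'

# The two forms agree on every input.
for s in ('u', 'i', 'e', 'od', ''):
    assert ts_pending(s) == ts_pending_folded(s)
```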