Wednesday, June 30, 2010

Optimization changes, somewhat promising test results

I replaced the KeyError exception with a conditional inside the matcher's main loop; it turns out that's actually slower. I also added a flag on each node to avoid revisiting higher nodes on each run, which was a significant speedup since it reduces the comparisons by ~30%! Some new test results now look promising. I'm currently focusing on single-file runs since they are simpler. Running the command below:

time for ((c=1; c<=20; c++)); do
./test2to3.sh -f isinstance -f raise -f zip -f raw_input -f buffer \
  ../django-3k/django/core/servers/basehttp.py
done

yielded: 

real 0m10.324s 
user 0m9.709s 
sys 0m0.292s  

The results of a control run with the original code were these:
real 0m11.237s 
user 0m10.701s 
sys 0m0.308s  

(These were actually the best out of 10 runs, given time's inconsistency; results were still lower on average. I'll redo the tests with profile when I have more time.)

That's only a slight improvement, and there's a catch: basehttp.py is quite a large file. But I'm happy that it can at least run at roughly the same speed.
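For illustration, the node-flag idea can be sketched roughly like this (all names here are hypothetical, not the actual patch): each leaf's upward walk stops at the first node that an earlier walk has already marked, which is where the saved comparisons come from.

```python
# Hypothetical sketch of the "visited" flag (names are mine, not the
# actual patch): each leaf walks toward the root and stops at the
# first node a previous walk in the same run has already marked.

class Node:
    def __init__(self, children=()):
        self.children = list(children)
        self.parent = None
        self.seen = False
        for c in self.children:
            c.parent = self

def climb(leaf, visit):
    node = leaf
    while node is not None and not node.seen:
        node.seen = True             # flag so later walks stop here
        visit(node)
        node = node.parent

visited = []
a, b = Node(), Node()
root = Node([Node([a, b])])
climb(a, visited.append)
climb(b, visited.append)             # stops at the shared parent
print(len(visited))                  # 4 nodes visited instead of 6
```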

An issue I'm currently concerned about is that really common subpatterns may add significant overhead when they are matched. For example, )->trailer->power may be matched a few hundred times; each time, we'll have to update the match sets of perhaps 20-30 fixers and check whether each fixer matches.

If we remove common subpatterns, it will probably speed things up (for example, in the case of raw_input we can throw away all subpatterns except the one starting with 'raw_input'); but then the matcher won't be 100% accurate (although only in very few cases on well-formed code). It can, however, send candidate nodes to the existing matcher (that's what it does right now anyway, to fill the pattern variables), which can reject them. This could also deal with negated patterns: just ignore them and let the original matcher do the work. I'll give it a try next week to see if it justifies the added complexity.
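A rough sketch of that two-stage idea, with made-up names (Fixer, candidates, exact_match are illustrative only, not the 2to3 API): the reduced automaton only prefilters, and the existing exact matcher confirms candidates, fills in pattern variables, and handles the negated parts.

```python
# Two-stage matching sketch: a cheap, over-approximating prefilter
# followed by the exact matcher, which also rejects negated patterns.
# All names here are illustrative, not the real 2to3 API.

class Fixer:
    def __init__(self, name, head, forbidden=None):
        self.name = name
        self.head = head            # cheapest distinguishing subpattern
        self.forbidden = forbidden  # stands in for a negated pattern

    def exact_match(self, node_tokens):
        # The precise (more expensive) check; in the real matcher this
        # is also where the pattern variables get filled in.
        if self.forbidden and self.forbidden in node_tokens:
            return False
        return self.head in node_tokens

def candidates(node_tokens, fixers):
    # Prefilter: keep fixers whose head subpattern appears; this may
    # over-approximate, but never misses a true match.
    return [f for f in fixers if f.head in node_tokens]

def matches(node_tokens, fixers):
    return [f.name for f in candidates(node_tokens, fixers)
            if f.exact_match(node_tokens)]

fixers = [Fixer("raw_input", "raw_input"),
          Fixer("buffer", "buffer", forbidden="memoryview")]
print(matches(["raw_input", "(", ")"], fixers))   # ['raw_input']
print(matches(["buffer", "memoryview"], fixers))  # []
```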

Plus I changed the blog's template to something that won't hurt your eyes as much.

Monday, June 28, 2010

First version of the bottom matcher

I just committed the first draft version of the bottom matching algorithm. It includes the basic module that builds & runs the automaton, plus a separate module with utility functions.

What works:

  • raw_input, buffer, etc.: all fixers whose wildcards can be reduced
  • isinstance, and all fixers with wildcards at the bottom


What doesn't:

  1. Alternatives: they seem easy to implement and will be added next week
  2. Wildcards at higher levels of the pattern: somewhat rare, left out for now (see the ws_comma fixer for an example)
  3. Negative patterns: a match on one should prevent the whole pattern from matching. They rely on alternatives, so they are scheduled right after them


The code is a bit hackish at points, using module globals to grab the AST leaves etc. But as I said, it's a draft. No optimization effort has been made since this was mainly a proof of concept. I'm planning for a rewrite when most functionality is complete, and the algorithm is proven to be faster.

Still, I was a bit disappointed that it ran slightly slower than the current version; I was expecting at least a tiny improvement. I expect the automaton to pick up some speed when it's properly simulated by a table, as it is currently run by traversing linked nodes. Plus we can use threads to... OK, I'm just kidding :)

The code enters at refactor.py:RefactoringTool.refactor_string, where refactor_tree is replaced with the new refactor_tree_btm. Some (horrible) changes have been made to pytree.py. The code can be found at the repository: http://code.google.com/p/2to3-speedup2/ 

Tuesday, June 22, 2010

A preview of the fixer automaton and current progress

I hoped to have a prototype ready by this weekend, but school got in the way and I couldn't find the time to complete it. The automaton can currently add constant (wildcard-free) subpatterns to itself and run on a tree. I'm currently working on a function that converts a compiled pattern tree to its subpatterns in order to test more fixers; this turned out to be much harder than I thought, since Base/Node/WildcardPattern objects were not really made for this. But I'm close. Anyway, I've made a graph of the current automaton for the raw_input fixer to make the process clearer:


The pattern was power< name='raw_input' trailer< '(' [any] ')' > any* >. It was reduced (by hand) to its minimal set of subpatterns:



{ '('->trailer->power, raw_input->power, ')'->trailer->power }



from which the above automaton was built. Now consider adding the very similar pattern for the buffer fixer. The resulting automaton would look like this:



The automaton starts at each leaf, and when a subpattern is matched the current node is marked with its id. When all of a fixer's subpatterns match on a single node, the whole fixer matches. The patterns for other fixers will be added to the same automaton. There are some subtleties here: wildcard subpatterns need to be split at the wildcard positions and added as new subpatterns on all nodes starting with the next token, plus once on the root; alternatives must have their destination nodes combined in a single subpattern; repeaters with constant counts need to be matched multiple times; etc. But overall, I believe each issue can be handled without restricting the current fixer syntax.
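The marking scheme can be sketched like this (an illustrative toy, not the committed code): each subpattern is a bottom-up label path; when a path matches starting at some leaf, its id is recorded on the node where the path ends, and the fixer fires once all of its ids meet on one node.

```python
# Toy illustration (not the committed code): subpattern ids accumulate
# on the node where each bottom-up path ends; a fixer matches when all
# of its subpattern ids meet on a single node.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.hits = set()            # subpattern ids matched at this node
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    if not node.children:
        yield node
    for c in node.children:
        yield from leaves(c)

def match_path(leaf, path):
    # Walk upward from the leaf, checking one subpattern's labels,
    # e.g. ('raw_input', 'power'); return the end node or None.
    node = leaf
    if node.label != path[0]:
        return None
    for label in path[1:]:
        node = node.parent
        if node is None or node.label != label:
            return None
    return node

SUBPATTERNS = {                      # id -> bottom-up label path
    0: ("raw_input", "power"),
    1: ("(", "trailer", "power"),
    2: (")", "trailer", "power"),
}

def run(tree):
    matched = []
    for leaf in leaves(tree):
        for sid, path in SUBPATTERNS.items():
            end = match_path(leaf, path)
            if end is not None:
                end.hits.add(sid)
                if end.hits == set(SUBPATTERNS):   # all met on this node
                    matched.append(end)
    return matched

tree = Node("power", [Node("raw_input"),
                      Node("trailer", [Node("("), Node(")")])])
matched = run(tree)
print(len(matched), matched[0].label)
```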

Sunday, June 13, 2010

Progress Report

Exams start tomorrow and I've spent most of this week studying. But I've had the chance to start coding a prototype of the matching algorithm I proposed. It is completed to the point where it simulates the raw_input fixer pattern. Profiling results look promising; for example, this one fixer took about 0.087 CPU seconds on a single file (preprocessing aside).

Of course running this single fixer with the current algorithm took about 0.018 seconds. But the new algorithm should behave much better when more fixer patterns are added, with processing time remaining close to the current level, so all in all I find this promising.

Let me explain. In theory, bottom-up tree traversal takes number_of_leaves * tree_height steps. Say b is the maximum number of children of a node and h is the height; then this becomes (b^h)*h, since a full tree has up to b^h leaves. h is about log n, so the traversal is in the O(n*log n) class. Of course the constant factors depend on the characteristics of each tree. And since the automaton runs in constant time at each step, the number of comparisons is still O(n*log n).
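As a quick numeric sanity check of that estimate (the numbers b=2, h=10 are mine, picked for illustration, assuming a perfectly full tree):

```python
# Sanity-check the estimate: a full tree with branching factor b and
# height h has b**h leaves, and each leaf walks h steps to the root.
import math

b, h = 2, 10
num_leaves = b ** h                    # 1024
steps = num_leaves * h                 # bottom-up traversal work: 10240
n = sum(b ** i for i in range(h + 1)) # 2047 nodes in total
print(steps, n * math.log2(n))         # both are on the order of n*log(n)
```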

The current version of the algorithm runs in O(n*pattern_nodes). Since pattern_nodes can be considered constant compared to n, this is O(n). Repeaters can add overhead here when backtracking, depending on their position in the pattern tree. But let's consider this a constant factor too, just for simplicity (I believe they add a (b^h)*h factor as well; I haven't had time to check).

The important thing to note is that the current algorithm depends on the sizes of the patterns, which means that matching takes longer each time a pattern is added. The proposed algorithm is independent of the pattern sizes (except for the preprocessing part), which is why I expect its running time to stay close to the current level.

While writing this prototype I realized that it is unnecessary to store the leaf-to-root strings; we can simply keep a set of the tree's leaves and traverse upwards by asking each one for its parent. When a fixer transforms the tree, it currently marks the nodes it changed. This can be extended to propagate the "changed" label down to a node's children, and we can use this information to recheck only the descendant leaves.
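A sketch of those two ideas together (illustrative, not lib2to3's real Node API): walk upward via .parent instead of storing leaf-to-root strings, and propagate a "changed" flag downward so that only the affected leaves are rechecked.

```python
# Illustrative sketch (not lib2to3's real API): keep a set of leaves,
# recover each bottom-up string on demand via .parent, and flag changed
# subtrees so only their leaves are rechecked.

class TreeNode:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        self.was_changed = False
        for c in self.children:
            c.parent = self

def bottom_string(leaf):
    # Recover the leaf-to-root label sequence instead of storing it.
    labels, node = [], leaf
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels

def mark_changed(node):
    # A fixer rewrote this subtree: flag every descendant.
    node.was_changed = True
    for c in node.children:
        mark_changed(c)

name = TreeNode("raw_input")
lpar = TreeNode("(")
tree = TreeNode("power", [name, TreeNode("trailer", [lpar, TreeNode(")")])])

print(bottom_string(lpar))        # ['(', 'trailer', 'power']
mark_changed(tree.children[1])    # only the trailer subtree changed
leaf_set = {name, lpar}
print([l.label for l in leaf_set if l.was_changed])  # ['(']
```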

A couple of other realizations I came to: first, patterns with wildcards need to be split up, and the subpatterns then added to the automaton skipping x states, where x is the subpattern's position. I'm working out the implementation details. Second, there is great potential for threading in this algorithm: each thread could simulate the automaton on a different leaf in parallel, so this will be something to consider in the future.

Monday, June 7, 2010

Some benchmarking results, and an implementation plan

Well, I'm still not able to run profile for building django-3k, and I'm not sure what I'm doing wrong. But at least I got it to work for a single file. I used feedgenerator.py from the django-3k codebase and here is the result, using gprof2dot:


No surprises here: the program spends most of its time in the pattern matching functions, as was already known (the green boxes down-left). Since function calls are a significant overhead in Python, I'm still confident that a more iterative pattern matching algorithm, one that avoids all those recursions, will be able to speed things up. Anyway, as a first reference, here are the results of time:
Command being timed: "/usr/local/bin/python3 /home/giorgos/dev/gsoc/django-3k/setup.py build"
User time (seconds): 386.96
System time (seconds): 0.87
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:37.14
....
Note that I've built Python with threads disabled, because they often mess up the profiler. It shouldn't matter though: results are always relative, and threads are not used internally by 2to3.

Now, analyzing the algorithm I came up with, I'm pretty sure that in theory it should offer no benefit over the existing one, i.e. I believe it runs the same number of comparisons. Of course preprocessing a few parts may help overall, but still. On the other hand, as I mentioned above, I think a major issue with the existing algorithm is its recursive nature; the algorithm I propose may benefit from being a bit more "linear".

Anyway, running a few tests, it seems to me that the most computationally intensive part in practice is obtaining the bottom-up strings, and doing so every time the tree is altered. Profiling showed that it can take up to 0.5 seconds to obtain the full set; that's quite a lot, and since usually 1-2 fixers are applied per file, multiply that by 2. Not terrible, but I'll look for a faster way to obtain it in the future.

So a portion of the algorithm in pseudocode currently looks something like this:
get_bottom_strings:
  # Bottom strings will be lists of the format [(NodeId, TOKENTYPE), ...].
  for leaf in leaves:
    collect leaf.type + leaf.parent.bottom_string()  # recursively obtain the bottom-up string
  return bottom_strings

get_pattern_dict:
  for fixer in fixers:
    reduce_repeaters(fixer.pattern_tree)  # reduce repeaters to obtain a constant tree
        # -> each WildcardPattern is reduced to its MIN; this may actually create nodes
    for leaf in fixer.pattern_tree.leaves:
      obtain the leaf's bottom-up string
  return pattern_dict
Then we use the pattern dictionaries to build the AC automaton. Now, simulating the AC automaton in Python may be too slow for our purposes here, so perhaps a C version could be considered in the future if that's an option. The automaton is constructed by adding one string at a time, as detailed in the Aho-Corasick paper, with nodes represented by an ACNode type holding a dictionary as a transition table. On each final node, callbacks will be added to mark the position of the match. Finally, a match dictionary keyed by node id will be used to check whether a fixer has been matched at any point.
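To make that concrete, here is a minimal Aho-Corasick sketch along those lines (all names are illustrative, and plain characters stand in for the (NodeId, TOKENTYPE) symbols): an ACNode with a dict transition table, strings added one at a time, failure links wired up with a breadth-first pass.

```python
# Minimal Aho-Corasick sketch: ACNode with a dict transition table,
# patterns inserted one string at a time, failure links set by BFS.
from collections import deque

class ACNode:
    def __init__(self):
        self.transition_table = {}   # symbol -> ACNode
        self.fail = None             # failure link
        self.outputs = []            # ids of patterns ending here

def build_automaton(patterns):
    root = ACNode()
    for pid, pattern in enumerate(patterns):       # add one string at a time
        node = root
        for symbol in pattern:
            node = node.transition_table.setdefault(symbol, ACNode())
        node.outputs.append(pid)
    queue = deque()                                # BFS to set failure links
    for child in root.transition_table.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for symbol, child in node.transition_table.items():
            fail = node.fail
            while fail is not root and symbol not in fail.transition_table:
                fail = fail.fail
            child.fail = fail.transition_table.get(symbol, root)
            child.outputs += child.fail.outputs    # inherit suffix matches
            queue.append(child)
    return root

def run_automaton(root, symbols):
    node, matches = root, []                       # (end_position, pattern_id)
    for i, symbol in enumerate(symbols):
        while node is not root and symbol not in node.transition_table:
            node = node.fail
        node = node.transition_table.get(symbol, root)
        for pid in node.outputs:
            matches.append((i, pid))
    return matches

patterns = ["he", "she", "his", "hers"]
hits = run_automaton(build_automaton(patterns), "ushers")
print([(i, patterns[pid]) for i, pid in hits])
# [(3, 'she'), (3, 'he'), (5, 'hers')]
```

Final nodes here just record pattern ids in outputs; the callbacks and the per-node match dictionary described above would hang off the same places.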

So that pretty much covers it. I'm at the stage of planning how to fit all this into the existing code. I hope I'll have some time this week (my term exams start on Monday) to push this further.
