Sunday, August 15, 2010

Code committed to the sandbox and an overview of the project

After porting the code to 2.x, which proved to be a somewhat tricky process, the code was committed to the 2to3 sandbox repository. This concludes the project on my part, although I will still be around to provide support for the new code, at least until it reaches sufficient maturity.

The project was overall a great experience for me; the proposal was challenging on both the algorithmic and the coding side. What I've really come to realize is how much the theoretical description of a problem can differ from its practical applications. Matching tree regular expressions efficiently in 2to3's context went well beyond what the research articles I had in mind at first offered, and implementing a new algorithm that retains its soundness while optimizing it was a quite stressful, but rewarding, experience. Most of the effort was spent on preserving compatibility with the previous matcher; the fixers were written to match and transform code on the assumption of a pre/post-order traversal of the AST, and various issues had to be handled on a case-by-case basis.

Although the benefits proved substantial, the effort taken and the complexity of the process often made me wonder whether the whole concept is worth a few machine cycles; admittedly, as is also the case with 2to3 itself, optimizations suffer from a low impressiveness/complexity ratio that tends to drive people away from involvement and involved people to insanity. But in conclusion, I'm proud (and, to be honest, still find it hard to believe) that this thing actually worked.

Wednesday, July 21, 2010

Support for all default fixers and some great results!

The "characteristic path" approach proved excellent. Not only did it prove sufficient for alternatives support, it also greatly simplified and sped up the matcher. For one thing, it eliminated the need for wildcards; it turns out that any meaningful pattern always seems to have a constant part that can be matched instead. Negated patterns are skipped as well. The results from running the new 2to3 on django-3k were:

real 1m39.585s
user 1m36.218s
sys 0m0.660s

The control run took:

real 3m18.960s
user 3m6.820s
sys 0m0.920s

That's around a 50% speedup. However there are still some bugs, and the output from the old and new algorithms doesn't match 100% (although they differ in relatively few cases). Most of the differences look minor and I'll try fixing them this week. Re-transformed code is a bit of an issue; the current approach uses the old matcher for the transformed nodes (which adds around 10 seconds). And in some cases the old algorithm seems to be in error, changing for example x.values() to list(list(x.values())) instead of list(x.values()) (although I may be wrong here, I'll check again with other versions).

'any' at higher levels is not supported, so the two (non-default) fixers that use it won't work. I'm not sure if they are worth it, since supporting them may complicate the code and slow down the matcher. I'll look into that too.

I've uploaded the code to the repository. You can test it using the instructions from the previous post (although there are now no debugging messages). The code is a mess of course, and I'll try to clean it up over the next few days. But overall I'm really satisfied with these results; I think I've made some good progress and there is still plenty of time for more optimizations. Off for a beer.

Tuesday, July 20, 2010

Progress for this week

Just a quick update before today's meeting. I've spent the last week trying out approaches to support alternatives and I had to start from scratch twice. I certainly won't use the word 'easy' so freely again.

First of all, I realized that it's impossible to reliably deduce the full pattern structure from the compiled pattern, so I spent much of my time rewriting this part to deduce the subpatterns from the intermediate tree that the pattern compiler generates. There were some challenges, but I think I've reached the point where patterns can be extracted reliably.

Now, my first approach was converting a pattern with alternatives into multiple simpler patterns, creating b^n new patterns, where n is the nesting depth and b the number of alternatives per level of nesting. Needless to say, this quickly exploded the number of patterns, so I quickly abandoned it.

The second approach was adding multiple transitions at nodes that correspond to alternatives. This approach is a bit more complex; for one thing, there are cases where a single pattern is an alternative to a group of patterns, e.g. (a | b c). This means that matching 'a' is equivalent to matching 'b' and 'c'; that goes beyond the automaton's capabilities, since we would need to count. The same happens for (a | b<c>), which requires extra nodes, let alone (a | b<c> d<e>) and so on. And on top of all these, any of them could be a wildcard. The cases may be few, but there are enough of them. Attempting to implement this approach was frustrating; there were always more cases that kept breaking the code. Plus, the complexity of the whole thing gave me the impression that it would really slow down the matcher.

So after thinking this over, I realized that the 'characteristic pattern' concept I had mentioned could simplify alternatives support, since only a single linear pattern is matched, with multiple transitions for each group of alternatives, eliminating most of the complexity of the previous approach. At this point the code is already able to extract these single patterns; it turns out to be much easier than I thought at first, and the results look pretty good. Most fixer patterns contain an uncommon keyword or character, so the simple approach of preferring those subpatterns yields satisfying results. For example, the result for the fix_filter pattern is:

[(['lambda', 300, 259, 332, 308], ['filter', 308], ['filter', 308])]

The tuple represents a group of alternatives. The last two alternatives can be merged, so we are left with only two. Using these subpatterns to match this fairly complex fixer means that we only check the candidates that start with these not-so-common keywords, 'lambda' and 'filter'. Now, as far as alternatives are concerned, the logic is to add those subpatterns as if they weren't alternatives and merge their final states. Having a single subpattern to match also skips the step of checking for a complete match set for a fixer, since a single match is now adequate.
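As a rough sketch of the merging idea (hypothetical helper code, not the project's actual module; the numeric labels are just the grammar symbol numbers from the tuple above), a group of alternative subpatterns can be added branch by branch, with every branch ending in one shared final state:

```python
from itertools import count

def add_alternatives(alternatives):
    """Add a group of alternative linear subpatterns; all share one final state."""
    transitions = {}               # (state, label) -> state
    fresh = count(1)               # state 0 is the shared start state
    FINAL = 'F'                    # the single, merged final state
    for path in alternatives:
        state = 0
        for label in path[:-1]:
            key = (state, label)
            if key not in transitions:
                transitions[key] = next(fresh)
            state = transitions[key]
        transitions[(state, path[-1])] = FINAL   # merge the final states
    return transitions, FINAL

# The two (merged) alternatives extracted for fix_filter above:
transitions, final = add_alternatives([
    ['lambda', 300, 259, 332, 308],
    ['filter', 308],
])
```

Reaching the shared final state through either branch then counts as matching the whole group, which is exactly why a single match is adequate.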

Of course this means that the accuracy of the matcher is somewhat reduced and more candidate nodes are marked for further matching, e.g. buffer('test') but also buffer.test(); but this is obviously not common, and from what I've seen so far it won't be much of a concern. A benefit of using an inexact matcher is that it simplifies the handling of negative patterns: they are treated as wildcards and the actual matcher is responsible for sorting them out.

So, overall, I think that I'm finally getting closer to a matcher that works for almost all fixers (with the exception of the explicit ws_comma and idioms fixers, which use wildcards at higher levels).

Monday, July 12, 2010

Instructions for running the new matcher algorithm so far

I've added some logging messages to the bottom matcher module to illustrate the matching process.

Now, in order to test the new module:
1) Check out a local copy:
svn checkout http://2to3-speedup2.googlecode.com/svn/trunk/ 2to3
2) Create a small file for testing. Add a few statements plus a statement that matches a fixer.
For example:

#test.py
import sys

buffer("Hello World")

(we'll try to match fix_buffer). More complex files can be tested of course, but it's harder to keep track of the process for a larger file.
3) Run ./test2to3.sh with -f options for the currently working fixers (buffer, raw_input, isinstance, raise etc.). Add the -v option to get the tracing messages.


For example, run:

./test2to3.sh -f buffer test.py -v

First the automaton will be printed in DOT format. You can paste the result in http://graph.gafol.net/ to get a quick rendering.

This is the output from running the above command (some of it is omitted):



RefactoringTool: Adding transformation: buffer
RefactoringTool: Refactoring test.py
RefactoringTool: Adding fixer patterns to the automaton
RefactoringTool: The automaton is printed below in DOT
digraph g{
0 -> 1 [label=buffer] //[]
1 -> 2 [label=power] //[(0, 1)]
0 -> 3 [label=8] //[]
3 -> 4 [label=trailer] //[]
4 -> 5 [label=power] //[(0, 2)]
0 -> 6 [label=7] //[]
6 -> 7 [label=trailer] //[]
7 -> 8 [label=power] //[(0, 3)]
}



The constructed automaton looks like this:

Now that the automaton has been constructed, we start matching from each leaf up to the root of the AST.


RefactoringTool: Running automaton for each leaf
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:0, content is Leaf(7, '(')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:1, content is Leaf(8, ')')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:2, content is Leaf(1, 'power')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:3, content is Leaf(20, '<')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: All nodes above have been checked; moving on to next leaf
RefactoringTool: -------------------------


The above leaves produced no results; the first couple are orphans, and the rest reset the automaton to the start state. The automaton won't even try matching higher once it's been reset, since there is no way a match can be produced. The process goes on until it gets some matches:


RefactoringTool: Leaf ID:44, content is Leaf(1, 'buffer')
RefactoringTool: Name 'buffer' matches : Moving from state node 0 to state node 1
RefactoringTool: Token power matches : Moving from state node 1 to state node 2
RefactoringTool: Subpattern #1 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:45, content is Leaf(7, '(')
RefactoringTool: Token 7 matches : Moving from state node 0 to state node 6
RefactoringTool: Token trailer matches : Moving from state node 6 to state node 7
RefactoringTool: Token power matches : Moving from state node 7 to state node 8
RefactoringTool: Subpattern #3 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------


While matching, the automaton changes its state until it reaches a matching state; in that case the subpattern's id is added to the current AST node and the node is tested for a whole-fixer match. For example, node 2 of the automaton contains a trigger for subpattern #1 and node 8 for subpattern #3. Node 49 is now marked with both of them; if we can match subpattern #2 on the same node we get a full match.
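The bookkeeping described here can be sketched roughly like this (invented names, not the actual 2to3 code): each subpattern hit adds an id to the node's match set, and the fixer matches once the set is complete.

```python
def mark_and_check(match_sets, node_id, subpattern_id, required):
    """Record a subpattern hit on an AST node; report whether the fixer now matches."""
    hits = match_sets.setdefault(node_id, set())
    hits.add(subpattern_id)
    return hits == required

required = {1, 2, 3}    # say fix_buffer needs subpatterns #1, #2 and #3
match_sets = {}

# Subpatterns arrive in whatever order the leaves produce them:
assert not mark_and_check(match_sets, 49, 1, required)
assert not mark_and_check(match_sets, 49, 3, required)
assert mark_and_check(match_sets, 49, 2, required)   # full match on node 49
```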


RefactoringTool: -------------------------
RefactoringTool: Leaf ID:47, content is Leaf(8, ')')
RefactoringTool: Token 8 matches : Moving from state node 0 to state node 3
RefactoringTool: Token trailer matches : Moving from state node 3 to state node 4
RefactoringTool: Token power matches : Moving from state node 4 to state node 5
RefactoringTool: Subpattern #2 for fixer #0 matches for node 49
RefactoringTool: SUCCESS: All subpatterns of fixer #0 match for node 49.
RefactoringTool: The node's content is: 
Node(power, [Leaf(1, 'buffer'), Node(trailer, [Leaf(7, '('), Leaf(3, '"Hello World"'), Leaf(8, ')')])])
 which is effectively: 
buffer("Hello World")


The last subpattern for fix_buffer was matched for node #49, so the whole fixer matches. The node is sent to the old matcher for further processing.


RefactoringTool: Refactored test.py
--- test.py (original)
+++ test.py (refactored)
@@ -1,3 +1,3 @@
 import sys

-buffer("Hello World")
+memoryview("Hello World")
RefactoringTool: Not writing changes to test.py
RefactoringTool: Files that need to be modified:
RefactoringTool: test.py


This is roughly the whole process of the algorithm. The key point to note here is that the AST is traversed only once, using a combined-pattern automaton to match all fixers in a single pass. From an algorithmic point of view, the time complexity is O(n). Of course, the extra effort needed to mark subpatterns and check the match sets for each fixer somewhat reduces this benefit, but for a sufficiently large number of patterns there should be a significant speedup.
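A minimal sketch of the per-leaf run seen in the trace above (hypothetical structures, not the real module): climb parent links from each leaf, record a hit whenever a final state is reached, and give up as soon as no transition exists, since a reset automaton can never match higher up the same path.

```python
class Node:
    """A stand-in for an AST node with a parent pointer (assumed shape)."""
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent

def run_from_leaf(leaf, transitions, finals):
    """Run the automaton from one leaf towards the root; return subpattern hits."""
    state, node, hits = 0, leaf, []
    while node is not None:
        key = (state, node.label)
        if key not in transitions:
            break                           # reset to state 0: stop climbing
        state = transitions[key]
        if state in finals:
            hits.append(finals[state])      # a subpattern matched at this node
        node = node.parent
    return hits

# One branch of the fix_buffer automaton from the DOT dump above:
# 0 -'buffer'-> 1 -'power'-> 2, where state 2 is final for subpattern #1.
transitions = {(0, 'buffer'): 1, (1, 'power'): 2}
finals = {2: 1}
power = Node('power')
leaf = Node('buffer', parent=power)
assert run_from_leaf(leaf, transitions, finals) == [1]
```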


On the issue of alternatives, I must admit I'm not very happy with my progress, as the task has proved much harder than I thought. The code is still a mess and I had to rewrite a large portion. Unfortunately I'll need some more time to get this over with, but hopefully it will be done soon.

Thursday, July 8, 2010

Status report

Most of my effort over the last few days has gone into getting alternatives to work. This isn't as easy as I hoped; I've completed the part that pulls the subpatterns from the existing pattern tree, but the code is a mess and the matcher has to be updated. Part of the problem is maintaining the logical consistency of the algorithm while extending it. Anyway, at least I feel that the ugly part has been completed, and hopefully the alternatives functionality will be supported by the weekend. This will allow for some useful profiling results, as sufficiently complex patterns will be available for comparison, provided they do not contain negative patterns (e.g. fix_filter, fix_imports).

In any case, I'll also start working tomorrow on a demonstration of the current functionality, adding some tracing of the process as it was discussed on Tuesday's meeting to make things more clear.

Wednesday, June 30, 2010

Optimization changes, somewhat promising test results

I replaced the KeyError exception with a conditional inside the matcher's main loop; it turns out the conditional is actually slower. I also added a flag on each node to avoid revisiting higher nodes on each run, which produced a significant speedup since it reduces the comparisons by ~30%! Some new test results now look promising. I'm currently focusing on single-file runs since they are simpler. Running the command below:

time for((c=1;c<=20;c++)) do 
./test2to3.sh -f isinstance -f raise -f zip -f raw_input -f buffer \
  ../django-3k/django/core/servers/basehttp.py
done

yielded: 

real 0m10.324s 
user 0m9.709s 
sys 0m0.292s  

The results of a control run with the original code were these:
real 0m11.237s 
user 0m10.701s 
sys 0m0.308s  

(These were actually the best out of 10 runs, given time's inconsistency; the new code's results were still lower on average. I'll redo the tests with profile when I have more time.)

That's only a slight improvement, and there's a catch: basehttp.py is quite a large file. But I'm happy with the fact that it can at least run at roughly the same speed.

An issue I'm currently concerned with is that really common subpatterns may add significant overhead when they are matched. For example, )->trailer->power may be matched a few hundred times; this means that we'll have to update the match sets of perhaps 20-30 fixers, and check whether each fixer matches, every single time.

If we remove common subpatterns, it will probably speed things up (for example, in the case of raw_input we can throw away all subpatterns except the one starting with 'raw_input'); but then the matcher won't be 100% accurate (although only in very few cases on well-formed code). It can, however, send candidate nodes to the existing matcher (which is what it does right now anyway to fill the pattern variables), and that matcher can reject them. Plus this could also deal with negated patterns: just ignore them and the original matcher will do the work. I'll give it a try next week to see if it justifies the added complexity.
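The selection idea can be sketched as follows (the frequency table and helper are invented for illustration; real counts would come from profiling typical code): keep only the subpattern whose leading token is least common, and let the exact matcher reject the occasional false candidate later.

```python
# Assumed rough frequencies of tokens in typical Python code (made up
# for this sketch); rare tokens make the most selective subpatterns.
FREQUENCY = {')': 10000, 'trailer': 8000, 'power': 6000, 'raw_input': 3}

def characteristic(subpatterns):
    """Pick the subpattern whose first token is rarest in typical code."""
    return min(subpatterns, key=lambda path: FREQUENCY.get(path[0], 0))

raw_input_subpatterns = [
    [')', 'trailer', 'power'],     # matched a few hundred times per file
    ['raw_input', 'power'],        # almost never a false positive
]
assert characteristic(raw_input_subpatterns) == ['raw_input', 'power']
```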

Plus I changed the blog's template to something that won't hurt your eyes as much.

Monday, June 28, 2010

First version of the bottom matcher

I just committed the first draft version of the bottom matching algorithm. It includes the basic module that builds & runs the automaton, plus a separate module with utility functions.

What works:

  • raw_input, buffer etc. — all fixers that can have their wildcards reduced
  • isinstance, and all fixers with wildcards at the bottom


What doesn't:

  1. Alternatives: they seem easy to implement, will be added next week
  2. Wildcards at higher levels of the pattern: somewhat rare, left out for now (see the ws_comma fixer for an example)
  3. Negative patterns: matching them should prevent the whole pattern from being matched. They rely on alternatives, so they are scheduled after them


The code is a bit hackish at points, using module globals to grab the AST leaves etc. But as I said, it's a draft. No optimization effort has been made since this was mainly a proof of concept. I'm planning for a rewrite when most functionality is complete, and the algorithm is proven to be faster.

Still, I was a bit disappointed that it ran slightly slower than the current version. I was expecting at least a tiny improvement. I expect the automaton to pick up some speed when it's properly simulated by a table, as it is currently run by traversing linked nodes. Plus we can use threads to... OK, I'm just kidding:)

The code enters at refactor.py:RefactoringTool.refactor_string, where refactor_tree is replaced with the new refactor_tree_btm. Some (horrible) changes have been made to pytree.py. The code can be found at the repository: http://code.google.com/p/2to3-speedup2/ 

Tuesday, June 22, 2010

A preview of the fixer automaton and current progress

I hoped to have a prototype ready by this weekend, but school got in the way and I couldn't find the time to complete it. The automaton can currently add constant (i.e. wildcard-free) subpatterns to itself and run on a tree. I'm currently working on a function that converts a compiled pattern tree to its subpatterns in order to test more fixers, which turned out to be much harder than I thought, since the Base/Node/WildcardPattern objects were not really made for this. But I'm close. Anyway, I've made a graph of the current automaton for the raw_input fixer to make the process more clear:


The pattern was power< name='raw_input' trailer< '(' [any] ')' > any* >. The pattern was reduced (by hand) to its minimum subpatterns:



{ '('->trailer->power, raw_input->power, ')'->trailer->power }



from which the above automaton was built. Now consider adding the very similar pattern for the buffer fixer. The resulting automaton would look like this:



The automaton is run for each leaf, and when a subpattern is matched the current node is marked with its id. When all subpatterns of a fixer match on a single node, the whole fixer matches. The patterns for the other fixers will be added to the same automaton. There are some subtleties here: wildcard subpatterns need to be split at the wildcard positions and added as new subpatterns on all nodes starting with the next token, plus once at the root; alternatives must have their destination nodes combined in a single subpattern; repeaters with constant counts need to be matched multiple times, etc. But overall, I believe that each issue can be handled without restricting the current fixer syntax.
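As a rough sketch (hypothetical helper, not the project's code) of how such linear subpatterns are combined: the three raw_input paths above can be inserted into a shared trie-like automaton whose final states remember which subpattern they complete.

```python
def build_automaton(subpatterns):
    """Combine leaf-to-root label paths into one automaton sharing state 0."""
    transitions = {}   # (state, label) -> state
    finals = {}        # final state -> index of the subpattern it completes
    next_state = 1
    for i, path in enumerate(subpatterns):
        state = 0
        for label in path:
            key = (state, label)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals[state] = i
    return transitions, finals

# The hand-reduced raw_input subpatterns, written leaf-to-root:
raw_input_subpatterns = [
    ['raw_input', 'power'],
    ['(', 'trailer', 'power'],
    [')', 'trailer', 'power'],
]
transitions, finals = build_automaton(raw_input_subpatterns)
```

Adding the very similar buffer subpatterns would just reuse state 0 and grow new branches, which is how one automaton ends up covering every fixer at once.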

Sunday, June 13, 2010

Progress Report

Exams start tomorrow and I've spent most of this week studying. But I've had the chance to start coding a prototype of the matching algorithm I proposed. It is complete to the point where it simulates the raw_input fixer pattern. Profiling results look promising; for example, this one fixer took about 0.087 CPU seconds on a single file (preprocessing aside).

Of course running this single fixer with the current algorithm took about 0.018 seconds. But the new algorithm should behave much better when more fixer patterns are added, with processing time remaining close to the current level, so all in all I find this promising.

Let me explain. In theory, bottom-up tree traversal takes number_of_leaves * tree_height steps. If b is the maximum number of children per node and h is the height, this becomes (b^h)*h. Since h is about log n, the traversal is in the O(n log n) class. Of course, the constant factors depend on the characteristics of each tree. And since the automaton runs in constant time at each step, the number of comparisons is still O(n log n).
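A quick numeric sanity check of this estimate for a full binary tree (b = 2), where a tree of height h has about n = 2^(h+1) nodes and 2^h leaves:

```python
import math

# Bottom-up traversal visits each of the 2^h leaves and climbs h levels,
# i.e. leaves * h steps; the ratio of that to n*log2(n) should stay
# bounded by a constant (here it hovers around 1/2).
for h in (10, 15, 20):
    leaves = 2 ** h
    n = 2 ** (h + 1) - 1           # total number of nodes
    steps = leaves * h
    ratio = steps / (n * math.log2(n))
    assert 0.4 < ratio < 0.6
```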

The current version of the algorithm runs in O(n * pattern_nodes). Since pattern_nodes can be considered constant compared to n, this is O(n). Repeaters can add an overhead here when backtracking, depending on their position in the pattern tree. But let's consider this a constant factor too, just for simplicity (I believe they add a b^h * h factor as well; I haven't had time to check).

The important thing to note is that the current algorithm depends on the sizes of the patterns, which means that matching takes longer each time a pattern is added. The proposed algorithm is independent of the pattern sizes (except for the preprocessing part), which is why I expect this number to stay close to the current level.

While writing this prototype I came to the realization that it is unnecessary to store the leaf-to-root strings; we can simply keep a set of the tree's leaves and traverse upwards by asking each node for its parent. When a fixer transforms the tree, it currently marks the nodes that it changed. This can be extended to propagate the "changed" label down to the changed node's children, and we can use this information to recheck only their descendant leaves.
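A sketch of that realization (with an invented Node class standing in for pytree's nodes): keep just the set of leaves, and recover a bottom-up string on demand by following parent pointers instead of storing it.

```python
class Node:
    """Minimal stand-in for a pytree node (assumed shape)."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.was_changed = False    # set by fixers; lets us recheck selectively

def path_to_root(leaf):
    """Build the leaf-to-root label sequence by climbing parent pointers."""
    node, path = leaf, []
    while node is not None:
        path.append(node.label)
        node = node.parent
    return path

root = Node('file_input')
stmt = Node('simple_stmt', parent=root)
leaf = Node('print', parent=stmt)
assert path_to_root(leaf) == ['print', 'simple_stmt', 'file_input']
```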

A couple of other realizations: patterns with wildcards need to be split up; the subpatterns are then added to the automaton skipping x states, where x is the subpattern's position. I'm working on the details of the implementation. Second, there is great potential for threading in this algorithm: each thread can simulate the automaton on a different leaf in parallel, so this is something to consider in the future.

Monday, June 7, 2010

Some benchmarking results, and an implementation plan

Well, I'm still not able to run profile on a build of django-3k and I'm not sure what it is that I'm doing wrong. But anyway, at least I got it to work for a single file. I used feedgenerator.py from the django-3k codebase and here is the result, using gprof2dot:


No surprises here; the program spends most of its time in the pattern matching functions, as was already known (the green boxes at the bottom left). Since function calls are a significant overhead in Python, I'm still confident that a more iterative pattern matching algorithm that reduces the number of recursive calls will be able to speed things up. Anyway, as a first reference, here are the results of time:
Command being timed: "/usr/local/bin/python3 /home/giorgos/dev/gsoc/django-3k/setup.py build"
User time (seconds): 386.96
System time (seconds): 0.87
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:37.14
....
Note that I've built python with threads disabled, because they often mess up the profiler. But it shouldn't matter, results are always relative and threads are not used internally in 2to3.

Now, analyzing the algorithm I came up with, I'm pretty sure that theoretically it should offer no benefit compared to the existing one, i.e. I believe it runs the same number of comparisons. Of course preprocessing a few parts may help overall, but still. On the other hand, as I mentioned above I think a major issue in the existing algorithm is its recursive nature; the algorithm I propose may benefit from being a bit more "linear".

Anyway, running a few tests, it seems to me that the most (practically) computationally intensive part is obtaining the bottom-up strings, and doing so every time the tree is altered. Profiling showed that it can take up to 0.5 seconds to obtain the full set; that's quite a lot, and usually 1-2 fixers are applied per file, so multiply that by 2. That's not so bad, but I'll see in the future if there is a faster way to obtain it.

So a portion of the algorithm in pseudocode currently looks something like this:

get_bottom_strings:
  # Bottom strings will be lists of the form [(NodeId, TOKENTYPE), ...]
  for leaf in leaves:
    collect leaf.type + leaf.parent.bottom_string()  # recursively obtain the bottom-up string
  return bottom_strings

get_pattern_dict:
  for fixer in fixers:
    reduce_repeaters(fixer.pattern_tree)  # reduce repeaters to obtain a constant tree
        -> each WildcardPattern is reduced to its MIN; this may actually create nodes
    for leaf in fixer.pattern_tree.leaves:
      obtain the bottom-up string
  return pattern_dict
Then we use the pattern dictionaries to build the AC automaton. Now, simulating the AC automaton in Python may be too slow for our purposes, so perhaps a C version could be considered in the future, if that's an option. The automaton is constructed by adding one string at a time, as detailed in the Aho-Corasick paper, and nodes will be represented by an ACNode type with a dictionary as a transition table. On each final node, callbacks will be added to mark the position of the match. Finally, a match dictionary keyed by node id will be used to check whether a fixer has been matched at any point.
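A minimal sketch of that planned representation (assumed names; the real module may well differ): ACNode objects with dictionary transition tables, strings added one at a time, and callbacks fired at final nodes to mark matches. Failure links are omitted here to keep the sketch short.

```python
class ACNode:
    """One automaton state with a dict transition table (goto function only)."""
    def __init__(self):
        self.transition_table = {}
        self.callbacks = []          # fired when this node completes a string

def add_string(root, labels, callback):
    """Add one string to the automaton; register a callback on its final node."""
    node = root
    for label in labels:
        node = node.transition_table.setdefault(label, ACNode())
    node.callbacks.append(callback)

matches = []
root = ACNode()
add_string(root, ['print_stmt', 'simple_stmt'],
           lambda node_id: matches.append(node_id))

# Simulate the automaton on a bottom-up string:
node = root
for label in ['print_stmt', 'simple_stmt']:
    node = node.transition_table[label]
    for callback in node.callbacks:
        callback(42)                 # mark the match at (say) AST node 42
assert matches == [42]
```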

So that pretty much covers it. I'm at the stage of planning how to fit all these in the existing code. I hope I'll have some time this week(my term exams start on Monday) to push this further.

Tuesday, May 25, 2010

A bottom-up tree matching algorithm

Seeing a tree from the bottom up is a bit like 'unfolding' it. The tree structure is converted into a series of linear paths, each starting at a leaf and ending at the root.

So the first step in the algorithm is to create those leaf-to-root strings, by keeping a record of all leaves and propagating up the tree. In a typical Python file there are a few thousand leaves with an average height of 50-60 nodes. The process can have a significant overhead, since it must be run once for every file, but the final result may be worth it.

Now, the main idea behind this algorithm is reducing the search space. We can extract from each complex pattern a "minimum" set of constant pattern paths that we can use to mark candidate nodes. Say, for example, we have the same pattern tree for the print statement:



The minimum constant path in this case is simple_stmt->print_stmt, which we obtain by reducing the wildcard patterns to their minimum sufficient pattern (in this case, none). After reducing the repeater patterns we can obtain the path dictionary as in Algorithm D, only this time we write the paths backwards.

For simplicity, let's consider only the above pattern and the single path {print_stmt->simple_stmt}. We proceed by combining the constant patterns into a single Aho-Corasick automaton that resembles the one in Algorithm D. This part of the algorithm can be preprocessed and updated only when the fixer patterns change, and the resulting automaton can be pickled.

The resulting automaton is run on each leaf-to-root string and the nodes are marked when each constant path is matched. When the set of sufficient pattern paths has been matched, we run the existing matching algorithm on that node; this step is necessary because wildcard patterns still have to be matched exactly in order to return the correct content for each variable.

Is this process worth it? I haven't analyzed it yet. Traversing a tree from the bottom is inefficient, since we run into the same nodes again and again. On the other hand, checking linear patterns requires less time than checking tree patterns. Furthermore, by storing the paths as real immutable strings we can take advantage of the optimizations the interpreter provides, instead of making recursive calls to traverse the tree. Plus, preprocessing the fixer patterns into an automaton can save some time.

The next step will be to formally analyze the algorithm and obtain some preliminary profiling results.

Why Algorithm D won't work

When I was hastily writing my proposal for GSoC, I came upon Hoffmann & O'Donnell's paper "Pattern Matching in Trees". The paper's subject seemed to fit 2to3's matching problem perfectly, and I assumed that the algorithm they propose, Algorithm D, could somehow be extended to include repeaters like * and +. After further research, I realized I was wrong; their algorithm depends on pattern tree nodes having a fixed arity, i.e. no repeaters.

To explain this a bit further, Algorithm D works like this: The pattern tree is converted into a trie, then a dictionary of paths is formed from the trie and an Aho-Corasick automaton is created from the dictionary. The final states of the automaton define the matching state; when all path-states have been reached the pattern has been matched.


Here's a demonstration to make this clear:

This is a pattern tree and the resulting trie. The pattern will match any print_stmt that follows any other statement (any is a wildcard).

The path dictionary would be {simple_stmt 1, simple_stmt 2 print_stmt}. The resulting automaton is easily constructed from the trie and would look like this (fail transitions are omitted):


The numbers correspond to a fixed position in the tree, that is starting from root we follow transition 1 when we encounter the first child, transition 2 for the second and so on.
But how could this pattern, which matches a print_stmt in any position, be converted into an automaton?



I was confident that the problem could somehow be worked around and the automaton extended to include repeaters. But no matter how much I tried, I admit I'm out of wits here. Even more, I have started to think that a top-down automaton is not capable of matching repeaters, because doing so somehow resembles counting; think about it: if the tree were in linear form and had deep paths left and right, say

simple_stmt( ((((....)))), print_stmt, ((((...)))) )

then you'd have to keep count of the left and right parentheses: the classic case that exceeds the capabilities of finite automata.

Not all hope is lost, however; I have thought of a different algorithm that still uses the Aho-Corasick automaton, but in a bottom-up manner. More details will follow in the next article.
