Sunday, August 15, 2010

Code committed to the sandbox and an overview of the project

After porting the code to 2.x, which proved to be a somewhat tricky process, the code was committed to the 2to3 sandbox repository. This concludes the project on my part, although I will still be around to provide support for the new code, at least until it reaches a sufficient level of maturity.

The project was overall a great experience for me; the proposal was challenging on both the algorithmic and the coding side. What I've really come to realize is how much the theoretical description of a problem can differ from its practical applications. Matching tree regular expressions efficiently in 2to3's context went well beyond what the research articles I had in mind at first offered, and implementing a new algorithm that retains its soundness while optimizing it was a quite stressful, but rewarding, experience. Most of the effort was spent on preserving compatibility with the previous matcher; the fixers were written to match and transform code on the assumption of a pre/post-order traversal of the AST, and various issues had to be handled on a case-by-case basis.

Although the benefits proved substantial, the effort taken and the complexity of the process often made me wonder whether the whole concept is worth a few machine cycles; admittedly, as is also the case with 2to3 itself, optimizations suffer from a low impressiveness/complexity ratio that tends to drive people away from involvement, and involved people to insanity. But in conclusion, I'm proud (and, to be honest, still find it hard to believe) that this thing actually worked.

Wednesday, July 21, 2010

Support for all default fixers and some great results!

The "characteristic path" approach proved excellent. Not only did it prove sufficient for alternatives support, it also greatly simplified and sped up the matcher. For one thing, it eliminated the need for wildcards; it turns out that any meaningful pattern always seems to have a constant part that can be matched instead. Negated patterns are skipped as well. The results from running the new 2to3 on django-3k were:

real 1m39.585s
user 1m36.218s
sys 0m0.660s

The control run took:

real 3m18.960s
user 3m6.820s
sys 0m0.920s

That's around a 50% reduction in running time. However, there are still some bugs, and the output from the old and the new algorithm doesn't match 100% (although the differences show up in relatively few cases). Most of them look minor and I'll try fixing them this week. Re-transformed code is something of an issue; the current approach uses the old matcher for the transformed nodes (which adds around 10 seconds). And in some cases the old algorithm seems to be in error, changing for example x.values() to list(list(x.values())) instead of list(x.values()) (although I may be wrong here; I'll check again with other versions).

'any' at higher levels is not supported, so the two (non-default) fixers that use it won't work. I'm not sure if they are worth it, since supporting them may complicate the code and slow down the matcher. I'll look into that too.

I've uploaded the code to the repository. You can test it using the instructions from the previous post (although there are now no debugging messages). The code is a mess of course, and I'll try to clean it up over the next few days. But overall I'm really satisfied with these results; I think I've made some good progress and there is still plenty of time for more optimizations. Off for a beer.

Tuesday, July 20, 2010

Progress for this week

Just a quick update before today's meeting. I've spent the last week trying approaches to support alternatives, and I had to start from scratch twice. I certainly won't use the word easy so freely again.

First of all, I realized that it's impossible to deduce the full pattern structure reliably from the compiled pattern, so I spent much of my time rewriting this part to deduce the subpatterns from the intermediate tree that the pattern compiler generates. There were some challenges, but I think I've reached the point where patterns can be extracted reliably.

Now, my first approach was converting a pattern with alternatives into multiple simpler patterns, creating b^n new patterns, n being the nesting depth and b the number of alternatives per level of nesting. Needless to say, this quickly exploded the number of patterns, so I abandoned it.
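To see why this blows up, here's a toy sketch; the nested-list pattern representation below is my own illustration, not 2to3's internal one. Each group of alternatives multiplies the number of simple patterns, so b alternatives at each of n positions yield b^n patterns:

```python
from itertools import product

def expand(pattern):
    """Recursively expand a pattern with alternatives into all simple
    (alternative-free) patterns.  A pattern is a list whose items are
    either token strings or tuples of alternatives."""
    choices = []
    for item in pattern:
        if isinstance(item, tuple):          # a group of alternatives
            sub = []
            for alt in item:
                sub.extend(expand(alt) if isinstance(alt, list) else [[alt]])
            choices.append(sub)
        else:                                # a plain token
            choices.append([[item]])
    # The cartesian product of the choices enumerates every simple pattern.
    return [sum(combo, []) for combo in product(*choices)]

# Two alternatives at each of three positions -> 2**3 = 8 simple patterns.
pattern = [('a', 'b'), ('c', 'd'), ('e', 'f')]
print(len(expand(pattern)))  # 8
```

With realistic fixer patterns the nesting goes deeper, so the count grows far faster than the handful of states that multiple transitions would need.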

The second approach was adding multiple transitions at nodes that correspond to alternatives. This approach is a bit more complex; for one thing, there are cases where a single pattern is an alternative to a group of patterns, e.g. (a | b c). This means that 'a' is equivalent to matching 'b' and 'c'; this goes beyond the automaton's capabilities, since we need to count. The same happens for (a | b<c>), which requires extra nodes, let alone (a | b<c> d<e>) and so on. And on top of all these, any of them could be a wildcard. The cases may be few, but there are enough of them. Attempting to implement this approach was frustrating; there were always more and more cases that kept breaking the code. Plus, the complexity of the whole thing gave me the impression that it would really slow down the matcher.

So after thinking this over, I realized that the 'characteristic pattern' concept I had mentioned could simplify alternatives support, since only a single linear pattern would be matched, with multiple transitions for each group of alternatives, eliminating most of the complexity of the previous approach. At this point the code is already able to extract these single patterns; it turns out that it is much easier than I thought at first, and the results look pretty good. Most fixer patterns contain either an uncommon keyword or an uncommon character, so the simple approach of preferring those subpatterns yields satisfying results. For example, the result for the fix_filter pattern is:

[(['lambda', 300, 259, 332, 308], ['filter', 308], ['filter', 308])]

The tuple means 'a group of alternatives'. The last two alternatives can be merged, so we are left with only two. Using these subpatterns to match this fairly complex fixer means that we only check the cases that start with these not-so-common keywords, 'lambda' and 'filter'. Now, as far as alternatives are concerned, the logic behind this is to add those subpatterns as if they weren't alternatives, and merge their final states. Having a single subpattern to match also skips the step of checking for a complete match set for a fixer, since a single match is now adequate.
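The "add the subpatterns as if they weren't alternatives and merge their final states" step could look roughly like this. The class below is a simplified stand-in for the real automaton node; names like transition_table and fixers are my own choices, not necessarily those of the committed code:

```python
class BMNode:
    """A state of the bottom-up automaton (simplified sketch)."""
    count = 0

    def __init__(self):
        self.transition_table = {}   # token -> BMNode
        self.fixers = []             # (fixer_id, subpattern_id) triggers
        self.id = BMNode.count
        BMNode.count += 1

def add_alternatives(root, alternatives, fixer_id, subpattern_id):
    """Insert each alternative linear subpattern as a chain of states,
    sharing common prefixes, then put the same fixer trigger on every
    chain's last state -- effectively merging the accepting states."""
    finals = []
    for alt in alternatives:
        node = root
        for token in alt:
            node = node.transition_table.setdefault(token, BMNode())
        finals.append(node)
    for node in finals:
        node.fixers.append((fixer_id, subpattern_id))
    return finals

root = BMNode()
finals = add_alternatives(root, [['lambda', 'power'], ['filter', 'power']],
                          fixer_id=0, subpattern_id=1)
```

Whichever alternative fires, the automaton lands in a state carrying the same (fixer, subpattern) trigger, so the rest of the matcher never needs to know alternatives were involved.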

Of course this means that the accuracy of the matcher will be somewhat reduced and more candidate nodes will be marked for further matching, e.g. buffer('test') but also buffer.test(); but obviously this is not common, and from what I've seen so far it won't be much of a concern. A benefit of using an inexact matcher is that it simplifies the handling of negative patterns: they are treated as wildcards, and the actual matcher is responsible for sorting them out.

So, overall, I think that I'm finally getting closer to a matcher that works for almost all fixers (with the exception of the explicit ws_comma and idioms fixers, which use wildcards at higher levels).

Monday, July 12, 2010

Instructions for running the new matcher algorithm so far

I've added some logging messages to the bottom matcher module to illustrate the matching process.

Now, in order to test the new module:
1) Check out a local copy:
svn checkout http://2to3-speedup2.googlecode.com/svn/trunk/ 2to3
2) Create a small file for testing. Add a few statements plus a statement that matches a fixer.
For example:

#test.py
import sys

buffer("Hello World")

(we'll try to match fix_buffer). More complex files can be tested of course, but it's harder to keep track of the process for a larger file.
3) Run ./test2to3.sh with -f options for the currently working fixers (buffer, raw_input, isinstance, raise, etc.). Add the -v option to get the tracing messages.


For example, run:

./test2to3.sh -f buffer test.py -v

First, the automaton will be printed in DOT format. You can paste the result into http://graph.gafol.net/ to get a quick rendering.

This is the output from running the above command (some of it is omitted):



RefactoringTool: Adding transformation: buffer
RefactoringTool: Refactoring test.py
RefactoringTool: Adding fixer patterns to the automaton
RefactoringTool: The automaton is printed below in DOT
digraph g{
0 -> 1 [label=buffer] //[]
1 -> 2 [label=power] //[(0, 1)]
0 -> 3 [label=8] //[]
3 -> 4 [label=trailer] //[]
4 -> 5 [label=power] //[(0, 2)]
0 -> 6 [label=7] //[]
6 -> 7 [label=trailer] //[]
7 -> 8 [label=power] //[(0, 3)]
}



The constructed automaton looks like this (rendered image omitted; it corresponds to the DOT output above).

Now that the automaton has been constructed, we start matching from each leaf up to the root of the AST.


RefactoringTool: Running automaton for each leaf
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:0, content is Leaf(7, '(')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:1, content is Leaf(8, ')')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:2, content is Leaf(1, 'power')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:3, content is Leaf(20, '<')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: All nodes above have been checked; moving on to next leaf
RefactoringTool: -------------------------


The above leaves produced no results; the first couple are orphans, and the rest reset the automaton to the start state. The automaton won't even try matching higher once it's been reset, since there is no way a match can be produced. The process goes on until it gets some matches:
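The leaf-to-root run with that early reset can be sketched as below. The automaton is reduced here to a bare transition dict encoding the buffer -> power chain from the DOT dump, so this is an illustration of the idea rather than the actual module:

```python
def run_from_leaf(transitions, accepting, leaf_path):
    """Walk from a leaf up through its ancestors' types, following
    automaton transitions.  On a missing transition the automaton
    resets to state 0 and gives up on this leaf, since no match can
    complete further up the same path."""
    state = 0
    matches = []
    for token in leaf_path:              # leaf content, then ancestor types
        if (state, token) not in transitions:
            return matches               # reset: stop climbing this path
        state = transitions[(state, token)]
        if state in accepting:           # a subpattern trigger fired
            matches.append(accepting[state])
    return matches

# States 0 -> 1 -> 2 for the 'buffer' subpattern, as in the DOT output.
transitions = {(0, 'buffer'): 1, (1, 'power'): 2}
accepting = {2: ('fixer 0', 'subpattern 1')}
```

Running it on the path from the 'buffer' leaf (['buffer', 'power']) yields the trigger, while a path starting at '(' fails immediately, mirroring the "moving automaton back to state 0" lines in the trace.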


RefactoringTool: Leaf ID:44, content is Leaf(1, 'buffer')
RefactoringTool: Name 'buffer' matches : Moving from state node 0 to state node 1
RefactoringTool: Token power matches : Moving from state node 1 to state node 2
RefactoringTool: Subpattern #1 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:45, content is Leaf(7, '(')
RefactoringTool: Token 7 matches : Moving from state node 0 to state node 6
RefactoringTool: Token trailer matches : Moving from state node 6 to state node 7
RefactoringTool: Token power matches : Moving from state node 7 to state node 8
RefactoringTool: Subpattern #3 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------


While matching, the automaton changes its state until it reaches a matching state; in that case a subpattern's id is added to the current AST node and the node is tested for a whole fixer match. For example, node 2 of the automaton contained a trigger for subpattern #1, and node 8 for subpattern #3. Now node 49 is marked with both of them; if we can match subpattern #2 on the same node, we get a full match.
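That bookkeeping (marking nodes with subpattern ids and declaring a full match once a fixer's whole set is present) can be sketched with sets; the function and names below are hypothetical, not taken from the committed code:

```python
def note_subpattern_match(node_marks, required, node_id, fixer_id, subpattern_id):
    """Record that subpattern_id of fixer_id matched on node_id; return
    True once every required subpattern of that fixer has matched there."""
    marks = node_marks.setdefault(node_id, set())
    marks.add((fixer_id, subpattern_id))
    matched_for_fixer = {s for f, s in marks if f == fixer_id}
    return required[fixer_id] <= matched_for_fixer

# fix_buffer (fixer #0) splits into three subpatterns, per the trace above.
required = {0: {1, 2, 3}}
node_marks = {}
assert not note_subpattern_match(node_marks, required, 49, 0, 1)  # partial
assert not note_subpattern_match(node_marks, required, 49, 0, 3)  # partial
assert note_subpattern_match(node_marks, required, 49, 0, 2)      # SUCCESS
```

The three calls mirror the trace: subpatterns #1 and #3 mark node 49 first, and subpattern #2 completes the set.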


RefactoringTool: -------------------------
RefactoringTool: Leaf ID:47, content is Leaf(8, ')')
RefactoringTool: Token 8 matches : Moving from state node 0 to state node 3
RefactoringTool: Token trailer matches : Moving from state node 3 to state node 4
RefactoringTool: Token power matches : Moving from state node 4 to state node 5
RefactoringTool: Subpattern #2 for fixer #0 matches for node 49
RefactoringTool: SUCCESS: All subpatterns of fixer #0 match for node 49.
RefactoringTool: The node's content is: 
Node(power, [Leaf(1, 'buffer'), Node(trailer, [Leaf(7, '('), Leaf(3, '"Hello World"'), Leaf(8, ')')])])
 which is effectively: 
buffer("Hello World")


The last subpattern for fix_buffer was matched for node #49, so the whole fixer matches. The node is sent to the old matcher for further processing.


RefactoringTool: Refactored test.py
--- test.py (original)
+++ test.py (refactored)
@@ -1,3 +1,3 @@
 import sys

-buffer("Hello World")
+memoryview("Hello World")
RefactoringTool: Not writing changes to test.py
RefactoringTool: Files that need to be modified:
RefactoringTool: test.py


This is roughly the whole process of the algorithm. The key point to note here is that the AST is traversed only once, using a combined-pattern automaton to match all fixers in a single pass. From an algorithmic point of view, the time complexity is O(n). Of course, the extra effort needed to mark subpatterns and check the match sets for each fixer somewhat reduces this benefit, but for a sufficiently large number of patterns there should be a significant speedup.
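As a toy illustration of the asymptotic point (not the real matcher): matching each fixer's pattern separately touches every node once per fixer, while a combined automaton touches every node once, advancing all fixers in the same step:

```python
def visits_per_fixer(num_nodes, num_fixers):
    """Naive scheme: each fixer independently scans every node -> O(n * f)."""
    visits = 0
    for _ in range(num_fixers):
        for _ in range(num_nodes):
            visits += 1          # one pattern test per node per fixer
    return visits

def visits_combined(num_nodes, num_fixers):
    """Combined-automaton scheme: one scan serves all fixers -> O(n)."""
    visits = 0
    for _ in range(num_nodes):
        visits += 1              # all fixers advanced in a single step
    return visits

print(visits_per_fixer(100, 20))   # 2000
print(visits_combined(100, 20))    # 100
```

The per-node work in the combined scheme is not quite constant (match sets still have to be updated), which is the caveat noted above.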


On the issue of alternatives, I must admit I'm not very happy with my progress, as the task has proved much harder than I thought. The code is still a mess and I had to rewrite a large portion of it. Unfortunately, I'll need some more time to get this over with, but hopefully it will be done soon.

Thursday, July 8, 2010

Status report

Most of my effort over the last few days has gone into getting alternatives to work. This isn't as easy as I hoped; I've completed the part that pulls the subpatterns from the existing pattern tree, but the code is a mess and the matcher has to be updated. Part of the problem is maintaining the logical consistency of the algorithm while extending it. Anyway, at least I feel that the ugly part has been completed, and hopefully the alternatives functionality will be supported by the weekend. This will allow for some useful profiling results, as sufficiently complex patterns will be available for comparison, provided they do not contain negative patterns (e.g. fix_filter, fix_imports, etc.).

In any case, I'll also start working tomorrow on a demonstration of the current functionality, adding some tracing of the process, as discussed in Tuesday's meeting, to make things clearer.

Wednesday, June 30, 2010

Optimization changes, somewhat promising test results

I replaced the KeyError exception with a conditional inside the matcher's main loop; it turns out the conditional is actually slower. I also added a flag on each node to avoid revisiting higher nodes on each run, which brought a significant speedup, since it reduces the comparisons by ~30%! Some new test results now look promising. I'm currently focusing on single-file runs since they are simpler. Running the command below:

time for((c=1;c<=20;c++)) do 
./test2to3.sh -f isinstance -f raise -f zip -f raw_input -f buffer \
  ../django-3k/django/core/servers/basehttp.py
done

yielded: 

real 0m10.324s 
user 0m9.709s 
sys 0m0.292s  

The results of a control run with the original code were these:

real 0m11.237s 
user 0m10.701s 
sys 0m0.308s  

(These were actually the best out of 10 runs, given time's inconsistency; results were still lower on average. I'll redo the tests with profile when I have more time.)

That's only a slight improvement, and there's a catch: basehttp.py is quite a large file. But I'm happy with the fact that it can at least run at roughly the same speed.
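On the try/except versus conditional point above: which wins generally depends on the hit rate, since a Python try block costs almost nothing when no exception is raised, while a membership test or .get() pays a little on every lookup. A minimal harness for measuring it (timings are machine-dependent, so none are asserted here):

```python
import timeit

table = {('state', i): i for i in range(10)}

def with_exception(key):
    # Cheap when the key is present; the exception machinery only
    # kicks in on a miss.
    try:
        return table[key]
    except KeyError:
        return None

def with_conditional(key):
    # dict.get avoids exceptions entirely, at a small cost per call.
    return table.get(key)

hit, miss = ('state', 3), ('state', 999)
for fn in (with_exception, with_conditional):
    t_hit = timeit.timeit(lambda: fn(hit), number=100_000)
    t_miss = timeit.timeit(lambda: fn(miss), number=100_000)
    print(fn.__name__, round(t_hit, 4), round(t_miss, 4))
```

In a matcher loop where lookups almost always succeed, the try/except version plausibly comes out ahead, consistent with the observation above.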

An issue I'm currently concerned with is that really common subpatterns may add significant overhead when they are matched. For example, )->trailer->power may be matched a few hundred times; this means that we'll have to update the match sets of perhaps 20-30 fixers, and check whether each fixer matches, every single time.

If we remove common subpatterns, it will probably speed things up (for example, in the case of raw_input we can throw away all subpatterns except the one starting with 'raw_input'); but then the matcher won't be 100% accurate (although it will err only in very few cases on well-formed code). It can, however, send candidate nodes to the existing matcher (which is what it does anyway right now to fill the pattern variables), and that matcher can reject them. Plus, this could also deal with negated patterns: just ignore them and the original matcher will do the work. I'll give it a try next week to see if it justifies the added complexity.
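The idea of keeping only the rarest subpattern could be sketched as follows; the frequency table here is invented for illustration (real numbers would come from counting tokens in a corpus of source files):

```python
from collections import Counter

# Hypothetical frequencies of subpattern start tokens in typical code.
token_frequency = Counter({')': 500, 'trailer': 450, 'power': 400,
                           'raw_input': 2})

def characteristic_subpattern(subpatterns):
    """Keep only the subpattern whose first token is rarest.  Matching
    just that one makes the matcher inexact, and the original matcher
    is left to reject any false candidates."""
    return min(subpatterns, key=lambda sp: token_frequency[sp[0]])

subs = [[')', 'trailer', 'power'], ['raw_input', 'power']]
print(characteristic_subpattern(subs))  # ['raw_input', 'power']
```

The common )->trailer->power chain is discarded, so the hundreds of spurious partial matches it would generate never reach the match-set bookkeeping.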

Plus I changed the blog's template to something that won't hurt your eyes as much.

Monday, June 28, 2010

First version of the bottom matcher

I just committed the first draft version of the bottom matching algorithm. It includes the basic module that builds & runs the automaton, plus a separate module with utility functions.

What works:

  • raw_input, buffer, etc.: all fixers that can have their wildcards reduced
  • isinstance, and all fixers with wildcards at the bottom


What doesn't:

  1. Alternatives: they seem easy to implement and will be added next week
  2. Wildcards at higher levels of the pattern: somewhat rare, left out for now (see ws_comma for a case)
  3. Negative patterns: matching them should prevent the whole pattern from being matched. They rely on alternatives, so they are scheduled after them


The code is a bit hackish at points, using module globals to grab the AST leaves, etc. But as I said, it's a draft. No optimization effort has been made, since this was mainly a proof of concept. I'm planning a rewrite when most functionality is complete and the algorithm is proven to be faster.

Still, I was a bit disappointed that it ran slightly slower than the current version. I was expecting at least a tiny improvement. I expect the automaton to pick up some speed when it's properly simulated by a table, as it is currently run by traversing linked nodes. Plus we can use threads to... OK, I'm just kidding :)
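The table simulation mentioned here could look roughly like this: assign integer ids to the linked states and flatten all transitions into a single dict, so each step is one lookup on a (state, token) pair instead of an attribute access plus a per-node dict lookup. The node class below is a minimal stand-in, not the committed code:

```python
class Node:
    """Minimal linked automaton state for the sketch."""
    def __init__(self):
        self.transition_table = {}   # token -> Node

def tabulate(root):
    """Assign integer ids to the states reachable from root and flatten
    every transition into one {(state_id, token): state_id} dict."""
    ids = {id(root): 0}
    table = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for token, child in node.transition_table.items():
            if id(child) not in ids:
                ids[id(child)] = len(ids)
                stack.append(child)
            table[(ids[id(node)], token)] = ids[id(child)]
    return table

# buffer -> power chain, as in the automaton dumps above.
root, a, b = Node(), Node(), Node()
root.transition_table['buffer'] = a
a.transition_table['power'] = b
table = tabulate(root)
print(table)  # {(0, 'buffer'): 1, (1, 'power'): 2}
```

Whether this actually buys much depends on how the linked version is written, but it removes one level of indirection from the hottest loop.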

The code enters at refactor.py:RefactoringTool.refactor_string, where refactor_tree is replaced with the new refactor_tree_btm. Some (horrible) changes have been made to pytree.py. The code can be found in the repository: http://code.google.com/p/2to3-speedup2/
