George's GSOC Blog: July 2010

Wednesday, July 21, 2010

Support for all default fixers and some great results!

The "characteristic path" approach proved excellent. Not only it proved sufficient for alternatives support, it also greatly simplified and speeded up the matcher. For one thing it eliminated the need for wildcards; it turns out that any meaningful pattern always seems to have a constant part that can be matched instead. Negated patterns are skipped as well. The results from running the new 2to3 on django-3k were:

real 1m39.585s
user 1m36.218s
sys 0m0.660s

The control run took:

real 3m18.960s
user 3m6.820s
sys 0m0.920s

That's around a 50% speedup. However there are still some bugs, and the output from the old and new algorithm won't match 100%(although in relatively few cases). Most of them look minor and I'll try fixing them this week. Re-transformed code is kind of an issue, and the current approach uses the old matcher for the transformed nodes(which adds around 10sec). And in some cases the old algorithm seems to be in error, changing for example x.values() to list(list(x.values())) instead of list(x.values()) (although I may be wrong here, I'll check again with other versions).

'any' at higher levels is not supported, so the two(non-default) fixers that use won't work. I'm not sure if they are worth it, since supporting them may complicate the code and slow down the matcher. I'll look into that too.

I've uploaded the code to the repository. You can test it using the instructions from the previous post(although there are now no debugging messages). The code is a mess of course and I'll try to clean it up the next days. But overall I'm really satisfied with these results, I think I've made some good progress and there is still plenty of time for more optimizations. Off for a beer.

Tuesday, July 20, 2010

Progress for this week

Just a quick update before today's meeting, I've spent the last week trying approaches to support alternatives and I had to start from scratch twice. I'll certainly won't use the word easy so freely again.

First of all realized that it's impossible to deduct the full pattern structure reliably from the compiled pattern, so spent much of my time into rewriting this part to deduct the subpatterns from the intermediate tree that the pattern compiler generates. There were some challenges, but I think I've reached the point where patterns can be extracted reliably.

Now, my first approach was converting a pattern with alternatives into multiple simpler patterns, creating n^b new patterns, n being the nesting factor and b the number of alternatives per level of nesting. Needless to say that this quickly exploded the number of patterns so I quickly abandoned it.

The second approach was adding multiple transitions at nodes that correspond to alternatives. This approach is a bit more complex; for one thing there are cases where a single pattern is alternative to a group of patterns

i.e. ( a | b c). This means that 'a' is equivalent to matching 'b' and 'c'; this goes over the automaton's capabilities since we need to count. The same happens for (a | b<c >), which requires extra nodes, let alone (a | b<c> d<e>) and so on. And on top of all these, any of them could be a wildcard. The cases may be few, but there are enough. Attempting to implement this approach was frustrating; there were always more and more cases that kept breaking the code. Plus the complexity of the whole thing gave me the impression that it would really slow down the matcher.

So after thinking this over, I've realized that by using the 'characteristic pattern' concept I had mentioned could simplify alternatives support, since only a single linear pattern will be matched with multiple transitions for each group of alternatives, eliminating most of the complexity of the previous approach. At this point the code is already able to extract this single patterns; it turns out that it is much easier than I thought at first and the results look pretty good. Most fixer patterns contain either an uncommon keyword or character so the simple approach of preferring these subpatterns yields satisfying results. For example the result for the fix_filter pattern is:

[(['lambda', 300, 259, 332, 308], ['filter, 308'], ['filter', 308])]

The tuple means 'a group of alternatives'. The last two alternatives can be merged, so we are left with only two. Using these subpatterns to match this fairly complex fixer means that we only check the cases that start with these not-so-common keywords, 'lambda' and 'filter'. Now, as far as alternatives are concerned the logic behind this is to add those subpatterns as if they weren't alternatives, and merge their final states. Having a single subpattern to match also skips the step of checking for a complete match set for a fixer, since a single match is now adequate.

Of course this means that the accuracy of the matcher will be somewhat reduced and more candidate nodes will be marked for further matching, i.e. buffer('test') but also buffer.test(); but obviously this is not common and from what I've seen so far it won't be much of a concern. A benefit of using a non-accurate matcher is that it simplifies the handling of negative patterns: they are treated as wildcards and the actual matcher is responsible for sorting them out.

So, overall, I think that I'm finally getting closer to a matcher that works for almost all fixers(with the exception of the explicit ws_comma and idioms fixers, which use wildcards at higher levels).

Monday, July 12, 2010

Instructions for running the new matcher algorithm so far

I've added some logging messages to the bottom matcher module to illustrate the matching process.

Now, in order to test the new module:
1) Checkout a local copy:
svn checkout http://2to3-speedup2.googlecode.com/svn/trunk/ 2to3
2) Create a small file for testing. Add a few statements plus a statement that matches a fixer.
For example:

#test.py
import sys

buffer("Hello World")

(we'll try to match fix_buffer). More complex files can be tested of course, but it's harder to keep track of the process for a larger file.
3)Run ./test2to3.sh with -f options for the current working fixers(buffer, raw_input, isinstance, raise etc.). Add the -v option to get the tracing messages.

For example run:

./test2to3.sh -f buffer test.py -v

First the automaton will be printed in DOT format. You can paste the result in http://graph.gafol.net/ to get a quick rendering.

This is the output from running the above command(some is omitted):

RefactoringTool: Adding transformation: buffer
RefactoringTool: Refactoring test.py
RefactoringTool: Adding fixer patterns to the automaton
RefactoringTool: The automaton is printed below in DOT
digraph g{
0 -> 1 [label=buffer] //[]
1 -> 2 [label=power] //[(0, 1)]
0 -> 3 [label=8] //[]
3 -> 4 [label=trailer] //[]
4 -> 5 [label=power] //[(0, 2)]
0 -> 6 [label=7] //[]
6 -> 7 [label=trailer] //[]
7 -> 8 [label=power] //[(0, 3)]
}

The constructed automaton looks like this:

Now the automaton has been constructed and we start matching from each leaf up to the root of the AST.

RefactoringTool: Running automaton for each leaf
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:0, content is Leaf(7, '(')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:1, content is Leaf(8, ')')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:2, content is Leaf(1, 'power')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:3, content is Leaf(20, '<')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: All nodes above have been checked; moving on to next leaf
RefactoringTool: -------------------------

The above leaves produced no results; the first couple are orphans, and the rest reset the automaton to the start state. The automaton won't even try matching higher when it's been reset, since there is no way a match can be produced. The process goes on until it gets some matches:

RefactoringTool: Leaf ID:44, content is Leaf(1, 'buffer')
RefactoringTool: Name 'buffer' matches : Moving from state node 0 to state node 1
RefactoringTool: Token power matches : Moving from state node 1 to state node 2
RefactoringTool: Subpattern #1 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:45, content is Leaf(7, '(')
RefactoringTool: Token 7 matches : Moving from state node 0 to state node 6
RefactoringTool: Token trailer matches : Moving from state node 6 to state node 7
RefactoringTool: Token power matches : Moving from state node 7 to state node 8
RefactoringTool: Subpattern #3 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------

While matching the automaton changes it's state until it reaches a matching state; in that case a subpattern's id is added to the current AST node and the node is tested for a whole fixer's match. For example, node 2 of the automaton contained a trigger for a subpattern #1 and node 8 for subpattern #3. Now node 49 is marked with both of them; if we can match subpattern #2 on the same node we get a full match.

RefactoringTool: -------------------------
RefactoringTool: Leaf ID:47, content is Leaf(8, ')')
RefactoringTool: Token 8 matches : Moving from state node 0 to state node 3
RefactoringTool: Token trailer matches : Moving from state node 3 to state node 4
RefactoringTool: Token power matches : Moving from state node 4 to state node 5
RefactoringTool: Subpattern #2 for fixer #0 matches for node 49
RefactoringTool: SUCCESS: All subpatterns of fixer #0 match for node 49.
RefactoringTool: The node's content is:
Node(power, [Leaf(1, 'buffer'), Node(trailer, [Leaf(7, '('), Leaf(3, '"Hello World"'), Leaf(8, ')')])])
which is effectively:
buffer("Hello World")

The last subpattern for fix_buffer was matched for node #49, so the whole fixer matches. The node is sent to the old matcher for further processing.

RefactoringTool: Refactored test.py
--- test.py (original)
+++ test.py (refactored)
@@ -1,3 +1,3 @@
import sys

-buffer("Hello World")
+memoryview("Hello World")
RefactoringTool: Not writing changes to test.py
RefactoringTool: Files that need to be modified:
RefactoringTool: test.py

This is roughly the whole process of the algorithm. The key point to note here is that the AST is traversed only once, using a combined-pattern automaton to match all fixers in a single pass. From an algorithmic point of view, the time complexity is O(n). Of course the extra effort needed to mark subpatterns and check the match sets for each fixer reduces somewhat this benefit, but for a sufficiently large number of patterns there should be a significant speedup.

On the issue of alternatives, I must admit I'm not very happy with my progress, as the task has proved much harder than I thought. The code is still a mess and I had to rewrite a large portion. Unfortunately I'll need some more time to get this over with, but hopefully it will be done soon.

Thursday, July 8, 2010

Status report

Most of my effort the last few days has been getting alternatives to work. This isn't as easy as I hoped; I've completed the part of pulling the subpatterns from the existing pattern tree, but the code is a mess and the matcher has to be updated. Part of the problem is maintaining the logical consistency of the algorithm while extending it. Anyway, at least I feel that the ugly part has been completed, and hopefully the alternatives functionality will be supported by the weekend. This will allow for some useful profiling results as sufficiently complex patterns will be available for comparison, provided they do not contain negative patterns(e.g. fix_filter, fix_imports etc.).

In any case, I'll also start working tomorrow on a demonstration of the current functionality, adding some tracing of the process as it was discussed on Tuesday's meeting to make things more clear.

George's GSOC Blog