Sunday, August 15, 2010
Code committed to the sandbox and an overview of the project
The project was overall a great experience for me; the proposal was challenging on both the algorithmic and the coding side. What I've really come to realize is how much the theoretical description of a problem can differ from its practical applications. Matching tree regular expressions efficiently in 2to3's context went well beyond what the research articles I had in mind at first offered, and implementing a new algorithm that retains its soundness while optimizing it was a quite stressful, but rewarding, experience. Most of the effort was spent on preserving compatibility with the previous matcher; the fixers were written to match and transform code on the assumption of a pre/post-order traversal of the AST, and various issues had to be handled on a case-by-case basis.
Although the benefits proved substantial, the effort taken and the complexity of the process often made me wonder whether the whole concept is worth a few machine cycles; admittedly, as is also the case with 2to3 itself, optimizations suffer from a low impressiveness/complexity ratio that tends to drive people away from involvement, and involved people to insanity. But in conclusion, I'm proud (and, to be honest, still find it hard to believe) that this thing actually worked.
Wednesday, July 21, 2010
Support for all default fixers and some great results!
Running with the new matcher took:
real 1m39.585s
user 1m36.218s
sys 0m0.660s
The control run took:
real 3m18.960s
user 3m6.820s
sys 0m0.920s
That's around a 50% speedup. However, there are still some bugs, and the output from the old and new algorithms doesn't match 100% (although only in relatively few cases). Most of the differences look minor and I'll try fixing them this week. Re-transformed code is something of an issue, and the current approach uses the old matcher for the transformed nodes (which adds around 10 sec). And in some cases the old algorithm seems to be in error, changing for example x.values() to list(list(x.values())) instead of list(x.values()) (although I may be wrong here; I'll check again with other versions).
'any' at higher levels is not supported, so the two (non-default) fixers that use it won't work. I'm not sure if they are worth it, since supporting them may complicate the code and slow down the matcher. I'll look into that too.
I've uploaded the code to the repository. You can test it using the instructions from the previous post (although there are now no debugging messages). The code is a mess, of course, and I'll try to clean it up over the next few days. But overall I'm really satisfied with these results; I think I've made some good progress, and there is still plenty of time for more optimizations. Off for a beer.
Tuesday, July 20, 2010
Progress for this week
Monday, July 12, 2010
Instructions for running the new matcher algorithm so far
Now, in order to test the new module:
1) Checkout a local copy:
svn checkout http://2to3-speedup2.googlecode.com/svn/trunk/ 2to3
2) Create a small file for testing. Add a few statements plus a statement that matches a fixer.
For example:
#test.py
import sys
buffer("Hello World")
(we'll try to match fix_buffer). More complex files can be tested of course, but it's harder to keep track of the process for a larger file.
3) Run ./test2to3.sh with -f options for the currently working fixers (buffer, raw_input, isinstance, raise etc.). Add the -v option to get the tracing messages.
For example, using the test file from step 2, run something like:
./test2to3.sh -f buffer -v test.py
This is the output from running the above command (some of it is omitted):
RefactoringTool: Adding transformation: buffer
RefactoringTool: Refactoring test.py
RefactoringTool: Adding fixer patterns to the automaton
RefactoringTool: The automaton is printed below in DOT
digraph g{
0 -> 1 [label=buffer] //[]
1 -> 2 [label=power] //[(0, 1)]
0 -> 3 [label=8] //[]
3 -> 4 [label=trailer] //[]
4 -> 5 [label=power] //[(0, 2)]
0 -> 6 [label=7] //[]
6 -> 7 [label=trailer] //[]
7 -> 8 [label=power] //[(0, 3)]
}
The constructed automaton looks like this:
Now the automaton has been constructed and we start matching from each leaf up to the root of the AST.
RefactoringTool: Running automaton for each leaf
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:0, content is Leaf(7, '(')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:1, content is Leaf(8, ')')
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:2, content is Leaf(1, 'power')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:3, content is Leaf(20, '<')
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: All nodes above have been checked; moving on to next leaf
RefactoringTool: -------------------------
The above leaves produced no results; the first couple are orphans, and the rest reset the automaton to the start state. The automaton won't even try matching higher when it's been reset, since there is no way a match can be produced. The process goes on until it gets some matches:
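To make the mechanics concrete, here is a minimal sketch of this leaf-upward run. The node class and the string labels are simplified stand-ins, not 2to3's actual types (real 2to3 labels are token numbers), and the transition table mirrors the fix_buffer automaton printed in DOT above:

```python
class Node:
    """Simplified stand-in for a 2to3 AST node."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

# Transition table mirroring the DOT dump above:
# state -> {label: next_state}; accepting states map to a subpattern id.
TRANSITIONS = {
    0: {"buffer": 1, "8": 3, "7": 6},
    1: {"power": 2},
    3: {"trailer": 4},
    4: {"power": 5},
    6: {"trailer": 7},
    7: {"power": 8},
}
ACCEPTING = {2: 1, 5: 2, 8: 3}   # state -> subpattern id

def run_from_leaf(leaf):
    """Feed the leaf's label, then each ancestor's, into the automaton.
    On the first failed transition we stop climbing -- resetting to
    state 0 means no match can be produced further up this path."""
    matches = []                 # (subpattern_id, node) pairs
    state, node = 0, leaf
    while node is not None:
        state = TRANSITIONS.get(state, {}).get(node.label)
        if state is None:
            break                # "moving automaton back to state 0"
        if state in ACCEPTING:
            matches.append((ACCEPTING[state], node))
        node = node.parent
    return matches

power = Node("power")
name = Node("buffer", power)
assert run_from_leaf(name) == [(1, power)]   # subpattern #1 marks 'power'
```

Starting from the '(' or ')' leaf instead would mark the same power node with subpatterns #3 and #2 respectively, just as the log shows.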
RefactoringTool: Leaf ID:44, content is Leaf(1, 'buffer')
RefactoringTool: Name 'buffer' matches : Moving from state node 0 to state node 1
RefactoringTool: Token power matches : Moving from state node 1 to state node 2
RefactoringTool: Subpattern #1 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:45, content is Leaf(7, '(')
RefactoringTool: Token 7 matches : Moving from state node 0 to state node 6
RefactoringTool: Token trailer matches : Moving from state node 6 to state node 7
RefactoringTool: Token power matches : Moving from state node 7 to state node 8
RefactoringTool: Subpattern #3 for fixer #0 matches for node 49
RefactoringTool: Matching failed; moving automaton back to state 0
RefactoringTool: -------------------------
While matching, the automaton changes its state until it reaches a matching state; in that case a subpattern's id is added to the current AST node, and the node is tested for a whole-fixer match. For example, node 2 of the automaton contained a trigger for subpattern #1 and node 8 for subpattern #3. Now node 49 is marked with both of them; if we can match subpattern #2 on the same node we get a full match.
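This match-set bookkeeping can be sketched as follows; the names and the dictionary layout are hypothetical, not the project's actual code:

```python
# One possible shape for the bookkeeping (names are hypothetical):
FIXER_SUBPATTERN_COUNTS = {0: 3}   # fixer id -> number of subpatterns
                                   # (fix_buffer has three)

match_sets = {}   # (node_id, fixer_id) -> set of matched subpattern ids

def mark(node_id, fixer_id, subpattern_id):
    """Record a subpattern hit on a node; return True once every
    subpattern of the fixer has been seen there (a full match)."""
    hits = match_sets.setdefault((node_id, fixer_id), set())
    hits.add(subpattern_id)
    return len(hits) == FIXER_SUBPATTERN_COUNTS[fixer_id]

# Replaying the log above for node 49:
assert mark(49, 0, 1) is False     # subpattern #1 matches
assert mark(49, 0, 3) is False     # subpattern #3 matches
assert mark(49, 0, 2) is True      # subpattern #2 -> SUCCESS
```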
RefactoringTool: -------------------------
RefactoringTool: Leaf ID:47, content is Leaf(8, ')')
RefactoringTool: Token 8 matches : Moving from state node 0 to state node 3
RefactoringTool: Token trailer matches : Moving from state node 3 to state node 4
RefactoringTool: Token power matches : Moving from state node 4 to state node 5
RefactoringTool: Subpattern #2 for fixer #0 matches for node 49
RefactoringTool: SUCCESS: All subpatterns of fixer #0 match for node 49.
RefactoringTool: The node's content is:
Node(power, [Leaf(1, 'buffer'), Node(trailer, [Leaf(7, '('), Leaf(3, '"Hello World"'), Leaf(8, ')')])])
which is effectively:
buffer("Hello World")
The last subpattern for fix_buffer was matched for node #49, so the whole fixer matches. The node is sent to the old matcher for further processing.
RefactoringTool: Refactored test.py
--- test.py (original)
+++ test.py (refactored)
@@ -1,3 +1,3 @@
import sys
-buffer("Hello World")
+memoryview("Hello World")
RefactoringTool: Not writing changes to test.py
RefactoringTool: Files that need to be modified:
RefactoringTool: test.py
This is roughly the whole process of the algorithm. The key point to note here is that the AST is traversed only once, using a combined-pattern automaton to match all fixers in a single pass. From an algorithmic point of view, the time complexity is O(n). Of course, the extra effort needed to mark subpatterns and check the match sets for each fixer somewhat reduces this benefit, but for a sufficiently large number of patterns there should be a significant speedup.
On the issue of alternatives, I must admit I'm not very happy with my progress, as the task has proved much harder than I thought. The code is still a mess and I had to rewrite a large portion. Unfortunately I'll need some more time to get this over with, but hopefully it will be done soon.
Thursday, July 8, 2010
Status report
Wednesday, June 30, 2010
Optimization changes, somewhat promising test results
time for ((c=1; c<=20; c++)) do
./test2to3.sh -f isinstance -f raise -f zip -f raw_input -f buffer \
../django-3k/django/core/servers/basehttp.py
done
yielded:
real 0m10.324s
user 0m9.709s
sys 0m0.292s
The results of a control run with the original code were:
real 0m11.237s
user 0m10.701s
sys 0m0.308s
(These were actually the best out of 10 runs, given time's inconsistency; results were still lower on average. I'll redo the tests with the profiler when I have more time.)
That's only a slight improvement, and there's a catch: basehttp.py is quite a large file. But I'm happy with the fact that it can at least run at roughly the same speed.
An issue I'm currently concerned with is that really common subpatterns may add a significant overhead when they are matched. For example )->trailer->power may be matched a few hundred times; this means that we'll have to update the match sets of perhaps 20-30 fixers, and check if the fixer matches every single time.
If we remove common subpatterns, it will probably speed things up (for example, in the case of raw_input we can throw away all subpatterns except the one starting with 'raw_input'); but then the matcher won't be 100% accurate (although only in very few cases on well-formed code). It can, however, send candidate nodes to the existing matcher (which is what it does right now anyway, to fill the pattern variables), and that can reject them. Plus, this could also deal with negated patterns: just ignore them and the original matcher will do the work. I'll give it a try next week to see if it justifies the added complexity.
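The candidate-filtering idea can be sketched like this; everything here is illustrative, with a cheap over-approximating test standing in for the reduced automaton and a toy predicate standing in for the existing exact matcher:

```python
def filter_candidates(nodes, cheap_test, exact_match):
    """Scan all nodes, but run the expensive exact matcher only on
    those the cheap (over-approximating) test lets through."""
    return [n for n in nodes if cheap_test(n) and exact_match(n)]

# Toy stand-ins: the cheap test only looks for the distinctive
# 'raw_input' token; the exact matcher checks the full call pattern.
def cheap(src):
    return "raw_input" in src

def exact(src):
    return src.startswith("raw_input(") and src.endswith(")")

nodes = ["raw_input()", "raw_input", "input()", "raw_input(prompt)"]
assert filter_candidates(nodes, cheap, exact) == ["raw_input()", "raw_input(prompt)"]
```

The design trade-off is exactly the one described above: the cheap test may admit false positives, but correctness is preserved because the exact matcher gets the final word.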
Plus I changed the blog's template to something that won't hurt your eyes as much.
Monday, June 28, 2010
First version of the bottom matcher
What works:
- raw_input, buffer etc. all fixers that can have their wildcards reduced
- isinstance, and all fixers with wildcards at the bottom
What doesn't:
- Alternatives: they seem easy to implement, will be added next week
- Wildcards at higher levels of the pattern: somewhat rare, left out for now (see ws_comma for an example)
- Negative patterns: matching them should prevent the whole pattern from being matched. They rely on alternatives, so they are scheduled after them
The code is a bit hackish at points, using module globals to grab the AST leaves etc. But as I said, it's a draft. No optimization effort has been made since this was mainly a proof of concept. I'm planning for a rewrite when most functionality is complete, and the algorithm is proven to be faster.
Still, I was a bit disappointed that it ran slightly slower than the current version. I was expecting at least a tiny improvement. I expect the automaton to pick up some speed when it's properly simulated by a table, as it is currently run by traversing linked nodes. Plus we can use threads to... OK, I'm just kidding:)
The code enters at refactor.py:RefactoringTool.refactor_string, where refactor_tree is replaced with the new refactor_tree_btm. Some (horrible) changes have been made to pytree.py. The code can be found at the repository: http://code.google.com/p/2to3-speedup2/
Tuesday, June 22, 2010
A preview of the fixer automaton and current progress
The pattern was power< name='raw_input' trailer< '(' [any] ')' > any* >. It was reduced (by hand) to its minimum subpatterns:
{ '('->trailer->power, raw_input->power, ')'->trailer->power }
from which the above automaton was built. Now consider adding the very similar pattern for the buffer fixer. The resulting automaton would look like this:
The automaton is run for each leaf, and when a subpattern is matched, the current node is marked with its id. When all of a fixer's subpatterns match on a single node, the whole fixer matches. The patterns for other fixers will be added to the same automaton. There are some subtleties here: wildcard subpatterns need to be split at the wildcard positions and added as new subpatterns on all nodes starting with the next token, plus once on the root; alternatives must have their destination nodes combined in a single subpattern; repeaters with constant numbers need to be matched multiple times; etc. But overall, I believe each issue can be handled without restricting the current fixer syntax.
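A sketch of how several fixers' reduced subpatterns could share one automaton, assuming each subpattern is a leaf-to-root label sequence; the structure and names are illustrative, not the project's code:

```python
def build_automaton(fixers):
    """fixers: {fixer_id: [subpattern, ...]}, each subpattern a tuple
    of labels read from the leaf towards the root."""
    states = [{}]              # state -> {label: next_state}
    accepting = {}             # final state -> [(fixer_id, subpattern_id)]
    for fixer_id, subpatterns in fixers.items():
        for subpat_id, labels in enumerate(subpatterns, start=1):
            state = 0
            for label in labels:
                if label not in states[state]:
                    states.append({})
                    states[state][label] = len(states) - 1
                state = states[state][label]
            accepting.setdefault(state, []).append((fixer_id, subpat_id))
    return states, accepting

# raw_input (fixer 0) and buffer (fixer 1): their '(' and ')' subpatterns
# are identical, so those paths share states; only the names branch.
fixers = {
    0: [("raw_input", "power"), ("(", "trailer", "power"), (")", "trailer", "power")],
    1: [("buffer", "power"), ("(", "trailer", "power"), (")", "trailer", "power")],
}
states, accepting = build_automaton(fixers)
s = states[states[states[0]["("]]["trailer"]]["power"]
assert accepting[s] == [(0, 2), (1, 2)]   # shared final state marks both fixers
```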
Sunday, June 13, 2010
Progress Report
Of course running this single fixer with the current algorithm took about 0.018 seconds. But the new algorithm should behave much better when more fixer patterns are added, with processing time remaining close to the current level, so all in all I find this promising.
Let me explain. In theory, bottom-up tree traversal takes number_of_leaves * tree_height steps. Say b is the maximum number of children of a node and h is the height; then this becomes (b^h)*h. Since h is about log n, the traversal is in the O(n*log n) class. Of course, the constants depend on the characteristics of each tree. And since the automaton runs in constant time on each step, the number of comparisons is still O(n*log n).
The current version of the algorithm runs in O(n*pattern_nodes). Since pattern_nodes can be considered constant compared to n, this is O(n). Repeaters can add an overhead here when backtracking, depending on their position in the pattern tree, but let's also consider this a constant factor, just for simplicity (I believe they add a b^h*h factor too; I haven't had time to check this).
The important thing to note is that the algorithm depends on the sizes of the patterns, which means that matching takes longer each time a pattern is added. The proposed algorithm is independent of the pattern sizes(except for the preprocessing part), so this is why I expect this number to stay close to the current level.
While writing this prototype I came to the realization that it is unnecessary to store the leaf-to-root strings; we can simply keep a set of the tree leaves and traverse them by asking each one for its parent. When a fixer transforms the tree, it currently marks the nodes that it changed. This can be extended to propagate the "changed" label to its node's children, and we can use this information to recheck only their descendant leaves.
A couple of other realizations I came to: patterns with wildcards need to be split up, and the subpatterns added to the automaton skipping x states, where x is the subpattern's position; I'm working on the details of the implementation. Second, there is great potential for threading in this algorithm: each thread can simulate the automaton on a different leaf in parallel, so this will be something to consider in the future.
Monday, June 7, 2010
Some benchmarking results, and an implementation plan
No surprises here: the program spends most of its time in the pattern matching functions, as was already known (the green boxes at the bottom left). Since function calls are a significant source of overhead in Python, I'm still confident that a more iterative pattern matching algorithm that avoids the recursion will be able to speed things up. Anyway, as a first reference, here are the results of time:
Command being timed: "/usr/local/bin/python3 /home/giorgos/dev/gsoc/django-3k/setup.py build"
User time (seconds): 386.96
System time (seconds): 0.87
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:37.14
Note that I've built Python with threads disabled, because they often mess up the profiler. But it shouldn't matter; results are always relative, and threads are not used internally in 2to3.
Now, analyzing the algorithm I came up with, I'm pretty sure that theoretically it should offer no benefit compared to the existing one, i.e. I believe it runs the same number of comparisons. Of course, preprocessing a few parts may help overall, but still. On the other hand, as I mentioned above, I think a major issue with the existing algorithm is its recursive nature; the algorithm I propose may benefit from being a bit more "linear".
Anyway, running a few tests, it seems to me that the most (practically) computationally intensive part is obtaining the bottom-up strings, and doing so every time the tree is altered. Profiling showed that it can take up to 0.5 secs to obtain the full set; that's quite a lot, and usually 1-2 fixers are applied per file, so multiply that by 2. That's not so bad, but I'll see in the future if there is a faster way to obtain it.
So, a portion of the algorithm currently looks something like this in pseudocode:
get_bottom_strings:
    # Bottom strings will be lists of the format [(NodeId, TOKENTYPE), ...]
    for leaf in leaves:
        return leaf.type + parent.bottom_string()  # recursively obtain bottom-up strings

get_pattern_dict:
    for fixer in fixers:
        reduce_repeaters(fixer.pattern_tree)
        # reduce repeaters to obtain a constant tree -> WildcardPattern is
        # reduced to its MIN; this may actually create nodes
        for leaf in pattern_tree.leaves:
            obtain bottom-up string
    return pattern_dict
Then we use the pattern dictionaries to build the AC automaton. Now, simulating the AC automaton in Python may be too slow for our purposes here, so perhaps a C version could be considered in the future if that's an option. The automaton is constructed by adding one string at a time, as detailed in the Aho-Corasick paper, and nodes will be represented by an ACNode type with a dictionary as a transition table. On each final node, callbacks will be added to mark the position of the match. Finally, a match dictionary with nodeId as the key will be used to check whether at any point a fixer has been matched.
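As a rough illustration of that design, a minimal ACNode with a dictionary transition table and callbacks on final nodes might look like this; failure links from the Aho-Corasick paper are omitted for brevity, and on a miss this sketch simply restarts at the root:

```python
class ACNode:
    def __init__(self):
        self.goto = {}          # symbol -> ACNode transition table
        self.callbacks = []     # fired when this (final) node is reached

def add_string(root, symbols, callback):
    """Insert one string into the automaton, attaching a callback at
    its final node."""
    node = root
    for sym in symbols:
        node = node.goto.setdefault(sym, ACNode())
    node.callbacks.append(callback)

def feed(root, symbols, matches):
    """Run the automaton over one bottom-up string; on a failed
    transition this sketch simply restarts at the root."""
    node = root
    for sym in symbols:
        nxt = node.goto.get(sym)
        if nxt is None:
            node = root
            continue
        node = nxt
        for cb in node.callbacks:
            cb(matches)

root = ACNode()
hits = []
add_string(root, ["print_stmt", "simple_stmt"], lambda m: m.append("print-fixer"))
feed(root, ["print_stmt", "simple_stmt"], hits)
assert hits == ["print-fixer"]
```

In the real design the callback would update the per-node match dictionary described above, rather than append to a list.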
So that pretty much covers it. I'm at the stage of planning how to fit all this into the existing code. I hope I'll have some time this week (my term exams start on Monday) to push this further.
Tuesday, May 25, 2010
A bottom-up tree matching algorithm
So the first step in the algorithm is to create those leaf-to-root strings, by keeping a record of all leaves and propagating up the tree. In a common python file there are typically a few thousand leaves with an average height of 50-60 nodes. The process can have a significant overhead since it must be run once for every file but the final result may be worth it.
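As a sketch, the leaf-to-root string of a leaf can be built by climbing parent links; the node structure here is a hypothetical stand-in for 2to3's AST nodes:

```python
class Node:
    """Hypothetical stand-in for a 2to3 AST node."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def leaf_to_root_string(leaf):
    """Climb parent links from a leaf, collecting labels up to the root."""
    labels = []
    node = leaf
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels

root = Node("simple_stmt")
stmt = Node("print_stmt", root)
leaf = Node("print", stmt)
assert leaf_to_root_string(leaf) == ["print", "print_stmt", "simple_stmt"]
```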
Now, the main idea behind this algorithm is reducing the search space. We can extract from each complex pattern a "minimum" set of constant pattern paths that we can use to mark candidate nodes. Say for example we have the same pattern tree for the print statement:
The minimum constant path in this case is simple_stmt->print_stmt, which we obtain by reducing the wildcard patterns to their minimum sufficient pattern (in this case, none). After reducing the repeater patterns we can obtain the path dictionary as in Algorithm D, except this time we'll be writing the paths backwards.
For simplicity, let's consider only the above pattern and the single path {print_stmt->simple_stmt}. We proceed by combining the constant patterns into a single Aho-Corasick automaton that resembles the one in Algorithm D. This part of the algorithm can be preprocessed and updated only when the fixer patterns change, and the resulting automaton can be pickled.
The resulting automaton is run on each leaf-to-root string, and the nodes are marked as each constant path is matched. When the full set of sufficient pattern paths has matched, we run the existing matching algorithm on that node; this step is necessary because wildcard patterns still have to be matched exactly in order to return the correct content for each variable.
Is this process worth it? I haven't analyzed it yet. Traversing a tree from the bottom is inefficient since we run into the same nodes again and again. On the other hand, checking linear patterns requires less time than checking tree patterns. Furthermore, by storing the strings as real immutable strings we can take advantage of the optimizations the interpreter provides, instead of making recursive calls to traverse the tree. Plus, preprocessing the fixer patterns into an automaton can save some time.
The next step will be to formally analyze the algorithm and obtain some preliminary profiling results.
Why Algorithm D won't work
When I was hastily writing my proposal for GSoC, I came upon Hoffmann & O'Donnell's paper "Pattern Matching in Trees". The paper's subject seemed to fit 2to3's matching problem perfectly, and I assumed that the algorithm they proposed, Algorithm D, could somehow be extended to include repeaters like * and +. After further research, I realized I was wrong; their algorithm depends on pattern tree nodes having a fixed arity, i.e. no repeaters.
To explain this a bit further, Algorithm D works like this: the pattern tree is converted into a trie, then a dictionary of paths is formed from the trie, and an Aho-Corasick automaton is created from the dictionary. The final states of the automaton define the matching state; when all path-states have been reached, the pattern has been matched.
Here's a demonstration to make this clear:
This is a pattern tree and the resulting trie. The pattern will match any print_stmt that follows any other statement (any is a wildcard).
The numbers correspond to fixed positions in the tree: starting from the root, we follow transition 1 when we encounter the first child, transition 2 for the second, and so on.
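That positional encoding can be sketched as follows: each root-to-leaf path of the pattern tree is flattened into an alternating sequence of labels and 1-based child positions, which is exactly what the numbered trie transitions represent (the tuple-based tree encoding here is illustrative):

```python
def path_dictionary(tree):
    """tree: (label, [child, ...]).  Return every root-to-leaf path as
    an alternating sequence of labels and 1-based child positions."""
    paths = []
    def walk(node, prefix):
        label, children = node
        prefix = prefix + [label]
        if not children:
            paths.append(prefix)
        for pos, child in enumerate(children, start=1):
            walk(child, prefix + [pos])
    walk(tree, [])
    return paths

# The pattern from the figure: simple_stmt with children (any, print_stmt)
pattern = ("simple_stmt", [("any", []), ("print_stmt", [])])
assert path_dictionary(pattern) == [
    ["simple_stmt", 1, "any"],
    ["simple_stmt", 2, "print_stmt"],
]
```

The fixed positions baked into these paths are precisely what breaks down once a repeater lets print_stmt appear at any child index.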
But how could this pattern, that matches a print_stmt in any position, be converted into an automaton?
I was confident that the problem could somehow be worked around and the automata extended to include repeaters, but no matter how much I tried, I have to admit I'm out of my wits here. Even more, I have started to think that the top-down automaton is not capable of matching repeaters, because the problem somehow resembles counting; think about it: if the tree were in linear form and had deep paths left and right, say
simple_stmt( ((((....)))), print_stmt, ((((...)))) )
then you'd have to keep a count of the left and right parentheses: the classic case that exceeds the capabilities of finite automata.
Not all hope is lost, however; I have thought of a different algorithm that will still use the Aho-Corasick automaton, but in a bottom-up manner. More details will follow in the next article.