Tuesday, May 25, 2010

A bottom-up tree matching algorithm

Seeing a tree from the bottom up is a bit like 'unfolding' it. The tree structure is converted into a series of linear paths starting from leaf to root.

So the first step in the algorithm is to create those leaf-to-root strings, by keeping a record of all leaves and propagating up the tree. In a common python file there are typically a few thousand leaves with an average height of 50-60 nodes. The process can have a significant overhead since it must be run once for every file but the final result may be worth it.

Now the main idea behind this algorithm is reducing the search space. We can extract from each complex pattern a "minimum" set of constant pattern paths that that we can use to mark candidate nodes. Say for example we have the same pattern tree for the print stmt:



The minimum constant path in this case is simple_stmt->print_stmt, which we obtain by reducing the wildcard patterns to their minimum sufficient pattern(in this case none). After reducing the repeater patterns we can obtain the path dictionary as in Algorithm D only this time we'll be writing them backwards.

For simplicity lets consider only the above pattern and the single path {print_stmt->simple_stmt}. We proceed by combining the constant patterns into a single Aho-Corasick automaton that resembles the one in Algorithm D. This part of the algorithm can be preprocessed and updated only when the fixer patterns change, and the resulting automaton can be pickled.

The resulting automaton is run on each leaf-to-root string and the nodes are marked when each constant path is matched. When the set of sufficient pattern paths is matched we run the existing matching algorithm on this node; this step is necessary because wildcard patterns have to still be matched exactly in order to return the correct content for each variable.

Is this process worth it? I haven't analyzed it yet. Traversing a tree from the bottom is inefficient since we run into the same nodes again and again. On the other checking linear patterns requires less time than tree patterns. Furthermore storing the strings as real immutable strings we can take advantage of the optimizations the interpreter provides, unlike running recursive calls to traverse the tree. Plus preprocessing the fixer patterns into an automaton can save some time.

The next step will be to formally analyze the algorithm and obtain some preliminary profiling results.

Why Algorithm D won't work

When I was hastily making my proposal for GSOC, I came upon Hoffman & O'Donnell's paper "Pattern Matching in Trees". The paper's subject seemed to fit perfectly 2to3's matching problem and I assumed that the algorithm they proposed, Algorithm D, could be somehow extended to include repeaters like * and +. After further research, I realized I was wrong; their algorithm depends on pattern tree nodes having a fixed arity, i.e. no repeaters.

To explain this a bit further, Algorithm D works like this: The pattern tree is converted into a trie, then a dictionary of paths is formed from the trie and an Aho-Corasick automaton is created from the dictionary. The final states of the automaton define the matching state; when all path-states have been reached the pattern has been matched.


Here's a demonstration to make this clear:

This is a pattern tree and the resulting trie. The pattern will match any print_stmt that follows any other statement(any is a wildcard).

The path dictionary would be {simple_stmt1, simple_stmt2print_stmt}. The resulting automaton is easily constructed from the trie and it would look like this(fail transitions are omitted):


The numbers correspond to a fixed position in the tree, that is starting from root we follow transition 1 when we encounter the first child, transition 2 for the second and so on.
But how could this pattern, that matches a print_stmt in any position, be converted into an automaton?



I was confident that the problem could be somehow worked around and extend the automata to include repeaters. No matter how much I tried I admit that I'm out of wits here. Even more, I have started to think that the top-down automaton is not capable of matching repeaters, because it somehow resembles counting; think about it, if the tree was in linear form and had deep paths left and right, say

simple_stmt( ((((....)))), print_stmt, ((((...)))) )

then you'd have to keep a count of the left&right parentheses: the classic case that exceeds the capabilities of finite automata.

Not all hope is lost however; I have thought of a different algorihm that will still use the Aho Corasick automaton but in a bottom-up manner. More details will follow in the next article

Followers