When I was hastily making my proposal for GSOC, I came upon Hoffman & O'Donnell's paper "Pattern Matching in Trees". The paper's subject seemed to fit perfectly 2to3's matching problem and I assumed that the algorithm they proposed, Algorithm D, could be somehow extended to include repeaters like * and +. After further research, I realized I was wrong; their algorithm depends on pattern tree nodes having a fixed arity, i.e. no repeaters.
To explain this a bit further, Algorithm D works like this: The pattern tree is converted into a trie, then a dictionary of paths is formed from the trie and an Aho-Corasick automaton is created from the dictionary. The final states of the automaton define the matching state; when all path-states have been reached the pattern has been matched.
Here's a demonstration to make this clear:
This is a pattern tree and the resulting trie. The pattern will match any print_stmt that follows any other statement(any is a wildcard).
The numbers correspond to a fixed position in the tree, that is starting from root we follow transition 1 when we encounter the first child, transition 2 for the second and so on.
But how could this pattern, that matches a print_stmt in any position, be converted into an automaton?
I was confident that the problem could be somehow worked around and extend the automata to include repeaters. No matter how much I tried I admit that I'm out of wits here. Even more, I have started to think that the top-down automaton is not capable of matching repeaters, because it somehow resembles counting; think about it, if the tree was in linear form and had deep paths left and right, say
simple_stmt( ((((....)))), print_stmt, ((((...)))) )
then you'd have to keep a count of the left&right parentheses: the classic case that exceeds the capabilities of finite automata.
Not all hope is lost however; I have thought of a different algorihm that will still use the Aho Corasick automaton but in a bottom-up manner. More details will follow in the next article
No comments:
Post a Comment