Exams start tomorrow and I've spent most of this week studying. But I've had the chance to start coding a prototype of the matching algorithm I proposed. It is completed to the point where it simulates the raw_input fixer pattern. Profiling results look promising; for example for this one fixer it took about 0.087 CPU seconds on a single file(preprocessing aside).
Of course running this single fixer with the current algorithm took about 0.018 seconds. But the new algorithm should behave much better when more fixer patterns are added, with processing time remaining close to the current level, so all in all I find this promising.
Let me explain. In theory, bottom-up tree traversal takes number_of_leaves*tree_height steps. Say b is the maximum number of children in a node and h is the height then this becomes (b^h)*h. h is about logn so the the traversal is in the O(n*logn) class. Of course the constant numbers depend on the characteristics of each tree. So, since the automaton runs in constant time on each step the number of comparisons is still O(n*logn).
The current version of the algorithm runs in O(n*pattern_nodes). Since pattern_nodes can be considered a constant number compared to n, this is O(n). Repeaters can add an overhead here when backtracking, depending on their position on the pattern_tree. But let's also consider this a constant factor, just for simplicity(I believe they add a b^h*h factor too; I have no time to check this).
The important thing to note is that the algorithm depends on the sizes of the patterns, which means that matching takes longer each time a pattern is added. The proposed algorithm is independent of the pattern sizes(except for the preprocessing part), so this is why I expect this number to stay close to the current level.
While writing this prototype I came to the realization that it is unnecessary to store the leaf-to-root strings; we can simply keep a set of the tree leaves and traverse them by asking each one for its parent. When a fixer transforms the tree, it currently marks the nodes that it changed. This can be extended to propagate the "changed" label to its node's children, and we can use this information to recheck only their descendant leaves.
A couple of other realizations I came to: patterns with wildcards need to be split up, then the subpatterns are added to the automaton skipping x states, where x is the subpattern's position. I'm working on the details of the implementation. Second, there is a great potential for threading on this algorithm, each thread can simulate the automaton on a different leaf in parallel so this will be something to consider in the future.
Taking in consideration that this work is mostly CPU bound, do you think that making it threaded will improve its performance?
ReplyDeleteUsing multiprocessing could utilize an idle core, and I think that this part of the algorithm lends itself better to parallelization rather than the current one-process-per-file level. In any case, it seems easy to try and even a small improvement could be worth it.
ReplyDelete