George's GSOC Blog: Progress for this week

Just a quick update before today's meeting, I've spent the last week trying approaches to support alternatives and I had to start from scratch twice. I'll certainly won't use the word easy so freely again.

First of all realized that it's impossible to deduct the full pattern structure reliably from the compiled pattern, so spent much of my time into rewriting this part to deduct the subpatterns from the intermediate tree that the pattern compiler generates. There were some challenges, but I think I've reached the point where patterns can be extracted reliably.

Now, my first approach was converting a pattern with alternatives into multiple simpler patterns, creating n^b new patterns, n being the nesting factor and b the number of alternatives per level of nesting. Needless to say that this quickly exploded the number of patterns so I quickly abandoned it.

The second approach was adding multiple transitions at nodes that correspond to alternatives. This approach is a bit more complex; for one thing there are cases where a single pattern is alternative to a group of patterns

i.e. ( a | b c). This means that 'a' is equivalent to matching 'b' and 'c'; this goes over the automaton's capabilities since we need to count. The same happens for (a | b<c >), which requires extra nodes, let alone (a | b<c> d<e>) and so on. And on top of all these, any of them could be a wildcard. The cases may be few, but there are enough. Attempting to implement this approach was frustrating; there were always more and more cases that kept breaking the code. Plus the complexity of the whole thing gave me the impression that it would really slow down the matcher.

So after thinking this over, I've realized that by using the 'characteristic pattern' concept I had mentioned could simplify alternatives support, since only a single linear pattern will be matched with multiple transitions for each group of alternatives, eliminating most of the complexity of the previous approach. At this point the code is already able to extract this single patterns; it turns out that it is much easier than I thought at first and the results look pretty good. Most fixer patterns contain either an uncommon keyword or character so the simple approach of preferring these subpatterns yields satisfying results. For example the result for the fix_filter pattern is:

[(['lambda', 300, 259, 332, 308], ['filter, 308'], ['filter', 308])]

The tuple means 'a group of alternatives'. The last two alternatives can be merged, so we are left with only two. Using these subpatterns to match this fairly complex fixer means that we only check the cases that start with these not-so-common keywords, 'lambda' and 'filter'. Now, as far as alternatives are concerned the logic behind this is to add those subpatterns as if they weren't alternatives, and merge their final states. Having a single subpattern to match also skips the step of checking for a complete match set for a fixer, since a single match is now adequate.

Of course this means that the accuracy of the matcher will be somewhat reduced and more candidate nodes will be marked for further matching, i.e. buffer('test') but also buffer.test(); but obviously this is not common and from what I've seen so far it won't be much of a concern. A benefit of using a non-accurate matcher is that it simplifies the handling of negative patterns: they are treated as wildcards and the actual matcher is responsible for sorting them out.

So, overall, I think that I'm finally getting closer to a matcher that works for almost all fixers(with the exception of the explicit ws_comma and idioms fixers, which use wildcards at higher levels).

George's GSOC Blog

Tuesday, July 20, 2010

Progress for this week

No comments:

Post a Comment

Followers

Blog Archive