On 30 December 2016 at 20:17, Karl Williamson <public@khwilliamson.com> wrote: > On 12/29/2016 07:59 AM, Zefram wrote: >> >> Dave Mitchell wrote: > Node boundaries introduce bugs with folding. qr/[sS][sS]/i currently > doesn't match ß because of the separate nodes. So your trick of > transforming /(abc|defg)vwxyz/ into [ad][be][cf][vg][wv][xw][yx][zy] > is not always valid, depending on the particular things a,b,c...xyz are. I am not keen on this optimisation, especially as it essentially creates a trie out of charclasses. I would prefer we just make the TRIE code faster by not making it support huge alphabets and eliminating the compressed table representation. And in fact IMO the proposed optimisation here would fall afoul of the same problem that forced the compressed table representation. What would happen if the alternation had a huge alphabet, and a huge number of branches? With better trie and optimisation logic we could then turn this into: (abcvwxyz|defgxvwxyz). In a trie, which IMO would be more efficient than turning it into a charclass. Anyway, a lot of these optimisations are IMO very difficult because we do not use a sane compiler process. What we do is relatively fast, but it makes optimisation hugely difficult. As I mention in a different reply I have been working on a perl based regex optimising parser, which constructs a proper optree and then transforms and optimises the optree. If we had such a thing in C then we could do MANY things we currently cannot. YvesThread Previous | Thread Next