develooper Front page | perl.perl5.porters | Postings from December 2016

Re: some thoughts on 8-bytes-at-a-time in the regex engine

Thread Previous | Thread Next
From:
demerphq
Date:
December 31, 2016 15:36
Subject:
Re: some thoughts on 8-bytes-at-a-time in the regex engine
Message ID:
CANgJU+WKy=1wPxmbOZNZY6FXP6scrD2jvhSD006YWu5ABY3Y1Q@mail.gmail.com
On 30 December 2016 at 20:17, Karl Williamson <public@khwilliamson.com> wrote:
> On 12/29/2016 07:59 AM, Zefram wrote:
>>
>> Dave Mitchell wrote:
> Node boundaries introduce bugs with folding.  qr/[sS][sS]/i currently
> doesn't match ß because of the separate nodes.  So your trick of
> transforming /(abc|defg)vwxyz/ into [ad][be][cf][vg][wv][xw][yx][zy]
> is not always valid, depending on the particular things a,b,c...xyz are.

I am not keen on this optimisation, especially as it essentially
creates a trie out of charclasses.

I would prefer we just make the TRIE code faster by not making it
support huge alphabets and eliminating the compressed table
representation.

And in fact IMO the proposed optimisation here would fall afoul of the
same problem that forced the compressed table representation. What
would happen if the alternation had a huge alphabet, and a huge number
of branches?

With better trie and optimisation logic we could then turn this into:

(abcvwxyz|defgxvwxyz).

In a trie, which IMO would be more efficient than turning it into a charclass.

Anyway, a lot of these optimisations are IMO very difficult because we
do not use a sane compiler process. What we do is relatively fast, but
it makes optimisation hugely difficult.

As I mention in a different reply I have been working on a perl based
regex optimising parser, which constructs a proper optree and then
transforms and optimises the optree. If we had such a thing in C then
we could do MANY things we currently cannot.

Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About