develooper Front page | perl.perl5.porters | Postings from January 2010

Re: [perl #56444] (5.12 blocker) delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage

Thread Previous | Thread Next
From:
demerphq
Date:
January 26, 2010 13:39
Subject:
Re: [perl #56444] (5.12 blocker) delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage
Message ID:
9b18b3111001261339x5316f482jbf4b1a01e4c63e2e@mail.gmail.com
2010/1/26 Tom Christiansen <tchrist@perl.com>:
> Yves gmailed:
>
>>>> They can be multi.
>
>>> They can?  Is this due solely to Perl's uses of aliases in via
>>> \N{} charnames, or does the Unicode standard somewhere specify
>>> that this be supported?
>
>> I think it was because it seemed like a good idea at the time
>> and maybe it could be useful.
>
> Oh, it does, it does.  That's why I showed my example.
> I just hadn't considered the price one might have to pay.

I suspect many of perls gnarlier features come from just that problem. :-)


>>> I confess I've wanted a way to simply say something like
>>>
>>>  use charnames qw[:full :alias] => {
>>>    LAQ          =>   "SINGLE LEFT-POINTING ANGLE QUOTATION MARK",
>>>    RAQ          =>   "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK",
>>>    d_dental     => [ "LATIN SMALL LETTER D", "COMBINING BRIDGE BELOW" ],
>>>    b_approx     => [ "GREEK SMALL LETTER BETA", "COMBINING DOWN TACK BELOW" ],
>>>    aw_chicago   => [ "SMALL LETTER OPEN O", "COMBINING DIAERESIS",
>>>                      "COMBINING DOWN TACK BELOW" ],
>>>    dezh_breath  => [ "LATIN SMALL LETTER DEZH DIGRAPH",
>>>                      "COMBINING DIAERESIS BELOW" ],
>>>    tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH",
>>>                      "MODIFIER LETTER SMALL H" ],
>>>    tezh_affric2 => [ "LATIN SMALL LETTER T",
>>>                      "COMBINING DOUBLE INVERTED BREVE",
>>>                      "LATIN SMALL LETTER ESH",
>>>                      "MODIFIER LETTER SMALL H" ],
>>>    tezh_affric  =>   "tezh_affric1",
>>>  };
>>>
>>>  print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n";
>>>  print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";
>>>
>>> and have that "just work" without having to go writing a module
>>> with a custom import that uses
>>>
>>>   $^H{charnames} = \&translator;
>
>>>>> I was planning to add support for named sequences (where a multi-code
>>>>> point string is returned) for 5.14, and have worked out an easy way
>>>>> to get character classes to allow them, namely to use a transformation
>>>>> based on the fact that [a-c] is a shorthand for a|b|c, so that
>>>>> /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})
>
> That's pretty costly--more than an order of magnitude--under the current
> system; see below.  Do you have some overhaulish plans for that?
>
>>>>   /[\N{foo}]/
>>>
>>>> would convert into
>>>
>>>>   /[\N{U+1:2:3:4}]/
>
>>> I can easily generate \N{U+1.2.3.4} via
>>>
>>>    printf "\\N{U+%04vX}", $str;
>>>
>>> but not the colon version.  Given that code points are non-
>>> negative integers, is there a reason to prefer the colon
>>> version over the dot version?
>
>> Er, no not really. I just thought it looked better than commas and it
>> didnt occur to me to try dots. And well, no reason we cant TMTOWTDI :-)
>
> True, that.

And well in line for traditions sake. :-)


>>>> Additionally the non-charclass part of the engine engine would need
>>>> to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that
>>>> should be fairly easy I think.
>
>>> I should think so.  Some are harder though, like these
>>>
>>>     [X[:digit:]Y[:^word:]\N{many}Z]
>>>    [^X[:digit:]Y[:^word:]\N{many}Z]
>>>
>>> becoming something like
>>>
>>>       (?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
>>>    (?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))
>
>> Im not getting you actually.
>
> I was winging conversion of [abc] and [^abc] to alternation.  The first
> is easy as (?:want|this), but the second requires a notted alternation,
> which I've always coded as something like (?:(?!dont|want)(?s:.))

Ah, I was thrown off becuase the [:^word:] is a special case in the
regex engine anyway.

We dont (just) store all the characters involved. We store a bit that
says "we match non word characters"

>> it could be
>>
>>    [X[:digit:]Y[:^word:]${first_of_many}Z](?:(?<=${first_of_many})$rest_of_many)
>
> I can see why separating out the first character might be more efficiently
> executed, but shouldn't a good trie implementation do that on its own?

Oh yes. i was just thinking of the non-trie case.

>> It could also be internally:
>>
>>    (?:[X[:digit:]Y[:^word:]Z]|\N{many})
>
>> Alternatively, consider that a characterclass is equivalent to
>> a trie with depth 1. So charclasses could be eliminated/merged
>> with tries (they nearly are already) and then this would just
>> be a trie node constructed directly.
>
> Mapping [abc] to (?:x|y|z) is straightforward enough, and with
> a good trie implmentation should be reasonably efficient.  But
> (naïve) alternation still comes out to be severely expensive.
>
>    use Benchmark qw[ :hireswallclock cmpthese ];
>    $str = "=" x 120 . "super" x 4;
>    cmpthese($count, {
>       CharClass    => q{ $str =~ /[a-z]+/ },
>       Alternation  => q{ $str =~ /(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+/ },
>    });
>
> Gives:
>
>    Platform A:         5.8.8     5.10.0     5.11.3     5.11.4
>        CharClass   1496772/s  1284750/s  1070296/s  1047791/s
>        Alternation   13816/s    39727/s    38128/s    35856/s
>
>    Platform B:         5.8.8     5.10.0     5.10.1     5.11.3
>        CharClass    238432/s   233372/s   226796/s   178877/s
>        Alternation    3377/s    10223/s     9930/s     9816/s
>
> So casually converting to alternation has a big performance hit.

Which version of perl is this? This shouldnt construct a trie at all,
or not a full one. And in that case it should be nearly as fast.

Also, it seems to me you are comparing the time to construct the
pattern as well. That is probably not wise and would probably swamp
the benefit of using a trie.

> With Unicode properties, it can be even worse.
>
>    use Benchmark qw[ :hireswallclock cmpthese ];
>    $str = "=" x 120 . "super" x 4;
>    cmpthese($count, {
>       CharClass   => q{ $str =~ /[a-z]+/ },
>       Alphabetic  => q{ $str =~ /\p{Alphabetic}+/ },
>       Latin       => q{ $str =~ /\p{Latin}+/ },
>    });
>
> Gives:
>
>    Platform A:        5.8.8     5.10.0     5.11.3     5.11.4
>        CharClass  1505280/s  1282982/s  1073828/s  1058097/s
>        Latin       147552/s   146616/s    52130/s    52463/s
>        Alphabetic  146457/s   145691/s    52118/s    52049/s
>
>    Platform B:        5.8.8     5.10.0     5.10.1     5.11.3
>        CharClass   241630/s   231872/s   231226/s   175170/s
>        Latin        28525/s    29563/s     7204/s     5655/s
>        Alphabetic   28291/s    29513/s     7139/s     5303/s

Thats not a great comparison unfortunately. The way that unicode
charclasses are handled is *extremely* inefficient. And a unicode
property is really just a special charclass.

>
> Apart from the big Unicode hit, (quite nearly) everything keeps
> getting ever sl.o..w...e....r, as $] increases.  I shouldn't
> care to project that trend-line out a few more versions. :(

Yes this is true. However it also gets more correct, and more
powerful. And does things it didnt used to do.

So for instance we used to be faster but with the trade off that a
regex could segv your perl and uncatchably terminate your process.
That doesnt happen anymore. We used to do a number of things that were
actually wrong, and fixing stuff often has a negative performance
impact. Its really easy to be fast if you dont have to worry about
correctness.

Having said that we do have some very serious breakage. For instance
the superlinear cache was broken when we de-recursivized the regex
engine, and nobody has had the combination of the tuits, inclination,
and skills to fix it.

>>> although I imagine those would really be \p{prop}, \P{prop},
>>> and \N{U+c1:c2:c3} (well, or \N{U+c1.c2.c3} ), or lastly
>>> some (?u:\x{C1}\x{c2}\x{c3}) equivalent.
>
>> Like i said, (?u:...) doesnt save us here. We need a new *escape* for
>> this purpose, as (?u:...) would always be a legitimate content in a
>> charclass, and the time we would do the transformation would occur at
>> such an early time that we couldnt tell we were in a charclass, or so
>> late that we couldnt tell that the string was originally an \N{..}.
>
> Drat.
>
>> So we have to use a new/modified escape sequence And for the reasons
>> that Karl pointed out earlier that either means something new, or
>> using \N{U+...}.
>
> Ok.

We have metphorically painted ourselves in a corner. Luckily it is a
hyperroom and if we can slide off into the 7th dimension we probably
can get out without messing up our shoes (or our paint job).

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About