Re: \Questions about the \Future of \Escapes
From: Karl Williamson
Date: March 4, 2012 20:07
Subject: Re: \Questions about the \Future of \Escapes
Message ID: 4F543BC8.3030602@khwilliamson.com

I'm wondering if anything about this issue should go into the v5.16
perldelta, such as in the Future Deprecations section.
On 01/15/2012 02:37 PM, demerphq wrote:
> On 13 January 2012 19:31, Nicholas Clark<nick@ccl4.org> wrote:
>> On Thu, Jan 12, 2012 at 12:53:03PM +0100, demerphq wrote:
>>> On 11 January 2012 16:18, Nicholas Clark<nick@ccl4.org> wrote:
>>>>> FWIW, I got pretty far with pushing \u\U\l\L\E into the regex engine,
>>>>> but it ran into some issues. We probably want to let the toker handle
>>>>> \Q \E, as otherwise (?{ ... }) gets really tricky. However in order to
>>>>> support \Q "properly" the toker must know about the \U, \L and \E's as
>>>>> well.
>>>>
>>>> \Q in the tokeniser, \L etc elsewhere, and both sharing \E troubles me.
>>>> This may be unfounded on my part. I'd also hate for qr// and "" to
>>>> diverge further - ie do double quoted strings need a similar imposition
>>>> of sanity?
>>>
>>> I think if we are going to change it we are going to have to let the
>>> regex engine handle \Q as well. I think Dave's recent work (see
>>> comment in this thread) may make things easier.
>>
>> Yes. In that I think conceptually \Q belongs with \U and \L (and \F) as
>> they all fight over \E
>>
>> What's bugging me somewhat is that if \Q, \U, \L (and \F) (and \l and \u)
>> move to be implemented in the regex engine, then that seems to be code
>> duplication, as the same escapes have to be implemented in the tokeniser
>> for "".
>
> This is already the case for pretty much all escapes. They have to be
> implemented in both because even if the toker DOES handle them, one
> can synthesize the escape anyway, and then feed it to the regex
> engine. So for instance even when the toker handled \x{} the regex
> engine also had to handle it. I think that it makes sense to include
> the casemod escapes as well.
>
> This should work just fine:
>
> my $esc="\\";
> my $hex="DF";
> print "yay" if "\x{DF}" =~ /${esc}x{$hex}/;
>
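A minimal sketch of the synthesis point, written so that no literal \x{...} ever appears in the pattern source. The variable names $src and $synth_ok are only for illustration; the pattern text is assembled by plain string concatenation, so whatever interprets the escape has to be the regex compiler.

```perl
use strict;
use warnings;

my $esc = "\\";                       # a single backslash character
my $hex = "DF";
my $src = $esc . "x{" . $hex . "}";   # the six characters \x{DF}, built at run time

# The toker never sees a literal \x{...} here; the regex compiler has to
# interpret the escape itself when it compiles the interpolated text.
my $synth_ok = ("\x{DF}" =~ /$src/) ? 1 : 0;

print "pattern source: $src\n";
print "yay\n" if $synth_ok;
```

The same construction works for any escape the regex engine understands, which is why handling an escape only in the toker is never sufficient.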
>> Which is making me wonder, instead, whether it works to move \X and \N
>> (back) to the tokeniser for regular expressions, provided that the (internal)
>> regular expression API has become flexible enough (now that it takes a
>> list of things, not a simple scalar) to be passed "literal string"
>> (which would effectively be used for \X, \N and I guess as it would now
>> exist, the guts of \Q...\E)
>
> I do not think it does, mostly for the same reason I mention above.
>
>> But, was there one reason or two reasons that \X and \N moved from the
>> tokeniser? One IIRC was to stop qr/\x{2E}/ being the same as qr/./
>
> Afaik this was the reason. Except it wasn't a matter of /moving/ \x{}
> to the regex engine, it was of *removing* it from the toker for
> regexes. Again for the pattern synthesis example.
>
>> But was there a second related to some values for \N{} mapping to multiple
>> code points, which can't be expressed well in a string passed from toke.c
>> to regcomp.c ?
>
> I think you have it sort of backwards. Moving \N{} handling completely
> out of the toker for regexes meant that when concatenating a pattern
> using an \N{} with a lexically scoped definition there would be
> trouble when concatenating that regex with something else in a
> different scope where the \N{} had no meaning. I believe we went with
> the strategy that the toker would essentially convert the \N{} with a
> lexically scoped name into a special construct that had a constant
> value. Then the regex engine would "see" the converted value, and
> later on when it was concatenated do the right thing.
>
>>
>>>>> Here are the rules:
>>>>>
>>>>> \U and \L case-modify non-casemod text until the end of the string or
>>>>> the next relevant encountered \E. If there is already an unterminated
>>>>> \L or \U in effect, then the new \U or \L will end any casemodifiers
>>>>> still in effect (note: this is not a typo: \Q does not end any previous
>>>>> \Q \U or \L, but \U and \L do end any previous \Q).
>>>>>
>>>>> $ perl -le'print "\U[one]\Q[TWO]\L[THREE]"'
>>>>> [ONE]\[TWO\][three]
>>>>> $ perl -le'print "\Q[one]\U[TWO]\L[THREE]"'
>>>>> \[one\]\[TWO\]\[three\]
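The two one-liners can be spelled out with the builtins they correspond to. This is a sketch under the assumption that the rule as clarified later in the thread is the right reading; the *_by_hand names are illustrative only.

```perl
use strict;
use warnings;

# \L ends the \U, and with it the \Q that was opened after the \U.
my $first         = "\U[one]\Q[TWO]\L[THREE]";
my $first_by_hand = uc("[one]" . quotemeta("[TWO]")) . lc("[THREE]");

# Here the \Q was opened before the \U, so it survives the \L.
my $second         = "\Q[one]\U[TWO]\L[THREE]";
my $second_by_hand = quotemeta("[one]" . uc("[TWO]") . lc("[THREE]"));

print "$first\n$first_by_hand\n";
print "$second\n$second_by_hand\n";
```

If the reading is right, each interpolated string and its hand-written form print the same line.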
>>>>
>>>> I think you misdescribed just the parenthesised note, as your text
>>>> contradicts your second example. \U and \L terminate any previous \U or \L,
>>>> acting as an implicit \E at that point.
>>>
>>> I don't think the second example contradicts my parenthesised note.
>>> The point was that the \L does terminate the \U but does not terminate
>>> the \Q, because the \Q is *in front* of the \U. However the first
>>> example shows how the \L terminates both the \U and \Q because the \Q
>>> is after the \U.
>>>
>>> I was trying to say that \U and \L terminate any previous \U or \L
>>> should they exist, and when they do, also terminate *anything* that
>>> came after the previous \L or \U.
>>
>> Yes, that's what I missed. I guess this means that your phrasing was
>> ambiguous (given that I was able to mis-interpret it).
>
> Indeed. Speaking clearly about this stuff is hard. Describing madness
> rationally is difficult :-).
>
>> Hence maybe change it to:
>>
>> (note: this is not a typo \Q does not end any previous \Q \U or \L, but
>> \U and \L do end any previous \Q if they are also ending a previous \L
>> or \U)
>>
>> although my wording feels clunky to me.
>
> This is an improvement tho. Thanks.
>
>> That also feels somewhat like an implementation bug. Purposefully without
>> looking at the code, that sounds like \U or \L save the current state,
>> and *if* \U or \L is active, such that the next \U or \L needs to act as
>> an implicit \E, then it restores state, where that "state" includes whether
>> \Q is active. (Instead of just restoring the state of casefolding)
>
> Having looked at the code I can say that your description is pretty
> spot on. I might quibble as to what sort of bug it was, I don't think
> it was an oversight, more like a deliberate side-effect of the chosen
> implementation. Consider what /\Ux\LY/ should produce:
>
> uc("x") . lc("Y");
>
> If we put a \Q in between /\Ux\Q \LY/ we get this:
>
> uc("x" . quotemeta(" ")) . lc("Y");
>
> If we don't terminate the quotemeta then we would end up with this:
>
> uc("x" . quotemeta(" ". lc("Y")))
>
> which would not lc() the "Y" (because of the uc).
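To make the difference between the two expansions concrete, here is a small sketch that evaluates both next to the interpolated string. It uses a double-quoted string rather than a pattern, on the assumption that the toker treats the casemod escapes the same way in both; the variable names are illustrative.

```perl
use strict;
use warnings;

my $interp     = "\Ux\Q \LY";                          # the interpolated form
my $terminated = uc("x" . quotemeta(" ")) . lc("Y");   # \L ends both \U and \Q
my $nested     = uc("x" . quotemeta(" " . lc("Y")));   # \Q left open: uc() wins over lc()

printf "%-11s %s\n", $_->[0], $_->[1]
    for [interp => $interp], [terminated => $terminated], [nested => $nested];
```

The nested form keeps the "Y" uppercase because the outer uc() is applied last, which is the outcome the implicit termination avoids.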
>
>> This just doesn't seem sane. Either \U, \L (and \F) should work orthogonally
>> to \Q, or any of the three (four) should terminate the others.
>
> I was thinking of trying to make it so that we would produce:
>
> uc("x". quotemeta(" ")) . lc(quotemeta("Y"))
>
> Effectively require an \E to finish a \Q.
>
> BTW, this reminds me about the whole warning on useless use of \E. If
> we are going to warn about that should we not also warn about useless
> use of \l and \u? For instance in /\Ufoo \lbar/ the \l is useless, we
> will evaluate something like this:
>
> uc("foo " . lcfirst("bar"))
>
> So if we /are/ going to warn on \E then we probably should warn on
> useless use of \l and \u too.
>
>>>>> \l and \u case-modify the next non-casemod text in the string, or
>>>>> nothing if there is no non-casemod text in between it and the next \U
>>>>> \Q \L or \E. Any preceding \L or \U take precedence, except in the
>>>>> case where the \l or \u immediately follow an \L or \U in which case
>>>>> the \l or \u take precedence.
>>>>> $ perl -le'print "\lFOO"'
>>>>> fOO
>>>>> $ perl -le'print "\ufoo"'
>>>>> Foo
>>>>> $ perl -le'print "\L\ufoo \ubar"'
>>>>> Foo bar
>>>>> $ perl -le'print "\U\lfoo \lfoo"'
>>>>> fOO FOO
>>>>
>>>> That \L or \U take precedence is, um, strange, counter intuitive and less
>>>> useful than the other way round would be.
>>>
>>> And to make it more fun Ill repeat what I said to Tom C:
>>>
>>> \U\l and \L\u are both special cased to be parsed as \l\U and \u\L.
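The four examples, together with the \U\l and \L\u special case, can be checked against hand-written lcfirst()/ucfirst()/lc()/uc() calls. The right-hand expressions are one reading of the rules as described in this thread, not something taken from toke.c, and @cases and @mismatches are illustrative names.

```perl
use strict;
use warnings;

# Each pair is [interpolated string, hand-written equivalent].
my @cases = (
    [ "\lFOO",         lcfirst("FOO") ],
    [ "\ufoo",         ucfirst("foo") ],
    [ "\L\ufoo \ubar", ucfirst(lc("foo " . ucfirst("bar"))) ],   # \L\u read as \u\L
    [ "\U\lfoo \lfoo", lcfirst(uc("foo " . lcfirst("foo"))) ],   # \U\l read as \l\U
);

my @mismatches = grep { $_->[0] ne $_->[1] } @cases;
print "$_->[0]\n" for @cases;
print "mismatches: ", scalar(@mismatches), "\n";
```

In the last two cases the inner ucfirst()/lcfirst() has no visible effect, because the enclosing lc()/uc() is applied afterwards.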
>>
>> That then makes a lot more sense, coupled with the "explain how it maps
>> to uc(), lcfirst() etc".
>
> Yes indeed. But it is also a special case.
>
> I could imagine different rules where this would not have to be a
> special case. Like the rules that would produce this:
>
> /\Ufoo \lbar/ => upper("foo ") . lcfirst(upper("bar"))
> /\Ux\Q \LY/ => upper("x" . quotemeta(" ") ) . lower(quotemeta("Y"))
> /\Ux\Q \E\LY/ => upper("x" . quotemeta(" ") ) . lower("Y")
> /\Ux\Q \lP\E\LY/ => upper("x" . quotemeta(" ") ) .
> lcfirst(upper(quotemeta("P"))) . lower(quotemeta("Y"))
>
> Maybe something like this:
>
> Case modifiers are divided into three groups:
>
> \Q => "non-case" modifiers quotemeta
> \U \L => "inner" case modifiers: uc(), lc()
> \u \l => "outer" case modifiers: ucfirst(), lcfirst()
>
> At any one time, zero or one of each may be active.
> If more than one is active at any one time, then they are applied in
> order of 'non-case', 'inner', 'outer' [ that is:
> lcfirst(uc(quotemeta($x))) ]
> \Q \U \L are terminated by a nesting \E.
> \U and \L override each other, but stack.
> \Q stacks with \U and \L.
> \l and \u override each other and do not stack.
>
> So for instance:
>
> "\Ux\Q \LX\E y\lzx\E z\Et"
>
> Would evaluate to: upper("x") . quotemeta(" "). lower(quotemeta("X"))
> . upper(quotemeta(" y")) . lcfirst(upper(quotemeta("zx"))) . upper("
> z") . "t"
>
> Something like that anyway.
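For reference, the expansion of the proposed rules can be evaluated directly with the builtins, reading upper() as uc() and lower() as lc(). This only transcribes the expression given above; it says nothing about what any released perl does with the same string, and $proposed is an illustrative name.

```perl
use strict;
use warnings;

# Proposed (hypothetical) semantics for "\Ux\Q \LX\E y\lzx\E z\Et",
# transcribed from the expansion above with upper() => uc(), lower() => lc().
my $proposed =
      uc("x")
    . quotemeta(" ")
    . lc(quotemeta("X"))
    . uc(quotemeta(" y"))
    . lcfirst(uc(quotemeta("zx")))
    . uc(" z")
    . "t";

print "$proposed\n";
```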
>
> cheers,
> Yves
>