develooper Front page | perl.perl5.porters | Postings from January 2012

Re: \Questions about the \Future of \Escapes

From:
demerphq
Date:
January 15, 2012 13:37
Subject:
Re: \Questions about the \Future of \Escapes
Message ID:
CANgJU+Vr4NCpdxDdmcf2-6R2wF0BwrykgH3_ww9tUDqdnwo+mQ@mail.gmail.com
On 13 January 2012 19:31, Nicholas Clark <nick@ccl4.org> wrote:
> On Thu, Jan 12, 2012 at 12:53:03PM +0100, demerphq wrote:
>> On 11 January 2012 16:18, Nicholas Clark <nick@ccl4.org> wrote:
>> >> FWIW, I got pretty far with pushing \u\U\l\L\E into the regex engine,
>> >> but it ran into some issues. We probably want to let the toker handle
>> >> \Q \E, as otherwise (?{ ... }) gets really tricky. However in order to
>> >> support \Q "properly" the toker must know about the \U, \L and \E's as
>> >> well.
>> >
>> > \Q in the tokeniser, \L etc elsewhere, and both sharing \E troubles me.
>> > This may be unfounded on my part. I'd also hate for qr// and "" to
>> > diverge further - ie do double quoted strings need a similar imposition
>> > of sanity?
>>
>> I think if we are going to change it we are going to have to let the
>> regex engine handle \Q as well. I think Dave's recent work (see
>> comment in this thread) may make things easier.
>
> Yes. In that I think conceptually \Q belongs with \U and \L (and \F) as
> they all fight over \E
>
> What's bugging me somewhat is that if \Q, \U, \L (and \F) (and \l and \u)
> move to be implemented in the regex engine, then that seems to be code
> duplication, as the same escapes have to be implemented in the tokeniser
> for "".

This is already the case for pretty much all escapes. They have to be
implemented in both because even if the toker DOES handle them, one
can synthesize the escape anyway, and then feed it to the regex
engine.  So for instance even when the toker handled \x{} the regex
engine also had to handle it. I think that it makes sense to include
the casemod escapes as well.

This should work just fine:

   my $esc="\\";
   my $hex="DF";
   print "yay" if "\x{DF}" =~ /${esc}x{$hex}/;

> Which is making me wonder, instead, whether it works to move \X and \N
> (back) to the tokeniser for regular expressions, provided that the (internal)
> regular expression API has become flexible enough (not that it takes a
> list of things, not a simple scalar) to be passed "literal string"
> (which would effectively be used for \X, \N and I guess as it would now
> exist, the guts of \Q...\E)

I do not think it does, mostly for the same reason I mention above.

> But, was there one reason or two reasons that \X and \N moved from the
> tokeniser? One IIRC was to stop qr/\x{2E}/ being the same as qr/./

Afaik this was the reason. Except it wasn't a matter of /moving/ \x{}
to the regex engine, it was a matter of *removing* it from the toker for
regexes. The engine already had to handle it, again because of the
pattern synthesis example.

> But was there a second related to some values for \N{} mapping to multiple
> code points, which can't be expressed well in a string passed from toke.c
> to regcomp.c ?

I think you have it sort of backwards. Moving \N{} handling completely
out of the toker for regexes meant trouble whenever a pattern using an
\N{} with a lexically scoped definition was concatenated with something
else in a different scope where the \N{} had no meaning. I believe we
went with the strategy that the toker would essentially convert the
\N{} with a lexically scoped name into a special construct that had a
constant value. Then the regex engine would "see" the converted value,
and later on, when it was concatenated, do the right thing.

>
>> >> Here are the rules:
>> >>
>> >> \U and \L case-modify non-casemod text until the end of the string or
>> >> the next relevant encountered \E, if there is already an unterminated
>> >> \L or \U in effect then the new \U or \L will end any still in effect
>> >> casemodifiers (note: this is not a typo \Q does not end any previous
>> >> \Q \U or \L, but \U and \L do end any previous \Q).
>> >>
>> >> $ perl -le'print "\U[one]\Q[TWO]\L[THREE]"'
>> >> [ONE]\[TWO\][three]
>> >> $ perl -le'print "\Q[one]\U[TWO]\L[THREE]"'
>> >> \[one\]\[TWO\]\[three\]
>> >
>> > I think you misdescribed just the parenthesised note, as your text
>> > contradicts your second example. \U and \L terminate any previous \U or \L,
>> > acting as an implicit \E at that point.
>>
>> I don't think the second example contradicts my parenthesised note.
>> The point was that the \L does terminate the \U but does not terminate
>> the \Q, because the \Q is *in front* of the \U. However the first
>> example shows how the \L terminates both the \U and \Q because the \Q
>> is after the \U.
>>
>> I was trying to say that \U and \L terminate any previous \U or \L
>> should they exist and when they do also terminates *anything* that
>> came after the previous \L or \U.
>
> Yes, that's what I missed. I guess this means that your phrasing was
> ambiguous (given that I was able to mis-interpret it).

Indeed. Speaking clearly about this stuff is hard. Describing madness
rationally is difficult :-).

> Hence maybe change it to:
>
>  (note: this is not a typo \Q does not end any previous \Q \U or \L, but
>  \U and \L do end any previous \Q if they are also ending a previous \L
>  or \U)
>
> although my wording feels clunky to me.

This is an improvement tho. Thanks.

> That also feels somewhat like an implementation bug. Purposefully without
> looking at the code, that sounds like \U or \L save the current state,
> and *if* \U or \L is active, such that the next \U or \L needs to act as
> an implicit \E, then it restores state, where that "state" includes whether
> \Q is active. (Instead of just restoring the state of casefolding)

Having looked at the code I can say that your description is pretty
spot on. I might quibble as to what sort of bug it was; I don't think
it was an oversight, more like a deliberate side-effect of the chosen
implementation. Consider what /\Ux\LY/ should produce:

  uc("x") . lc("Y");

If we put a \Q in between /\Ux\Q \LY/ we get this:

  uc("x" . quotemeta(" ")) . lc("Y");

If we don't terminate the quotemeta then we would end up with this:

  uc("x" . quotemeta(" ". lc("Y")))

which would not lc() the "Y" (because of the uc).

> This just doesn't seem sane. Either \U, \L (and \F) should work orthogonally
> to \Q, or any of the three (four) should terminate the others.

I was thinking of trying to make it so that we would produce:

  uc("x". quotemeta(" ")) . lc(quotemeta("Y"))

Effectively, that would require an \E to finish a \Q.
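
To make the three readings above concrete, here is a small sketch that just evaluates the function compositions directly. It does not go through the toker at all; the variable names are mine and purely illustrative.

```perl
use strict;
use warnings;

# Current behaviour: the \L ends both the \U and the \Q that came after it.
my $current  = uc("x" . quotemeta(" ")) . lc("Y");

# If the \L did not terminate the \Q, the lc() would end up inside the uc(),
# so the "Y" would stay upper case.
my $unclosed = uc("x" . quotemeta(" " . lc("Y")));

# The change I was thinking of: \Q stays in effect until an explicit \E.
my $proposed = uc("x" . quotemeta(" ")) . lc(quotemeta("Y"));

print "$current\n";    # X\ y
print "$unclosed\n";   # X\ Y
print "$proposed\n";   # X\ y
```

For this input the first and third results are the same string. They only differ when the text after the \L contains a metacharacter: with "Y." instead of "Y" the third form would give "y\." where the first gives "y.".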

BTW, this reminds me about the whole warning on useless use of \E. If
we are going to warn about that should we not also warn about useless
use of \l and \u? For instance in /\Ufoo \lbar/ the \l is useless, we
will evaluate something like this:

  uc("foo " . lcfirst("bar"))

So if we /are/ going to warn on \E then we probably should warn on
useless use of \l and \u too.
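
A quick way to see that the \l contributes nothing there, again written as plain function calls rather than escapes (a sketch, variable names are illustrative only):

```perl
use strict;
use warnings;

# What /\Ufoo \lbar/ evaluates to today, per the mapping above.
my $with_l    = uc("foo " . lcfirst("bar"));

# The same thing with the \l simply left out.
my $without_l = uc("foo " . "bar");

print $with_l eq $without_l ? "same: $with_l\n" : "different\n";   # same: FOO BAR
```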

>> >> \l and \u case-modify the next non-casemod text in the string, or
>> >> nothing if there is no non-casemod text in between it and the next \U
>> >> \Q \L or \E. Any preceding \L or \U take precedence, except in the
>> >> case where the \l or \u immediately follow an \L or \U in which case
>> >> the \l or \u take precedence.
>> >> $ perl -le'print "\lFOO"'
>> >> fOO
>> >> $ perl -le'print "\ufoo"'
>> >> Foo
>> >> $ perl -le'print "\L\ufoo \ubar"'
>> >> Foo bar
>> >> $ perl -le'print "\U\lfoo \lfoo"'
>> >> fOO FOO
>> >
>> > That \L or \U take precedence is, um, strange, counter intuitive and less
>> > useful than the other way round would be.
>>
>> And to make it more fun I'll repeat what I said to Tom C:
>>
>> \U\l and \L\u are both special cased to be parsed as \l\U and \u\L.
>
> That then makes a lot more sense, coupled with the "explain how it maps
> to uc(), lcfirst() etc".

Yes indeed. But it is also a special case.
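
As I read it, the quoted "\U\lfoo \lfoo" example maps to function calls like this (a sketch of the mapping, not of what the toker literally generates):

```perl
use strict;
use warnings;

# "\U\lfoo" is treated as "\l\Ufoo": upper-case first, then lower the first char.
my $first  = lcfirst(uc("foo"));

# The second \l is not directly after the \U, so it sits inside the uc()
# and the enclosing uc() undoes it.
my $second = uc(" " . lcfirst("foo"));

print $first . $second, "\n";   # fOO FOO
```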

I could imagine different rules where this would not have to be a
special case. Like the rules that would produce this:

/\Ufoo \lbar/    => upper("foo ") . lcfirst(upper("bar"))
/\Ux\Q \LY/      => upper("x" . quotemeta(" ")) . lower(quotemeta("Y"))
/\Ux\Q \E\LY/    => upper("x" . quotemeta(" ")) . lower("Y")
/\Ux\Q \lP\E\LY/ => upper("x" . quotemeta(" ")) .
                    lcfirst(upper(quotemeta("P"))) . lower(quotemeta("Y"))

Maybe something like this:

Case modifiers are divided into three groups:

\Q    => "non-case" modifier: quotemeta()
\U \L => "inner" case modifiers: uc(), lc()
\u \l => "outer" case modifiers: ucfirst(), lcfirst()

At any one time, zero or one modifier from each group may be active.
If more than one is active at the same time, they are applied in the
order 'non-case', 'inner', 'outer' [ that is:
lcfirst(uc(quotemeta($x))) ]
\Q \U \L are terminated by a nesting \E.
\U and \L override each other, but stack.
\Q stacks with \U and \L.
\l and \u override each other and do not stack.

So for instance:

"\Ux\Q \LX\E y\lzx\E z\Et"

Would evaluate to:

  upper("x") . upper(quotemeta(" ")) . lower(quotemeta("X"))
  . upper(quotemeta(" y")) . lcfirst(upper(quotemeta("zx")))
  . upper(" z") . "t"

Something like that anyway.
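
For what it is worth, here is a toy evaluator for one possible reading of those rules. To be clear, apply_proposed() is a made-up helper for illustration; it is not how toke.c or regcomp.c work, it only understands \Q \U \L \u \l \E, and it assumes that \l and \u apply to the next run of plain text and are then forgotten.

```perl
use strict;
use warnings;

# Hypothetical helper: evaluate a string holding literal \Q \U \L \u \l \E
# markers under the proposed (not the current) rules.
sub apply_proposed {
    my ($src) = @_;
    my @stack;          # active \Q, \U and \L, in nesting order
    my $outer = '';     # pending one-shot \u or \l (does not stack)
    my $out   = '';

    # Split the input into escape markers and runs of plain text.
    for my $tok ($src =~ /\\[QULulE]|(?:(?!\\[QULulE]).)+/gs) {
        if ($tok =~ /^\\([QUL])$/) { push @stack, $1; next }
        if ($tok =~ /^\\([ul])$/)  { $outer = $1;     next }
        if ($tok eq '\\E')         { pop @stack;      next }

        my $text = $tok;

        # 'non-case' first: quotemeta if any \Q is still open.
        $text = quotemeta($text) if grep { $_ eq 'Q' } @stack;

        # 'inner' next: the innermost open \U or \L overrides outer ones.
        my ($inner) = grep { $_ ne 'Q' } reverse @stack;
        $text = uc($text) if defined $inner && $inner eq 'U';
        $text = lc($text) if defined $inner && $inner eq 'L';

        # 'outer' last: a pending \u or \l applies to this run only.
        $text = $outer eq 'u' ? ucfirst($text) : lcfirst($text) if $outer;
        $outer = '';

        $out .= $text;
    }
    return $out;
}

# Single quotes, so the escapes reach the helper untouched.
print apply_proposed('\Ux\Q \LX\E y\lzx\E z\Et'), "\n";   # X\ x\ YzX Zt
```

The same helper gives "FOO bAR" for '\Ufoo \lbar' and "X\ y" for '\Ux\Q \LY', which agrees with the table further up.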

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
