
Re: \Questions about the \Future of \Escapes

From: Nicholas Clark
Date: January 13, 2012 10:31
Subject: Re: \Questions about the \Future of \Escapes
Message ID: 20120113183125.GV9069@plum.flirble.org

On Thu, Jan 12, 2012 at 12:53:03PM +0100, demerphq wrote:
> On 11 January 2012 16:18, Nicholas Clark <nick@ccl4.org> wrote:
> >> FWIW, I got pretty far with pushing \u\U\l\L\E into the regex engine,
> >> but it ran into some issues. We probably want to let the toker handle
> >> \Q \E, as otherwise (?{ ... }) gets really tricky. However in order to
> >> support \Q "properly" the toker must know about the \U, \L and \E's as
> >> well.
> >
> > \Q in the tokeniser, \L etc elsewhere, and both sharing \E troubles me.
> > This may be unfounded on my part. I'd also hate for qr// and "" to
> > diverge further - ie do double quoted strings need a similar imposition
> > of sanity?
> 
> I think if we are going to change it we are going to have to let the
> regex engine handle \Q as well. I think Dave's recent work (see
> comment in this thread) may make things easier.

Yes, in that I think conceptually \Q belongs with \U and \L (and \F), as
they all fight over \E.

What's bugging me somewhat is that if \Q, \U, \L (and \F) (and \l and \u)
move to be implemented in the regex engine, then that seems to be code
duplication, as the same escapes have to be implemented in the tokeniser
for "".

Which is making me wonder, instead, whether it works to move \X and \N
(back) to the tokeniser for regular expressions, provided that the (internal)
regular expression API has become flexible enough (now that it takes a
list of things, not a simple scalar) to be passed a "literal string"
(which would effectively be used for \X, \N and, I guess, as it would now
exist, the guts of \Q...\E).

But was there one reason or two reasons that \X and \N moved from the
tokeniser? One, IIRC, was to stop qr/\x{2E}/ being the same as qr/./
But was there a second, related to some values for \N{} mapping to multiple
code points, which can't be expressed well in a string passed from toke.c
to regcomp.c?

> >> Here are the rules:
> >>
> >> \U and \L case-modify non-casemod text until the end of the string or
> >> the next relevant encountered \E, if there is already an unterminated
> >> \L or \U in effect then the new \U or \L will end any still in effect
> >> casemodifiers (note: this is not a typo \Q does not end any previous
> >> \Q \U or \L, but \U and \L do end any previous \Q).
> >>
> >> $ perl -le'print "\U[one]\Q[TWO]\L[THREE]"'
> >> [ONE]\[TWO\][three]
> >> $ perl -le'print "\Q[one]\U[TWO]\L[THREE]"'
> >> \[one\]\[TWO\]\[three\]
> >
> > I think you misdescribed just the parenthesised note, as your text
> > contradicts your second example. \U and \L terminate any previous \U or \L,
> > acting as an implicit \E at that point.
> 
> I don't think the second example contradicts my parenthesised note.
> The point was that the \L does terminate the \U but does not terminate
> the \Q, because the \Q is *in front* of the \U. However the first
> example shows how the \L terminates both the \U and \Q because the \Q
> is after the \U.
> 
> I was trying to say that \U and \L terminate any previous \U or \L
> should they exist, and when they do, they also terminate *anything* that
> came after the previous \L or \U.

Yes, that's what I missed. I guess this means that your phrasing was
ambiguous (given that I was able to mis-interpret it). Hence maybe change
it to:

  (note: this is not a typo: \Q does not end any previous \Q, \U or \L, but
  \U and \L do end any previous \Q if they are also ending a previous \L
  or \U)

although my wording feels clunky to me.

That also feels somewhat like an implementation bug. Purposefully without
looking at the code, that sounds like \U or \L save the current state,
and *if* \U or \L is active, such that the next \U or \L needs to act as
an implicit \E, then it restores state, where that "state" includes whether
\Q is active (instead of just restoring the state of casefolding).

This just doesn't seem sane. Either \U, \L (and \F) should work orthogonally
to \Q, or any of the three (four) should terminate the others.

> >> \l and \u case-modify the next non-casemod text in the string, or
> >> nothing if there is no non-casemod text in between it and the next \U
> >> \Q \L or \E. Any preceding \L or \U take precedence, except in the
> >> case where the \l or \u immediately follow an \L or \U in which case
> >> the \l or \u take precedence.
> >> $ perl -le'print "\lFOO"'
> >> fOO
> >> $ perl -le'print "\ufoo"'
> >> Foo
> >> $ perl -le'print "\L\ufoo \ubar"'
> >> Foo bar
> >> $ perl -le'print "\U\lfoo \lfoo"'
> >> fOO FOO
> >
> > That \L or \U take precedence is, um, strange, counter intuitive and less
> > useful than the other way round would be.
> 
> And to make it more fun I'll repeat what I said to Tom C:
> 
> \U\l and \L\u are both special cased to be parsed as  \l\U and \u\L.

That then makes a lot more sense, coupled with the "explain how it maps
to uc(), lcfirst() etc".

Nicholas Clark
