Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

From: demerphq
Date: November 14, 2008 15:23
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 9b18b3110811141523y2e47219fofc9e1aa61646e89b@mail.gmail.com

2008/11/14 Tom Christiansen <tchrist@perl.com>:
> Replying to Chip Salzenberg's message of "Wed, 12 Nov 2008 18:18:57 PST"
> and to Karl Williamson's of "Thu, 13 Nov 2008 11:38:48 MST":
>
> SUMMARY:
>
>  *  There exist in octal character notation both implementation bugs as
>    well as built-in, by-design bugs, particularly when used in regular
>    expressions.
>
>  *  A few of these we've brought on ourselves, because we relaxed the
>    octal-char definition in ways that the designers of these things
>    never did, and so some of our troubles with them are our own fault.
>
>  *  The implementation bugs we can fix, if we're careful and consistent,
>    but design bugs we cannot.
>
>  *  Nor can we eliminate the notation altogether, due to the existing
>    massive code base that relies upon it.

Yes, this is absolutely clear (now). I misspoke when I suggested this.

>  *  The best we can do is generate, under certain circumstances,
>    a warning related to an ambiguous \XXX being interpreted as
>    either a backreference or a character.

As you have said, \g{} makes these much less important.

>
> That's probably as far as many people may care to read, and that's fine.
>
> However, I do provide new info below that comes straight from the horse's
> mouth about the historical ambiguity--and I mean those horses once stabled
> at Murray Hill, not at JPL.
>
> First, what came before:
>
> :rafael> I don't think it's worth changing the meaning of \400 in
> :rafael> double quoted strings, or making it warn. However, in
> :rafael> regexps, it's too dangerously inconsistent and should be
> :rafael> deprecated. First, a deprecation warning seems in order.
>
> :rafael> However, I see some value in still allowing [\000-\377]
> :rafael> character ranges, for example. Do we really want to
> :rafael> deprecate that as well? This doesn't seem necessary.
>
> :yves>> Consider /\1/ means the first capture buffer of the previous
> :yves>> match, \17 means the _seventeenth_ capture buffer of the
> :yves>> previous match IFF the previous match contains 17 or
> :yves>> more capture buffers, otherwise it means \x{F}.

I misspoke here too. Backrefs are to captures in the current pattern
not the previous.

Meaning that the real danger arises when one concatenates patterns or
programmatically manipulates them, which wasn't a problem for grep or
editors or whatnot, but is a problem when regexes become integrated
into the language as tightly as they are in Perl.
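
A minimal sketch of that hazard, assuming the documented rule that \10
and above count as backreferences only when the pattern has at least
that many capture groups. The fragment names ($tail, $head) are made up
for illustration:

```perl
use strict;
use warnings;

# Hypothetical fragment: on its own, \10 is octal 010, i.e. chr(8).
my $tail  = '\10';
my $alone = ("\x08" =~ /\A$tail\z/) ? 1 : 0;

# Hypothetical prefix containing ten capture groups.
my $head   = '(a)' x 10;
my $joined = qr/\A$head$tail\z/;

my $ten_as    = "a" x 10;
my $eleven_as = "a" x 11;

# After concatenation, \10 refers to the tenth group instead of chr(8).
my $as_char    = ("$ten_as\x08" =~ $joined) ? 1 : 0;
my $as_backref = ($eleven_as    =~ $joined) ? 1 : 0;

print "alone=$alone as_char=$as_char as_backref=$as_backref\n";
```

The same text in $tail changes meaning purely because of what was
glued in front of it.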

[snip]
> :rafael> This one shows clearly that we're using a regexp that matches
> :rafael> "\x{1}8", but why is there a duplicated warning? Double magic?
>
> :rafael>     $ perl -wE '/\18/'
> :rafael>     Illegal octal digit '8' ignored at -e line 1.
> :rafael>     Illegal octal digit '8' ignored at -e line 1.

The double warning comes because Perl does two passes, and if the
pattern was /\18\x{100}/ maybe even three.

And each time we try to grok_oct() on the same sequence and so
generate the same warning. Anyway, it's a bug that needs to be fixed.
Sigh.

> And also:
>
> In-Reply-To: Chip's of "Wed, 12 Nov 2008 18:18:57 PST."
>             <20081113021857.GJ2062@tytlal.topaz.cx>
>
> glenn>>> The [below] items could be added to the language immediately,
> glenn>>> during the deprecation cycle for \nnn octal notation [...]
>
> tchrist>> I find the notion of rendering illegal the existing octal
> tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea, a position I
> tchrist>> am prepared to defend at laborious length--and, if necessary,
> tchrist>> appeal to the Decider-in-Chief [...]
>
> chip> I am happy to mark my return to p5p by singing in harmony with
> chip> Tom C.
>
> chip> Perl's octal escapes are of venerable origin, coming as they do
> chip> from C -- not the newfangled ANSI and ISO dialects, let alone
> chip> Bjarne's heresy, but the earliest and purest syntax, which sprang
> chip> fully-formed from Ken's, Brian's and Dennis's foreheads.  Breaking
> chip> octal escapes would piss off lots of people, and break lots of
> chip> code, for no sufficiently valuable purpose.

Don't worry, both of you. Just pointing out how much could break
snapped some sense into my head. Mea culpa and all that.

> I'm at USENIX right now, and while Ken and Dennis aren't here, Andrew Hume
> *is*.  Andrew long worked in the fabled research group there at
> Murray Hill, along with Brian and Rob and the rest of that seminal crew who
> charted much of this out.  Andrew wrote the first Plan9 grep program,
> gre(1), which was interesting because it internally broke up the pattern
> into nice DFA parts and unnice backtracking parts and attacked them
> separately. Rob and Ken later wrote purely DFA versions (no backtracking,
> no backreferencing) when they added UTF-8 support.

I'll have to take a look at gre, as it sounds like it is right along
the lines of what we need. Afaiu we can't go to full DFA construction
in perl, at least not for every pattern, simply because our patterns
support recursive constructs, which afaik cannot be represented as
DFAs.

> So absent Ken, Andrew is probably the next best to ask this of, as he
> is particularly well-versed with regexes in all aspects: historical,
> current, standardized, etc.  It's he whom we refer to in the Camel's
> pattern-matching section when we write in the footnote:
>
>    It has been said(*) that programs that write programs
>    are the happiest programs in the world.
>
>        * By Andrew Hume, the famous Unix philosopher.

It's interesting you quote that, as it's primarily when programs are
writing other programs that the octal/backref problem occurs. There is
no ambiguity in octal/backrefs in static patterns: a given escape
sequence is either one or the other. But when you concatenate two
patterns together....

[snip]
> So that hole was deemed small enough, and also infrequent and unlikely
> (at least in non-program-generated programs) that Ken&Co. just lived
> with it, preferring clarity and brevity (simple to read and write) over a
> more complex yet bullet-proof notation.
>
> Andrew said, sure, it's a bit messy, or untidy, but if you're looking for
> pristine perfection, you're looking for the wrong thing.  Or something
> like that.

Especially in Perl. :-)

> The only exception to this was \0, which saw frequent enough use that making
> folks always specify \000 to mean NUL was deemed unduly onerous.  Also,
> the original pattern-matchers didn't handle nulls, plus some of them
> treated \0 as "the whole match", much as we now use (?0) to recurse on
> the whole pattern.
>
> One last thing: Andrew, upon being told about the TRIE regex optimization,
> suggests we might look into splay trees for this instead.  He thinks they
> have properties that might make them even faster/smaller, but says we'd
> have to benchmark the two carefully, because it was just an informed hunch.

Hmm, maybe it's worth researching that a bit. The trie logic could
definitely be improved. We use compressed transition tables when we
probably shouldn't, making each transition significantly more
expensive than it should be -- mostly because of the concern that
Unicode could make the number of transitions grow explosively
large.

> Now Henry isn't here, so I can't ask him about the source of his that Larry
> long ago started out from.  Important aspects of that include that Henry
> admitted only \1 .. \9 for backrefs *AND* how the 3-digit octal-character
> backslash escapes shall have already been processed by the time the regex
> compiler has to think about things.  That means it didn't have to think
> about both. This is somewhat how \U is handled during variable
> interpolation, not by the regex compiler.
>
> Some of the Spencerian sources and derivatives are available at
>
>    http://arglist.com/regex/
>
> Some can be quite educative.
>
> One thing I found especially amusing was this change log comment:
>
>    Fix for a serious bug that affected REs using many [] (including
>    REG_ICASE REs because of the way they are implemented), *sometimes*,
>    depending on memory-allocation patterns.
>
> Sound familiar, anybody :-)  [HINT: think of /(\337)\1/i ]

I'm probably too stupid to get this one. Feel up to spelling it out to
me offlist?

> You can look up more on the history of regexes, from Ken's original
> 1968 paper to Rob and Ken's 1992 specking out of UTF-8, at:
>
>    http://swtch.com/~rsc/regexp/
>
> Historical sources of interest here include
>
>    Ken's original paper to CACM, 4 dense pages:
>        http://doi.acm.org/10.1145/363347.363387
>
>    Ken's UTF-8 version of grep, w/o backtracking:
>        http://swtch.com/usr/local/plan9/src/cmd/grep/
>
>    Rob's regexp (no backtracking) library that handles UTF-8:
>        http://swtch.com/plan9port/unix/
>      Its section 3 manpage:
>        http://swtch.com/plan9port/unix/man/regexp93.html
>      Its section 7 manpage:
>        http://swtch.com/plan9port/unix/man/regexp97.html
>      Its code:
>        http://swtch.com/plan9port/unix/libregexp9.tgz
>
>    Rob's paper on Structured Regular Expressions:
>        http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf
>
>    Rob's "sam" editor
>        http://netlib.bell-labs.com/sys/doc/sam/sam.html
>
>    Code to implement Perl's regexp rules:
>        http://swtch.com/~rsc/regexp/nfa-perl.y.txt

Sigh. So much to learn. So little time. The latter sounds interesting,
I haven't looked, but I wonder how it handles recursive patterns.

[snip]
> tchrist>>> And in case it doesn't, the output below will convince anyone
> tchrist>>> that we ***CANNOT*** remove \0ctal notation.  Larry would never
> tchrist>>> allow you to break so many people's code.  It would be the worst
> tchrist>>> thing Perl has ever done to its users. It verges upon the
> tchrist>>> insane.
>
> yves>> First please separate what Glenn said from what Rafael and I said,
> yves>> which is that it might be a good idea to deprecate octal IN REGULAR
> yves>> EXPRESSIONS.

I apologise for the shouting.

[snip]
> yves>> Obviously from a back compat viewpoint we can't actually
> yves>> remove octal completely FROM THE REGEX ENGINE. At the very
> yves>> least there is a large amount of code that either generates
> yves>> octal sequences or contains them IN REGULAR EXPRESSIONS.
>
> You say "obviously", and I think it obvious, too, but either Glenn
> did not or was not arguing in good faith, only secretly
> playing devil's advocate.  That's far too complicated for me.

Well, at the time I made the suggestion (about the regex engine) that
we do so (in the regex engine) I was not thinking clearly. Again I
apologize.

[snip]
> yves>> But we sure can say in the docs that "it is recommended that
> yves>> you do not use octal in regular expressions in new code as it
> yves>> is ambiguous as to how they will be interpreted, especially
> yves>> low value octal (excepting \0) can easily be mistaken for a
> yves>> backreference".
>
> It seems that we got into trouble by allowing one- and two-digit
> octal character escapes.  This is not something that the
> original designers (Ken; Dennis and Brian; Rob) ever did, and
> thereby circumvented much of our trouble.
>
> Perhaps what should happen is that we should encourage 3-digit octal
> notation only.

At this point tho the main advantage of using octal at all, and the
reason it is used in many places that I have seen, seems to be brevity.
So encouraging people to use 3 digits is not really much of a gain, as
it means there is no compelling reason to use octal instead of \xHH.

I do think that Glenn did have at least one good point in his mail: I
think he was right when he suggested that not too many of the "newer
generation" are comfortable with octal, outside perhaps *nix sysadmins
who seem to absorb it from chmod and related tools.

> tchrist>> Grepping for \\\d *ONLY* in the indented code segments of
> tchrist>> the standard pods:
>
> yves>> Oh cmon! You of all people must know a whole whack of ways to
> yves>> count them.

I was just mad because your mail was truncated by gmail, whereas a
count along with a few selected items would have made the same point,
been shorter, and I would have known for sure that I saw the full
content of your mail. TBH I have no idea if there was any commentary
after the list. If there was, I never saw it.

[snip]
> I was perfectly aware it was a reference.  I didn't dump the data
> on you dumbly.  I could have summarized it, described trends, but
> this doesn't have the impact of seeing the raw data, which is
> what I was aiming for to bat down the crazy idea of forcing
> uncountably many broken programs.  Having to change my code due
> to a Perl upgrade thrice in 21 years is nothing like what Glenn
> feigned contemplating.

Understood, but hopefully you see my point that I mention above in
this reply as well. Yes I could probably use a different mail client.
But well, it seems that there is something about mail programs that
makes them particularly hateful, and gmail seems to mostly be the
least painful option I have encountered so far. At least for my needs.

> yves>> Personally I dislike ambiguous syntax
>
> As do I.  Larry is actually a lot more comfortable with it than
> I am, because he realizes due to his work with natural language that
> humans are good with ambiguity and that one can, if one is clever enough,
> use surrounding clues to figure out what was meant.

Yes, and it's one of the cool things about Perl in my book (along with
the amazingly well integrated regex features ;-)

> yves>> and think it should in general be avoided, and that maybe we
> yves>> should do something to make it easier to see when there is
> yves>> ambiguous syntax.
>
> That seems pretty reasonable, too.

Yeah, I think that's where this is heading: some kind of regex lint.

> yves>> And I especially dislike ambiguous syntax that can be made to
> yves>> change meaning by action at a distance. If I concatenate a
> yves>> pattern that contains an octal sequence to a pattern that
> yves>> contains a bunch of capture buffers the meaning of the "octal"
> yves>> changes. That is bad.
>
> Yes, it is bad, but there are worse problems.  You can't do it in a
> general and useful way at all, because which capture buffer means what is
> going to renumber.  The new \g{-1} helps a good bit here, as does
> \g{BUFNAME}, but it's still a sticky problem requiring more overall
> knowledge than you'd like it to require.

Just for the record: the \g{} syntax was added to make it possible to
safely use backrefs in generated patterns by eliminating the ambiguity
of the old syntax, and to normalize the various capture buffer
syntaxes implemented in other languages. The .NET, Python, and Java
syntaxes all were/are different (although those implementations that
use PCRE now support \g{} too :-), and despite Perl 5.10 supporting
them all, the \g{} thing seemed a good idea. The relative backref
syntax was specifically added to make it easier to construct patterns
that use backrefs. \g{BUFNAME} actually doesn't help much, although I
recall Abigail had some thoughts on how to make it more powerful.
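
As a rough illustration of why the relative form helps in generated
patterns (a sketch only, needing Perl 5.10 or later; the fragment names
are invented for the example):

```perl
use strict;
use warnings;

# Two hypothetical fragments that both mean "a word char, doubled"
# when used on their own.
my $frag_abs = '(\w)\1';       # absolute backreference
my $frag_rel = '(\w)\g{-1}';   # relative: the closest preceding group

# Prepending another capture group renumbers the groups.
my $prefix = '(x)';
my $re_abs = qr/\A$prefix$frag_abs\z/;
my $re_rel = qr/\A$prefix$frag_rel\z/;

# The absolute fragment now refers to (x); the relative one still
# refers to its own (\w).
my $abs_xaa = ("xaa" =~ $re_abs) ? 1 : 0;
my $abs_xax = ("xax" =~ $re_abs) ? 1 : 0;
my $rel_xaa = ("xaa" =~ $re_rel) ? 1 : 0;
my $rel_xax = ("xax" =~ $re_rel) ? 1 : 0;

print "abs: xaa=$abs_xaa xax=$abs_xax  rel: xaa=$rel_xaa xax=$rel_xax\n";
```

The relative fragment keeps its meaning wherever it is spliced in; the
absolute one silently starts matching something else.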

> yves>> Assuming that grok_oct() consumes at most 3 octal digits, I think
> yves>> we can apply Karl's patch. However I do think we should recommend
> yves>> against using octal IN REGULAR EXPRESSIONS. And should note that
> yves>> while you CAN use octal to represent codepoints up to 511 it is
> yves>> strongly recommended that you don't.
>
> I'd like to see three-digit octal always mean an 8-bit character, and
> discourage things like \3 and \33.  I don't think we should bother
> extending octal to allow for code points above "\377". That it "works"
> at all there is a problem.

Well if we don't allow it then we have to forbid it. I think at this
point allowing it is the least worst option.

>
> I included the older code because you'll see a pattern in it.
> For example:
>
>    scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/,
>    scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ )  {
>    scripts/badman: if (/\001/) {
>    scripts/badman: if ($last eq "\033") {
>    scripts/badman: last if $idx_topic eq "\004";
>    scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
>    scripts/badman: s/\033\+/\001/;
>    scripts/badman: s/\033\,/\002/;
>    scripts/badman: @tmplist = split(/\002/, $entry);
>    scripts/badman: $winsize = "\0" x 8;
>
> Notice something?  Being once-and-always a C programmer at heart, by habit
> I always used to use a single digit for a NUL *only* and 3 digits otherwise.
> The Camel's use of
>
>    camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }
>
> is just something I do *not* like.  It does no good to warn in
> strings, where there are no backrefs (save in s///), but I'm not
> sure what level of warning is appropriate for regexes.
>
> Speaking of which, isn't it time to tell the s/(...)/\1\1/g people
> to get their acts together?

Probably. Do you have a list? :-)

> yves>> Also I have a concern that Karl's patch merely modifies the
> yves>> behaviour in the regular expression engine. It doesn't do the
> yves>> same for other strings. If it is going to be legal it should
> yves>> be legal everywhere.
>
> Yes, this is a real issue, the first one I raised.

Well Karl suggested it is legal, and iiuir does the right thing (iow
causing the string to be upgraded).

> karl> grok_oct() itself consumes as many octal digits as there are
> karl> in its parameter, as long as the result doesn't overflow a UV.
> karl> It is used for general purpose octal conversion, such as from
> karl> the oct() function.
>
> Hm.
>
> The oct() function is already bizarre enough.  It's weird that it
> takes bits of any sort, converts them to decimal, then treats that
> decimal as octal, and returns a string of new digits.  Even calling
> it dec2oct might have helped.
>
>    oct("0755")
>    oct("755")
>
> ok, but
>
>    oct(755)
>    oct(700 + 50 + 5)
>
> doing the same thing is, well, just not what people are thinking it does.

I guess this is just a side effect of numbers and strings being
effectively interchangeable.
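
A small sketch of what the four calls quoted above return. As far as I
know, oct() reads the string form of its argument as octal digits and
hands back the numeric value, so a numeric 755 is first stringified to
"755":

```perl
use strict;
use warnings;

# String and numeric arguments end up on the same path, because the
# number 755 stringifies to "755" before being read as octal.
my @values = (
    oct("0755"),
    oct("755"),
    oct(755),
    oct(700 + 50 + 5),
);

# Going the other way needs sprintf with %o.
my $back = sprintf("%o", $values[0]);

print "values=@values back=$back\n";
```

All four values come out as 493, and formatting 493 with %o gives "755"
again.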

> karl> Tom has pointed out that \777 is a reserved value in some
> karl> contexts.
>
> It's the $/ issue.

I don't understand, can you expand on that a bit?

> karl> It seems to me to be a bad idea to remove acceptance of octal
> karl> numbers in re's.
>
> And to me.  Remember, I'm the one who gets testy about
>
>    $h{date} = "foo";
>    $h{time} = "bar";
>
>    $h{time()} = "oh, right";

I had to think about this one.

>    $h{033} = "foo";
>    $h{27} .= "bar";   # now foobar
>
>    $h{"033"} = "not again";

Ah. You gotta love those bizarre passageways and secret doors in perl
dontcha? Just like one of those castles in a good horror flick.
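
Spelled out as a runnable sketch (the summary variable is just for
illustration): the unquoted 033 is an octal numeric literal, so it
becomes the key "27", while the quoted "033" stays a three-character
string key.

```perl
use strict;
use warnings;

my %h;
$h{033}   = "foo";         # numeric literal 033 == 27, so the key is "27"
$h{27}   .= "bar";         # same slot, now "foobar"
$h{"033"} = "not again";   # string key "033", a different slot

# Sorted summary of what actually ended up in the hash.
my $summary = join ", ", map { "$_ => $h{$_}" } sort keys %h;

print "$summary\n";
```

So the hash ends up with two keys, "033" and "27", not one.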

> karl> It seems like a good idea to add something to the language so
> karl> one can express them unambiguously.  Even I with my limited
> karl> knowledge of regcomp.c could do it easily (fools rush in...).
>
> I could live with \o{...} if I had to, but I'm nervous where it goes.

I don't see any problem with this really. Given we have \x{} I don't
really see the point, but if it made the octal folks happy then so be
it.

> karl> And it seems like an even better idea to handle them
> karl> consistently. I see two ways to do that 1) accept my patch; or
> karl> 2) forbid or warn about the use of those larger than a single
> karl> character in the machine architecture in both strings and
> karl> re's, including char classes.  Perhaps I've forgotten something in
> karl> this thread.  If so, I'm sorry.
>
> Karl, you've nothing to be sorry about.  Your courtesy,
> conscientiousness, and can-do attitude are very welcome.

Definitely. And mails like this are too. Very much so.

cheers,
Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
