
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

From: Tom Christiansen
Date: November 13, 2008 16:20
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 3526.1226621983@chthon
Replying to Chip Salzenberg's message of "Wed, 12 Nov 2008 18:18:57 PST"
and to Karl Williamson's of "Thu, 13 Nov 2008 11:38:48 MST":

SUMMARY: 

 *  There exist in octal character notation both implementation bugs and
    built-in, by-design bugs, particularly when used in regular
    expressions.

 *  A few of these we've brought on ourselves, because we relaxed the
    octal-char definition in ways that the designers of these things
    never did, and so some of our troubles with them are our own fault.

 *  The implementation bugs we can fix, if we're careful and consistent,
    but design bugs we cannot.

 *  Nor can we eliminate the notation altogether, due to the existing
    massive code base that relies upon it.

 *  The best we can do is generate, under certain circumstances, 
    a warning related to an ambiguous \XXX being interpreted as 
    either a backreference or a character.

That's probably as far as many people may care to read, and that's fine.

However, I do provide new info below that comes straight from the horse's
mouth about the historical ambiguity--and I mean those horses once stabled
at Murray Hill, not at JPL.

First, what came before:

:rafael> I don't think it's worth changing the meaning of \400 in
:rafael> double quoted strings, or making it warn. However, in
:rafael> regexps, it's too dangerously inconsistent and should be
:rafael> deprecated. First, a deprecation warning seems in order.

:rafael> However, I see some value in still allowing [\000-\377]
:rafael> character ranges, for example. Do we really want to
:rafael> deprecate that as well? This doesn't seem necessary.

:yves>> Consider /\1/ means the first capture buffer of the previous
:yves>> match, \17 means the _seventeenth_ capture buffer of the
:yves>> previous match IFF the previous match contains 17 or
:yves>> more capture buffers, otherwise it means \x{F}.

:yves>> In short: resolving the inconsistencies in octal notation in
:yves>> regex notation would appear to be impossible.

:rafael> Error messages are a mess, too. This one is correct:
:rafael>     $ perl -wE '/\8/'
:rafael>     Reference to nonexistent group in regex; marked by <-- HERE in m/\8
:rafael>     <-- HERE / at -e line 1.

:rafael> This one shows clearly that we're using a regexp that matches
:rafael> "\x{1}8", but why is there a duplicated warning? Double magic?

:rafael>     $ perl -wE '/\18/'
:rafael>     Illegal octal digit '8' ignored at -e line 1.
:rafael>     Illegal octal digit '8' ignored at -e line 1.

And also:

In-Reply-To: Chip's of "Wed, 12 Nov 2008 18:18:57 PST."
             <20081113021857.GJ2062@tytlal.topaz.cx>

glenn>>> The [below] items could be added to the language immediately,
glenn>>> during the deprecation cycle for \nnn octal notation [...]

tchrist>> I find the notion of rendering illegal the existing octal
tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea, a position I
tchrist>> am prepared to defend at laborious length--and, if necessary,
tchrist>> appeal to the Decider-in-Chief [...]

chip> I am happy to mark my return to p5p by singing in harmony with
chip> Tom C.

chip> Perl's octal escapes are of venerable origin, coming as they do
chip> from C -- not the newfangled ANSI and ISO dialects, let alone
chip> Bjarne's heresy, but the earliest and purest syntax, which sprang
chip> fully-formed from Ken's, Brian's and Dennis's foreheads.  Breaking
chip> octal escapes would piss off lots of people, and break lots of
chip> code, for no sufficiently valuable purpose.

I'm at USENIX right now, and while Ken and Dennis aren't here, Andrew Hume
*is*.  Andrew long worked in the fabled research group there at
Murray Hill, along with Brian and Rob and the rest of that seminal crew who
charted much of this out.  Andrew wrote the first Plan9 grep program,
gre(1), which was interesting because it internally broke up the pattern
into nice DFA parts and unnice backtracking parts and attacked them
separately. Rob and Ken later wrote purely DFA versions (no backtracking,
no backreferencing) when they added UTF-8 support.

So absent Ken, Andrew is probably the next best to ask this of, as he
is particularly well-versed with regexes in all aspects: historical,
current, standardized, etc.  It's he whom we refer to in the Camel's
pattern-matching section when we write in the footnote:

    It has been said(*) that programs that write programs 
    are the happiest programs in the world.

	* By Andrew Hume, the famous Unix philosopher.

I've just come from speaking with Andrew about all this, to whom
I posed these questions:

 1.  *Why* did Ken (et alios) select the same notation for 
     backreferences in regex as for octally notated characters?

 2.  *How* did you guys cope with its inherent ambiguity?

Andrew said that they coped with the ambiguity mostly by always requiring
exactly 3 octal digits for characters.  You couldn't say \33 for ESC; you
had to say \033.  If someone wanted the third capture buffer, they wrote
\3; if they wanted ETX, they wrote \003; \3 and \003 therefore *meant*
*different* *things* in regexes, and the first was disallowed in strings.
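
Here's a minimal sketch, in today's perl rather than the old Bell Labs
tools, of the convention Andrew describes; the strings and patterns are
just made-up illustrations:

    use strict;
    use warnings;

    # Three digits always means a character; the short form is left
    # to mean a backreference.
    print "backref\n" if "abcc"   =~ /(a)(b)(c)\3/;   # \3   = third capture buffer ("c")
    print "ETX\n"     if "ab\003" =~ /ab\003/;        # \003 = the literal ETX character

    # Under their rule, \33 was simply disallowed: you wrote \033 for
    # ESC, and \33 was an error, not "capture buffer 33 or ESC,
    # depending on how many groups the pattern happens to have".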

Andrew admits that this is not a perfect fix, as a theoretical hole
remains, but he asserts that in practice, forcing 3-digits for octal chars
covered *almost* all the real-world situations where ambiguity might in
practice raise its heisenhead. Although the early pattern-matchers only had
\1 .. \9 for captures, later ones dispensed with this restriction. But
still the 3-digit rule seemed safe enough.

So that hole was deemed small enough, and also infrequent and unlikely 
(at least in non-program-generated programs) that Ken&Co. just lived 
with it, preferring clarity and brevity (simple to read and write) over a 
more complex yet bullet-proof notation.

Andrew said, sure, it's a bit messy, or untidy, but if you're looking for
pristine perfection, you're looking for the wrong thing.  Or something
like that.  

The only exception to this was \0, which saw frequent enough use that making
folks always specify \000 to mean NUL was deemed unduly onerous.  Also,
the original pattern-matchers didn't handle nulls, plus some of them
treated \0 as "the whole match", much as we now use (?0) to recurse on
the whole pattern.  

One last thing: Andrew, upon being told about the TRIE regex optimization,
suggests we might look into splay trees for this instead.  He thinks they
have properties that might make them even faster/smaller, but says we'd
have to benchmark the two carefully, because it was just an informed hunch.

Now Henry isn't here, so I can't ask him about those sources of his that
Larry long ago started out from.  Important aspects of them include that
Henry admitted only \1 .. \9 for backrefs *AND* that the 3-digit
octal-character backslash escapes had already been processed by the time
the regex compiler had to think about things.  That meant it didn't have
to think about both.  This is somewhat like how \U is handled during
variable interpolation, not by the regex compiler.

Some of the Spencerian sources and derivatives are available at

    http://arglist.com/regex/

Some can be quite educative.

One thing I found especially amusing was this change log comment:

    Fix for a serious bug that affected REs using many [] (including
    REG_ICASE REs because of the way they are implemented), *sometimes*,
    depending on memory-allocation patterns.

Sound familiar, anybody? :-)  [HINT: think of /(\337)\1/i ]

You can look up more on the history of regexes, from Ken's original
1968 paper to Rob and Ken's 1992 speccing out of UTF-8, at:

    http://swtch.com/~rsc/regexp/

Historical sources of interest here include

    Ken's original paper to CACM, 4 dense pages:
	http://doi.acm.org/10.1145/363347.363387

    Ken's UTF-8 version of grep, w/o backtracking:
	http://swtch.com/usr/local/plan9/src/cmd/grep/

    Rob's regexp (no backtracking) library that handles UTF-8:
	http://swtch.com/plan9port/unix/
      Its section 3 manpage:
	http://swtch.com/plan9port/unix/man/regexp93.html
      Its section 7 manpage:
	http://swtch.com/plan9port/unix/man/regexp97.html
      Its code:
	http://swtch.com/plan9port/unix/libregexp9.tgz

    Rob's paper on Structured Regular Expressions:
	http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

    Rob's "sam" editor
	http://netlib.bell-labs.com/sys/doc/sam/sam.html

    Code to implement Perl's regexp rules:
	http://swtch.com/~rsc/regexp/nfa-perl.y.txt

One thing I found amusing in Rob's sam paper was:

    The regular expression code in sam is an interpreted, rather than
    compiled on-the-fly, implementation of Thompson's non-deterministic
    finite automaton algorithm.[12] The syntax and semantics of the
    expressions are as in the UNIX program egrep, including alternation,
    closures, character classes, and so on. The only changes in the
    notation are two additions: \n is translated to, and matches, a newline
    character, and @ matches any character. In egrep, the character .
    matches any character except newline, and in sam the same rule seemed
    safest, to prevent idioms like .* from spanning newlines.  Egrep
    expressions are arguably too complicated for an interactive editor --
    certainly it would make sense if all the special characters were two-
    character sequences, so that most of the punctuation characters
    wouldn't have peculiar meanings -- but for an interesting command
    language, full regular expressions are necessary, and egrep defines the
    full regular expression syntax for UNIX programs.  Also, it seemed
    superfluous to define a new syntax, since various UNIX programs (ed,
    egrep and vi) define too many already.

There's a bunch going on with standardization, widechars, utf-8, etc, right
now. If only UTF-8 had been around earlier ("What, 1992 isn't early
enough?"), a lot of trouble would have been averted.  That Perl settled on
UTF-8 internally early on was applauded by the Association's current
standards rep as clearly the right way to go.

It's really sad that the C std committee looks to be going to accept
Microsoft's char16 datatype for wide characters.  This locks you into
UCS-2/UTF-16, which means surrogates to get off the primary plane, and a
very long/bad recovery if you poke your head in the wrong place in the
stream.  This is going to cause problems for people.  Java has the
problem.  EXIF has the problem.

And now on to Karl's message.

karl> yves wrote:

yves>> 2008/11/13 Tom Christiansen <tchrist@perl.com>:

glenn>>>> My understanding is that in a regex, if you have 3 matches,
glenn>>>> that "\333" might be more ambiguous than you are assuming.

It could mean any of these (see the sketch just after this list):

    \g{3} followed by "33"
    ubyte 219: "@{ [pack C => 219] }"
    uchar 219: "@{ [pack U => 219] }"
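
Here's a quick sketch spelling those three readings out with unambiguous
notation (octal 333 is decimal 219).  It deliberately constructs each
candidate meaning rather than guessing which one the engine would pick
for a bare \333, since that guess is exactly the problem:

    use strict;
    use warnings;

    my $as_backref = qr/(a)(b)(c)\g{3}33/;   # capture buffer 3, then literal "33"
    my $as_byte    = pack "C", 219;          # the single octet 0xDB
    my $as_char    = pack "U", 219;          # the single code point U+00DB

    print "backref reading matches\n" if "abcc33" =~ $as_backref;
    printf "octet/code point is chr(%d)\n", ord $as_byte;   # 219 either way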

tchrist>>>>> There is GREAT reason *not* to delete it, as the quantity of
tchrist>>>>> code you would see casually rendered illegal is
tchrist>>>>> incomprehensibly large, with the work involved in updating
tchrist>>>>> code, databases, config files, and educating programmers and
tchrist>>>>> users incalculably great.  To add insult to injury, this work
tchrist>>>>> you would see thrust upon others, not taken on yourself.

glenn>>>> Yep, that's a great reason.

tchrist>>> I'm glad you agree; that easily suffices to shut off the rathole.

tchrist>>> And in case it doesn't, the output below will convince anyone
tchrist>>> that we ***CANNOT*** remove \0ctal notation.  Larry would never
tchrist>>> allow you to break so many people's code.  It would be the worst
tchrist>>> thing Perl has ever done to its users. It verges upon the
tchrist>>> insane.

yves>> First please separate what Glenn said from what Rafael and I said,
yves>> which is that it might be a good idea to deprecate octal IN REGULAR
yves>> EXPRESSIONS.

I believe I have now done as you have asked.  It's useful for more
than just accuracy of attribution, too.

yves>> I spoke perhaps more harshly than I meant originally, which
yves>> is what kicked this off. I should have said "strongly
yves>> discouraged" and not "deprecated".

It was indeed Glenn's suggestion that these first be deprecated 
and then in the release following, AND THEN REMOVED ALTOGETHER, 
that I found to be utterly untenable. 

His later messages seem to say that he was only flying a strawman to see
how far he could push it, testing the boundaries via hypotheticals.  If
so, that seems to say it wasn't an honest suggestion made in good faith,
just something there to "stir the bucket" (or the hornets' nest).  Perhaps
he finds this useful as a general principle; but here, I do not.

yves>> Obviously from a back compat viewpoint we can't actually
yves>> remove octal completely FROM THE REGEX ENGINE. At the very
yves>> least there is a large amount of code that either generates
yves>> octal sequences or contains them IN REGULAR EXPRESSIONS.

You say "obviously", and I think it obvious, too, but either Glenn
advocate did not or was not arguing in good faith, only secretly
playing devil's  advocate.  That's far too complicated for me.

I take what people say for what they mean and vice versa, without
attempting doublethink, triplethink, etc.  It's not my strength, and
it's a waste of time to try to figure out what people mean in case they
are intentionally saying things they DON'T mean without labelling
those statements as clearly of that nature.

I don't appreciate it, and that is the very most courteous way I 
can think of expressing a sentiment I have plenty of less courteous
words for.

yves>> But we sure can say in the docs that "it is recommended that
yves>> you do not use octal in regular expressions in new code as it
yves>> is ambiguous as to how they will be interpreted, especially
yves>> low value octal (excepting \0) can easily be mistaken for a
yves>> backreference".

It seems that we got into trouble by allowing one- and two-digit
octal character escapes.  This is not something that the original
designers (Ken; Dennis and Brian; Rob) ever did, and by that choice
they circumvented much of the trouble we're now having.

Perhaps what should happen is that we should encourage 3-digit octal
notation only.

tchrist>> Grepping for \\\d *ONLY* in the indented code segments of
tchrist>> the standard pods:

yves>> Oh cmon! You of all people must know a whole whack of ways to
yves>> count them. 

I did this because I'd taken Glenn at his literal word, and I
wanted everyone to see how extensive this use was.  I also wanted
to demonstrate the historical difference that seems to have
cropped up as we went from C programmers as our main programmer
base, to non-C-programmers.  This meant that we started to get 1-
and 2-digit octal escapes where we'd never before had them.

yves>> The list also is a bit off-topic* as very few of those are
yves>> actually in regular expressions, and amusingly the second
yves>> item in your list isn't octal. Illustrating the problem
yves>> nicely.

I was perfectly aware it was a reference.  I didn't dump the data
on you dumbly.  I could have summarized it, described trends, but
that wouldn't have the impact of seeing the raw data, which is
what I was aiming for in order to bat down the crazy idea of
breaking uncountably many programs.  Having to change my code due
to a Perl upgrade thrice in 21 years is nothing like what Glenn
feigned contemplating.

yves>> Personally I dislike ambiguous syntax 

As do I.  Larry is actually a lot more comfortable with it than 
I am, because he realizes due to his work with natural language that
humans are good with ambiguity and that one can, if one is clever enough,
use surrounding clues to figure out what was meant.

yves>> and think it should in general be avoided, and that maybe we
yves>> should do something to make it easier to see when there is
yves>> ambiguous syntax.

That seems pretty reasonable, too.

yves>> And I especially dislike ambiguous syntax that can be made to
yves>> change meaning by action at a distance. If I concatenate a
yves>> pattern that contains an octal sequence to a pattern that
yves>> contains a bunch of capture buffers the meaning of the "octal"
yves>> changes. That is bad.

Yes, it is bad, but there are worse problems.  You can't do it in a
general and useful way at all, because which capture buffer means what is
going to get renumbered.  The new \g{-1} helps a good bit here, as does
\g{BUFNAME}, but it's still a sticky problem requiring more overall
knowledge than you'd like it to require.
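
A minimal sketch of that action at a distance, with made-up fragments;
the relative and named forms keep the meaning local to the fragment
that wrote it:

    use strict;
    use warnings;

    my $left    = '(?:(a)(b))?';           # contributes two capture buffers
    my $fragile = '(x)\1';                 # author meant: backreference to (x)
    my $robust  = '(x)\g{-1}';             # the capture group just before it,
                                           #   however the combined pattern numbers it
    my $named   = '(?<want>x)\g{want}';    # immune to renumbering, by name

    # Glued after $left, the \1 in $fragile now refers to (a), not (x):
    print "fragile matches 'abxa' (!)\n" if "abxa" =~ /^$left$fragile$/;
    print "relative matches 'abxx'\n"    if "abxx" =~ /^$left$robust$/;
    print "named matches 'abxx'\n"       if "abxx" =~ /^$left$named$/;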

yves>> Assuming that grok_oct() consumes at most 3 octal digits, I think
yves>> we can apply Karls patch. However I do think we should recommend
yves>> against using octal IN REGULAR EXPRESSIONS. And should note that
yves>> while you CAN use octal to represent codepoints up to 511 it is
yves>> strongly recommended that you don't.

I'd like to see three-digit octal always mean an 8-bit character, and
discourage things like \3 and \33.  I don't think we should bother
extending octal to allow for code points above "\377". That it "works"
at all there is a problem.
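
In other words (a hedged sketch of the style I'm arguing for; the
variable names are mine):

    use strict;
    use warnings;

    my $esc  = "\033";       # ESC: all three octal digits, every time
    my $big  = "\x{100}";    # code point 256: hex, not the dubious \400
    my $same = chr 0x100;    # equivalent, and arguably clearer still

    print "agreed\n" if $big eq $same;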

I included the older code because you'll see a pattern in it.
For example:

    scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/,
    scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ )  {
    scripts/badman: if (/\001/) {
    scripts/badman: if ($last eq "\033") {
    scripts/badman: last if $idx_topic eq "\004";
    scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
    scripts/badman: s/\033\+/\001/;
    scripts/badman: s/\033\,/\002/;
    scripts/badman: @tmplist = split(/\002/, $entry);
    scripts/badman: $winsize = "\0" x 8;

Notice something?  Being once-and-always a C programmer at heart, by habit
I always used to use a single digit for a NUL *only* and 3 digits otherwise.
The Camel's use of

    camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }

is just something I do *not* like.  It does no good to warn in
strings, where there are no backrefs (save in s///), but I'm not
sure what level of warning is appropriate for regexes.  

Speaking of which, isn't it time to tell the s/(...)/\1\1/g people 
to get their acts together?
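
That is, in the replacement side of s/// the capture variable is what
belongs there; a \1 there merely earns a "\1 better written as $1"
complaint under warnings.  A tiny illustrative sketch:

    use strict;
    use warnings;

    my $s = "ab";
    (my $bad  = $s) =~ s/(\w)/\1\1/g;   # works, but warns: \1 better written as $1
    (my $good = $s) =~ s/(\w)/$1$1/g;   # same result, no complaint
    print "$good\n";                    # "aabb"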

yves>> Also I have a concern that Karls patch merely modifies the
yves>> behaviour in the regular expression engine. It doesn't do the
yves>> same for other strings. If it is going to be legal it should
yves>> be legal everywhere.

Yes, this is a real issue, the first one I raised.

karl> grok_oct() itself consumes as many octal digits as there are
karl> in its parameter, as long as the result doesn't overflow a UV.
karl> It is used for general purpose octal conversion, such as from
karl> the oct() function.

Hm.

The oct() function is already bizarre enough.  It's weird that it takes
whatever you hand it, stringifies it into decimal digits, treats those
digits as octal, and hands you back the resulting number.  Even calling
it dec2oct might have helped.

    oct("0755")
    oct("755")

ok, but 

    oct(755)
    oct(700 + 50 + 5)

doing the same thing is, well, just not what people think it does.
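
A small check of that surprise, with the values as perldoc -f oct
describes them:

    use feature 'say';

    say oct("0755");        # 493 -- string, leading zero allowed
    say oct("755");         # 493 -- string of digits read as octal
    say oct(755);           # 493 -- the NUMBER 755 is stringified, then read as octal
    say oct(700 + 50 + 5);  # 493 -- same thing again
    say 0755;               # 493 -- whereas a literal 0755 is octal at compile time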

karl> Tom has pointed out that \777 is a reserved value in some
karl> contexts.

It's the $/ issue.

karl> It seems to me to be a bad idea to remove acceptance of octal
karl> numbers in re's.

And to me.  Remember, I'm the one who gets testy about 

    $h{date} = "foo";
    $h{time} = "bar";

    $h{time()} = "oh, right";

    $h{033} = "foo";
    $h{27} .= "bar";   # now foobar

    $h{"033"} = "not again";

karl> It seems like a good idea to add something to the language so
karl> one can express them unambiguously.  Even I with my limited
karl> knowledge of regcomp.c could do it easily (fools rush in...).

I could live with \o{...} if I had to, but I'm nervous where it goes.
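
For the record, here is the shape of the idea as I understand the
proposal; none of this is implemented as of this writing, it is only a
sketch of what the brace form would buy us:

    # /\o{33}/     would always mean ESC (chr 27), never capture buffer 33
    # /\o{400}/    would always mean chr(0x100); no "\4 followed by 00" reading
    # "\o{777}"    in a string would be chr(511), with no three-digit truncation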

karl> And it seems like an even better idea to handle them
karl> consistently. I see two ways to do that 1) accept my patch; or
karl> 2) forbid or warn about the use of those larger than a single
karl> character in the machine architecture in both strings and
karl> re's, including char classes.  Perhaps I've forgotten something in
karl> this thread.  If so, I'm sorry.

Karl, you've nothing to be sorry about.  Your courtesy, 
conscientiousness, and can-do attitude are very welcome.

--tom
