develooper Front page | perl.perl5.porters | Postings from November 2008

Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

Glenn Linderman
November 14, 2008 20:52
On approximately 11/13/2008 4:19 PM, came the following characters from 
the keyboard of Tom Christiansen:

> :rafael> This one shows clearly that we're using a regexp that matches
> :rafael> "\x{1}8", but why is there a duplicated warning? Double magic?
> :rafael>     $ perl -wE '/\18/'
> :rafael>     Illegal octal digit '8' ignored at -e line 1.
> :rafael>     Illegal octal digit '8' ignored at -e line 1.

This one confuses me: clearly \18, in this regex, is not a backref, 
because there are no captures, and clearly 8 is not an octal digit, so 
by the 1, 2, or 3 octal digits rule, the 8 should be silently ignored, 
and the expression should be equivalent to /\x{1}8/ by my reading of the 

> I've just come from speaking with Andrew about all this, of whom 
> I posed the question:
>  1.  *Why* did Ken (et alios) select the same notation for 
>      backreferences in regex as for octally notated characters?
>  2.  *How* did you guys cope with its inherent ambiguity?
> Andrew said that they coped with the ambiguity mostly by always requiring
> exactly 3 octal digits for characters.  You couldn't say \33 for ESC; you
> had to say \033. If someone wanted the third capture buffer, they wrote
> \3; if they wanted ETX, they wrote \003; \3 and \003 therefore *meant*
> *different* *things* in regexes, and the first was disallowed in strings.

Apparently that was something that was done only in grep, et alia.  K&R 
clearly specifies one to three octal digits for an octal escape in a 
character or string constant.
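
To make the two conventions concrete in Perl terms, here is a minimal 
sketch (the variable names are mine, purely for illustration).  It 
assumes nothing beyond core Perl: double-quoted strings accept one, two, 
or three octal digits, while in a regex a single digit is a 
backreference and a leading zero forces the octal reading.

```perl
use strict;
use warnings;

# In double-quoted strings, one, two, or three octal digits all work.
my $esc_short = "\33";     # two digits
my $esc_long  = "\033";    # three digits
my $etx_short = "\3";      # one digit
my $etx_long  = "\003";    # three digits

print "ESC spellings agree\n" if $esc_short eq $esc_long;
print "ETX spellings agree\n" if $etx_short eq $etx_long;

# In a regex the two spellings of "3" part ways:
# \3 is the third capture group, \003 is the ETX character.
my $backref = "abcc"    =~ /(a)(b)(c)\3/   ? 1 : 0;
my $octal   = "abc\003" =~ /(a)(b)(c)\003/ ? 1 : 0;
my $cross   = "abc\003" =~ /(a)(b)(c)\3/   ? 1 : 0;

print "backref=$backref octal=$octal cross=$cross\n";
```

The last match fails because \3 there wants a second "c", not ETX.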

> glenn>>>> My understanding is that in a regex, if you have 3 matches,
> glenn>>>> that "\333" might be more ambiguous than you are assuming.
> It could mean
>     \g{3} followed by "33"
>     ubyte 219: "@{ [pack C => 219] }"
>     uchar 219: "@{ [pack U => 219] }"

> It was indeed Glenn's suggestion that these first be deprecated 
> and then in the release following, AND THEN REMOVED ALTOGETHER, 
> that I found to be utterly untenable. 
> His later messages seem to say that he was just trying to fly a strawman to
> see how far he could push it just to test the boundaries via hypotheticals.
> If so, that seems to say it wasn't an honest suggestion made in good faith,
> just something there to "stir the bucket" (or the hornets' nest).  Perhaps
> he finds this useful as a general principle; but here, I do not.

It was an attempt to see if there was an acceptable solution that could 
resolve the ambiguity by removing the offending syntax.  I didn't really 
think it would fly on its own, but because major versions do 
occasionally accept backward-incompatible changes, I thought there might 
be some chance, given an alternative syntax that is more consistent with 
the syntax for constants in other number bases.  But I did rather expect 
that the weight of existing ambiguous code would kill the idea of 
removing the \nnn syntax; I still have some hope that a new, more useful 
octal syntax might be made available, in addition to the limited, 
ambiguous one.  Then documentation could nudge people towards using the 
new syntax.
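
As a rough illustration of what unambiguous spellings look like with 
the syntax that already exists, here is a sketch of the three readings 
of "\333" quoted earlier, each written so that it can mean only one 
thing.  The variable names are hypothetical; \g{3} needs Perl 5.10 or 
later, and I use hex rather than octal for the character itself.

```perl
use strict;
use warnings;

# Reading 1: the third capture group followed by the literal text "33".
my $backref_then_33 = "abcc33" =~ /(a)(b)(c)\g{3}33/ ? 1 : 0;

# Reading 2: the single character 219 (octal 333), spelled in hex so
# that no count of capture groups can turn it into a backreference.
my $char_219 = "abc\x{DB}" =~ /(a)(b)(c)\x{DB}/ ? 1 : 0;

# In a double-quoted string there are no backreferences, so "\333"
# can only be the octal escape.
my $same_char = "\333" eq "\x{DB}" ? 1 : 0;

# The "ubyte" and "uchar" flavours compare equal as strings; they
# differ only in the internal UTF-8 flag.
my $as_byte = pack("C", 219);
my $as_char = pack("U", 219);
my $equal        = $as_byte eq $as_char ? 1 : 0;
my $flags_differ = ( utf8::is_utf8($as_char)
                     && !utf8::is_utf8($as_byte) ) ? 1 : 0;

print "backref+33=$backref_then_33 char219=$char_219 ",
      "same=$same_char equal=$equal flags_differ=$flags_differ\n";
```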

The fact that you have only been forced into 3 source changes by 
incompatible Perl changes may well indicate a canny intuition on your 
part as to the parts of the language that might change incompatibly, or 
perhaps contentment with a subset of the language that happens not to 
have changed, more than there not being such changes, as you imply in 
your arguments.  On the other hand, for the short few years that I've 
been following this list, it has been true that most of the incompatible 
changes I've seen accepted were in areas that were rather buggy, and 
that there was little way forward other than an incompatible change.

> I did this because I'd taken Glenn at his literal word, and I
> wanted everyone to see how extensive this use was.  I also wanted
> to demonstrate the historical difference that seems to have
> cropped up as we went from C programmers as our main programmer
> base, to non-C-programmers.  This meant that we started to get 1-
> and 2-digit octal escapes where we'd never before had them.

You're rather confused here; as stated above, C has accepted one-, two-, 
and three-digit octal escapes from the beginning.  Your quotes from 
Andrew regarding \0 and three-digit-only escapes are clearly referring 
to other programs, such as grep, and perhaps egrep and awk.  I'm not 
going to take the time to research which programs Andrew was referring 
to; if you didn't clarify that in your discussion with him, you can 
research it.  But the wording in K&R clearly permits varying-length 
octal constants.

> yves>> Assuming that grok_oct() consumes at most 3 octal digits, I think
> yves>> we can apply Karl's patch. However I do think we should recommend
> yves>> against using octal IN REGULAR EXPRESSIONS. And should note that
> yves>> while you CAN use octal to represent codepoints up to 511 it is
> yves>> strongly recommended that you don't.
> I'd like to see three-digit octal always mean an 8-bit character, and
> discourage things like \3 and \33.  I don't think we should bother
> extending octal to allow for code points above "\377". That it "works"
> at all there is a problem.
> I included the older code because you'll see a pattern in it.
> For example:
>     scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/,
>     scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ )  {
>     scripts/badman: if (/\001/) {
>     scripts/badman: if ($last eq "\033") {
>     scripts/badman: last if $idx_topic eq "\004";
>     scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
>     scripts/badman: s/\033\+/\001/;
>     scripts/badman: s/\033\,/\002/;
>     scripts/badman: @tmplist = split(/\002/, $entry);
>     scripts/badman: $winsize = "\0" x 8;
> Notice something?  Being once-and-always a C programmer at heart, by habit
> I always used to use a single digit for a NUL *only* and 3 digits otherwise.
> The Camel's use of
>     camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }
> is just something I do *not* like.  It does no good to warn in
> strings, where there are no backrefs (save in s///), but I'm not
> sure the level of warning appropriate for regexes.  

Again, you are seriously confusing things in your arguments; the C 
programmer was quite welcome to use 1, 2, or 3 octal digits.  So if you 
were, indeed, once-and-always a C programmer at heart, that alone 
wouldn't have convinced you to always use 3 digits.  So clearly you were 
also something else, besides a C programmer, if you picked up such a habit.

Now I learned C and Unix at approximately the same time (as did most of 
the older generation of C programmers), and it is true that the 
exactly-three-digit octal escape was used in other programs, although I 
couldn't name them now; grep was likely one of them.  I encountered the 
phenomenon sometime in the first few months, certainly.  So I'm not 
suggesting that you didn't acquire the habit of using 3-digit octal 
escapes (except for NUL) in the early days, and you may well persist in 
using it; only that it wasn't C that taught you that, and it wasn't Perl 
that taught you that, and the documented K&R C and Perl definitions for 
octal escapes are remarkably similar.

> Karl, you've nothing to be sorry about.  Your courtesy, 
> conscientiousness, and can-do attitude are very welcome.

Indeed; I think we all appreciate new blood working on the code and a 
willing spirit.

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
