perl.perl5.porters | Postings from November 2008

Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

karl williamson
November 13, 2008 22:03
Tom Christiansen wrote:
> Replying to Chip Salzenberg's message of "Wed, 12 Nov 2008 18:18:57 PST"
> and to Karl Williamson's of "Thu, 13 Nov 2008 11:38:48 MST":
>  *  There exist in octal character notation both implementation bugs as
>     well as built-in, by-design bugs, particularly when used in regular
>     expressions.
>  *  A few of these we've brought on ourselves, because we relaxed the
>     octal-char definition in ways that the designers of these things
>     never did, and so some of our troubles with them are our own fault.
>  *  The implementation bugs we can fix, if we're careful and consistent,
>     but design bugs we cannot.
>  *  Nor can we eliminate the notation altogether, due to the existing
>     massive code base that relies upon it.
>  *  The best we can do is generate, under certain circumstances, 
>     a warning related to an ambiguous \XXX being interpreted as 
>     either a backreference or a character.
> That's probably as far as many people may care to read, and that's fine.
> However, I do provide new info below that comes straight from the horse's
> mouth about the historical ambiguity--and I mean those horses once stabled
> at Murray Hill, not at JPL.
> First, what came before:
> :rafael> I don't think it's worth changing the meaning of \400 in
> :rafael> double quoted strings, or making it warn. However, in
> :rafael> regexps, it's too dangerously inconsistent and should be
> :rafael> deprecated. First, a deprecation warning seems in order.
> :rafael> However, I see some value in still allowing [\000-\377]
> :rafael> character ranges, for example. Do we really want to
> :rafael> deprecate that as well? This doesn't seem necessary.
> :yves>> Consider /\1/ means the first capture buffer of the previous
> :yves>> match, \17 means the _seventeenth_ capture buffer of the
> :yves>> previous match IFF the previous match contains 17 or
> :yves>> more capture buffers, otherwise it means \x{F}.
> :yves>> In short: resolving the inconsistencies in octal notation in
> :yves>> regex notation would appear to be impossible.
> :rafael> Error messages are a mess, too. This one is correct:
> :rafael>     $ perl -wE '/\8/'
> :rafael>     Reference to nonexistent group in regex; marked by <-- HERE in m/\8
> :rafael>     <-- HERE / at -e line 1.
> :rafael> This one shows clearly that we're using a regexp that matches
> :rafael> "\x{1}8", but why is there a duplicated warning? Double magic?
> :rafael>     $ perl -wE '/\18/'
> :rafael>     Illegal octal digit '8' ignored at -e line 1.
> :rafael>     Illegal octal digit '8' ignored at -e line 1.
> And also:
> In-Reply-To: Chip's of "Wed, 12 Nov 2008 18:18:57 PST."
>              <>
> glenn>>> The [below] items could be added to the language immediately,
> glenn>>> during the deprecation cycle for \nnn octal notation [...]
> tchrist>> I find the notion of rendering illegal the existing octal
> tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea, a position I
> tchrist>> am prepared to defend at laborious length--and, if necessary,
> tchrist>> appeal to the Decider-in-Chief [...]
> chip> I am happy to mark my return to p5p by singing in harmony with
> chip> Tom C.
> chip> Perl's octal escapes are of venerable origin, coming as they do
> chip> from C -- not the newfangled ANSI and ISO dialects, let alone
> chip> Bjarne's heresy, but the earliest and purest syntax, which sprang
> chip> fully-formed from Ken's, Brian's and Dennis's foreheads.  Breaking
> chip> octal escapes would piss off lots of people, and break lots of
> chip> code, for no sufficiently valuable purpose.
> I'm at USENIX right now, and while Ken and Dennis aren't here, Andrew Hume
> *is*.  Andrew long worked in the fabled research group there at
> Murray Hill, along with Brian and Rob and the rest of that seminal crew who
> charted much of this out.  Andrew wrote the first Plan9 grep program,
> gre(1), which was interesting because it internally broke up the pattern
> into nice DFA parts and unnice backtracking parts and attacked them
> separately. Rob and Ken later wrote purely DFA versions (no backtracking,
> no backreferencing) when they added UTF-8 support.
> So absent Ken, Andrew is probably the next best to ask this of, as he
> is particularly well-versed with regexes in all aspects: historical,
> current, standardized, etc.  It's he whom we refer to in the Camel's
> pattern-matching section when we write in the footnote:
>     It has been said(*) that programs that write programs 
>     are the happiest programs in the world.
> 	* By Andrew Hume, the famous Unix philosopher.
> I've just come from speaking with Andrew about all this, to whom
> I posed these questions:
>  1.  *Why* did Ken (et alios) select the same notation for 
>      backreferences in regex as for octally notated characters?
>  2.  *How* did you guys cope with its inherent ambiguity?
> Andrew said that they coped with the ambiguity mostly by always requiring
> exactly 3 octal digits for characters.  You couldn't say \33 for ESC; you
> had to say \033. If someone wanted the third capture buffer, they wrote
> \3; if they wanted ETX, they wrote \003; \3 and \003 therefore *meant*
> *different* *things* in regexes, and the first was disallowed in strings.
> [snip]
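
The three-digit rule Andrew describes can be sketched in C. This is only an illustration, not anyone's actual implementation: the names `escape`, `esc_kind`, and `classify_escape` are hypothetical, and the handling of a two-digit form such as \33 (treated here as backreference 3 followed by a literal '3') is an assumption that the description above does not settle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a backslash escape inside a regex. */
typedef enum { ESC_OTHER, ESC_CHAR, ESC_BACKREF } esc_kind;

typedef struct {
    esc_kind kind;
    int value;      /* character code, or capture group number */
    size_t length;  /* bytes consumed, including the backslash */
} escape;

static int is_octal(char c) { return c >= '0' && c <= '7'; }

/* Exactly three octal digits name a character; otherwise a single
 * digit 1-9 names a capture group.  s must be NUL-terminated. */
escape classify_escape(const char *s)
{
    escape e = { ESC_OTHER, 0, 0 };
    assert(s != NULL);
    if (s[0] != '\\')
        return e;
    if (is_octal(s[1]) && is_octal(s[2]) && is_octal(s[3])) {
        e.kind = ESC_CHAR;
        e.value = (s[1] - '0') * 64 + (s[2] - '0') * 8 + (s[3] - '0');
        e.length = 4;
    } else if (s[1] >= '1' && s[1] <= '9') {
        /* Assumption: any shorter form is a one-digit backreference. */
        e.kind = ESC_BACKREF;
        e.value = s[1] - '0';
        e.length = 2;
    }
    return e;
}
```

Under this sketch `classify_escape("\\003")` yields the character ETX and `classify_escape("\\3")` yields capture group 3, so the two spellings never collide. Note that "\\400" classifies as a character with value 256, which does not fit in one byte.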

As a point of reference, the C standard has always allowed 1, 2, or 3
octal digits in an octal character escape (and, asymmetrically, as many
as you want for hex).  But, as an old C programmer, I just wouldn't think
of specifying one without exactly 3 digits.  I was somewhat surprised,
when I was researching acceptable Perl syntax, to discover that a leading
0 was not required, and now I discover that it wasn't required in C all
along either.  (Although I learned C before it was standardized, and
standardization changed the language.  A leading 0 may have been
required before then.  I now regret throwing my first edition K&R away
when the standard came out.)  The only character I would likely have used
that could be expressed in one octal digit would be BEL, and I would
automatically express it as \007.  (I think \a came along later, though I
may just not have been aware of it.)  So C programmers will likely use 3
digits for octal character constants, for what that's worth.
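
For anyone who wants to check the C side of this, here is a small sketch using only standard escape rules. The function name `check_octal_escapes` is made up for illustration, and the last check assumes an ASCII execution character set.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper; each assert states one rule about C escapes. */
void check_octal_escapes(void)
{
    /* One, two, or three octal digits all spell the same character. */
    assert('\7' == '\07');
    assert('\07' == '\007');

    /* \33 and \033 are both ESC (decimal 27). */
    assert("\33"[0] == 27 && "\033"[0] == 27);

    /* An octal escape takes at most three digits: BEL, then a literal '8'. */
    assert(strlen("\0078") == 2);
    assert("\0078"[1] == '8');

    /* '8' is not an octal digit, so "\18" is character 1, then '8'. */
    assert(strlen("\18") == 2);

    /* Assumption: ASCII, where \a is the same BEL as \007. */
    assert('\a' == '\007');
}
```

Hex escapes have no such three-digit cap, which is the asymmetry mentioned above: a hex escape keeps consuming hex digits for as long as they appear.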
