develooper Front page | perl.perl5.porters | Postings from November 2008

Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

Glenn Linderman
November 12, 2008 18:34
On approximately 11/12/2008 5:43 PM, came the following characters from 
the keyboard of Tom Christiansen:
> Glenn Linderman wrote:

> I find the notion of rendering illegal the existing octal syntax of "\33"
> is an *EXTRAÖRDINARILY* bad idea, a position I am prepared to defend at
> laborious length--and, if necessary, appeal to the Decider-in-Chief, who's
> always done everything possible to *NOT* break others' code without *VERY*
> *STRONG* reason.  I submit that that very high bar has *NOT* been met; far
> from it.  I'm rather hoping I shan't have to do any of that, but I certainly
> shall if I must.

Sure, I figured someone would say that.  It might as well be you :)

> There's no reason at all to delete it: because regexes have \g{1} now, and
> strings need never be written "\333" if you mean "\33" . "3".

That argument is specious; it is exactly the same as me saying that you 
don't need to write "\333" if you mean "\x{1b}3".
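
For illustration, a minimal Perl sketch of the three spellings being compared. The variable names are hypothetical and chosen only for this example; it shows that "\33" . "3" and "\x{1b}3" build the same two-character string, while "\333" is a single character.

```perl
use strict;
use warnings;

# "\333" is a single character: octal 333 is decimal 219.
my $one_char = "\333";

# Two ways to write ESC (octal 33, hex 1b) followed by a literal "3".
my $concat    = "\33" . "3";
my $hex_brace = "\x{1b}3";

printf "one_char:  length %d, ord %d\n", length($one_char), ord($one_char);
printf "concat:    length %d\n", length($concat);
print  "concat and hex_brace are the same string\n" if $concat eq $hex_brace;
print  "neither is the same as one_char\n"          if $concat ne $one_char;
```

Either unambiguous spelling works; which one reads better is a matter of taste, which is the point being argued.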

My understanding is that in a regex with 3 capture groups, "\333" 
might be more ambiguous than you are assuming.

> There is GREAT reason *not* to delete it, as the quantity of code you would
> see casually rendered illegal is incomprehensibly large, with the work
> involved in updating code, databases, config files, and educating
> programmers and users incalculably great.  To add insult to injury, this 
> work you would see thrust upon others, not taken on yourself.

Yep, that's a great reason.

> There is nothing fundamentally broken here, as there was for $*. This is
> trying to create a language where it is impossible to "think bad thoughts".
> One cannot succeed at that.

So you wish to convert people to using \g{3} but if \333 is not 
outlawed, it is still ambiguous.

> | I personally see no value in octal notation now that Unicode uses hex,
>     ^^^^^^^^^^                                                 ^^^^^^^^
> Good to see the prefatory warning that this is your *personal* view. :-)
>                                                                vvvvvvvv

Yep.  I was rather sure that someone would bring up octal notation used 
for Unix file permission bits, where it is somewhat helpful in reading 
the bits.  But the -rwxrwxrwx notation is better anyway.
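
To put the two permission notations side by side, here is a small sketch. The helper mode_to_rwx is hypothetical (it is not a core Perl function) and renders only the low nine permission bits, without the leading file-type character that ls prints.

```perl
use strict;
use warnings;

# Hypothetical helper: render the low nine permission bits as rwxrwxrwx.
sub mode_to_rwx {
    my ($mode) = @_;
    my @flags = qw(r w x);
    my $out   = '';
    # Walk from the owner-read bit (bit 8) down to the other-execute bit (bit 0).
    for my $bit (reverse 0 .. 8) {
        $out .= ($mode & (1 << $bit)) ? $flags[(8 - $bit) % 3] : '-';
    }
    return $out;
}

printf "%04o => %s\n", 0755, mode_to_rwx(0755);
printf "%04o => %s\n", 0640, mode_to_rwx(0640);
```

Each octal digit maps onto one rwx triple, which is the readability argument usually made for octal here.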

> As for "Unicode using hex", me, I've always thought of it as using bits.
> Rather, I think of the various standards specifying code points in the
> U+XXXXXX notation to mean code point at that hexadecimal number.  Not
> the same thing at all.  

Indeed, it does mean that, but I fail to see the distinction that you 
laboriously coded.  The number is the number regardless of the notation; 
however, the documentation for the number is in hex, so that form is 
much easier to find and use.  Unlike the ASCII chart, Unicode charts are 
generally not produced in octal or decimal, only in hex.

> | Another approach would be to change the escape from \nnn to
> | \o{nnnnn...} [···] The {} provide explicit delimiters, so octal
> | numbers could then achieve parity with hex in the range of numbers
> | available. If people think octal is still worth supporting, this looks
> | like a better syntax to support it wholeheartedly.
> That's not needed, unless you really want to promote octal for 
> Unicode strings.  

Really, the only thing Unicode has to do with this is the fact that it 
inspired Perl to support characters with ord > 255.  Once that support 
is there (and it is), it need not be used for Unicode characters, and 
can, in fact, be used for binary number sequences, and there exists code 
that uses it just that way, and if the coder of such code were enamored 
of octal, they might prefer to use octal notation to express the values 
of the binary numbers greater than 511 (as well as those that are 
smaller than or equal to 511).

> In a pattern, \g{1} now handles the situation
> you're talking about.  For DQ-strings, one can always avoid it.

Indeed, one can always avoid the ambiguous notation.  Using \g{n} for 
pattern matches, and \x{n} for characters.  Note the lack of octal 
notation.  Even if the existing octal notation is left intact, perhaps 
it should be documented as "not preferred" so that people get the habit 
of using unambiguous notations.  On the other hand, providing an 
additional notation that is unambiguously octal seems like it could be 
useful, if there really are people out there that like octal.
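
A sketch of the ambiguity, under my reading of the documented rule that a multi-digit \N in a pattern is taken as a backreference only when the pattern already has at least N capture groups, and as octal otherwise. It uses \10 rather than \333 so that the switch can be shown with a pattern of manageable size; the variable names are hypothetical.

```perl
use strict;
use warnings;

# No capture groups: \10 is read as octal 010, the backspace character.
my $octal_read = ("\x{08}" =~ /^\10$/) ? 1 : 0;

# Ten capture groups: the same \10 is read as a backreference to group 10.
my $backref_read =
    ("abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10$/) ? 1 : 0;

# The unambiguous spellings mean the same thing regardless of group count.
my $explicit_backref =
    ("abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\g{10}$/) ? 1 : 0;
my $explicit_char =
    ("abcdefghij\x{08}" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\x{08}$/) ? 1 : 0;

print "octal reading:      $octal_read\n";
print "backref reading:    $backref_read\n";
print "explicit \\g{10}:    $explicit_backref\n";
print "explicit \\x{08}:    $explicit_char\n";
```

With ten groups in the pattern there is no way to ask for the backspace character using \10, which is the situation the braced forms avoid.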

> Type "man ascii"; note that the table given first is octal.

So what?  Google  Ascii chart  and the first hit is which 
gives decimal and hex before octal.

> | Python 3.0 has moved to 0onnnnn for its octal integers (zero oh digit-
> | sequence) after concluding that leading zeros alone are just too
> | problematical, so the "o" indicator has a precedent (albeit recent) in
> | addition to reasonably intuitively meaning octal to anyone that
> | understands the hexadecimal notation and has ever heard of octal. The
> | 0o syntax could also be added to Perl integer constants outside of
> | strings/regices.
> My only trouble with the 0o notation is on fonts without cross 0's,
> and its gratuitous superfluousness.

You can pick your font, others can pick theirs, so that seems to be 
irrelevant.  The 0o notation is somewhat superfluous, and was only 
suggested to be consistent with the two forms of hex notation \x{} and 
0x if \o{} were to be invented.  It seems to be a more consistent 
proposal if 0o is included, than if not.

To summarize:

1) There is a real problem in regex notation to unambiguously interpret 
\n as octal or backreference.  Certain octal numbers cannot be 
expressed, depending on the number of backreferences in the regex.

2) A suggestion to add \o{} notation would permit octal numbers to be 
unambiguously specified in regex notation regardless of the number of 
backreferences, over the full existing range of supported octal numbers, 
and would also permit extending the range.  The 0o notation seems useful 
for consistency with the existing hex notations if the \o{} notation is 
added.

3) Given 2, it becomes possible to deprecate and eventually remove \n as 
octal notation, either in regex or also in strings; it becomes possible 
to remove 0n as octal from numeric constants.

Point 2 would help address point 1, providing an alternate octal 
notation that isn't ambiguous with backrefs.  Point 3 would eliminate 
the ambiguity by making it an error.  Point 3 may break existing code, 

I'd be just as happy removing octal notation from everywhere except oct 
and %o, but I make these other suggestions because I figure some people 
still use, and want to use, octal notation.
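
For reference, a brief sketch of what oct and %o already provide independently of any literal or escape syntax. The variable names are hypothetical.

```perl
use strict;
use warnings;

# oct() turns a string of octal digits into a number.
my $perm = oct("755");

# sprintf's %o turns a number back into a string of octal digits.
my $text = sprintf("%o", $perm);

# oct() also understands a 0x prefix, so it can read hex strings too.
my $from_hex = oct("0x1b");

print "oct(\"755\")   = $perm\n";
print "sprintf %o   = $text\n";
print "oct(\"0x1b\")  = $from_hex\n";
```

Both work on strings and numbers at run time, so they would be unaffected by any change to how octal is written in source code.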

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
