
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

karl williamson
November 13, 2008 10:39
demerphq wrote:
> 2008/11/13 Tom Christiansen <>:
>>> My understanding is that in a regex, if you have 3 matches, that "\333"
>>> might be more ambiguous than you are assuming.
>>>> There is GREAT reason *not* to delete it, as the quantity of code you would
>>>> see casually rendered illegal is incomprehensibly large, with the work
>>>> involved in updating code, databases, config files, and educating
>>>> programmers and users incalculably great.  To add insult to injury, this
>>>> work you would see thrust upon others, not taken on yourself.
>>> Yep, that's a great reason.
>> I'm glad you agree, easily suffices to shut off the rathole.
>> And in case it doesn't, the output below will convince anyone
>> that we ***CANNOT*** remove \0ctal notation.  Larry would never
>> allow you to break so many people's code.  It would be the worst
>> thing Perl has ever done to its users.  It verges upon the insane.
> First please separate what Glenn said from what Rafael and I said,
> which is that it might be a good idea to deprecate octal IN REGULAR
> EXPRESSIONS.
> I spoke perhaps more harshly than I meant originally, which is what
> kicked this off. I should have said "strongly discouraged" and not
> "deprecated".
> Obviously from a back compat viewpoint we can't actually remove octal
> completely FROM THE REGEX ENGINE. At the very least there is a large
> amount of code that either generates octal sequences or contains them
> But we sure can say in the docs that "it is recommended that you do not
> use octal in regular expressions in new code as it is ambiguous as to
> how they will be interpreted, especially low value octal (excepting
> \0) can easily be mistaken for a backreference".
>> Grepping for \\\d *ONLY* in the indented code segments of the standard pods:
> Oh c'mon! You of all people must know a whole whack of ways to count
> them. You don't have to include them all in a mail. Gmail didn't even
> let me see the full list. The list also is a bit off-topic* as very
> few of those are actually in regular expressions, and amusingly the
> second item in your list isn't octal. Illustrating the problem nicely.
> Personally I dislike ambiguous syntax and think it should in general
> be avoided, and that maybe we should do something to make it easier to
> see when there is ambiguous syntax. And I especially dislike ambiguous
> syntax that can be made to change meaning by action at a distance. If
> I concatenate a pattern that contains an octal sequence to a pattern
> that contains a bunch of capture buffers the meaning of the "octal"
> changes. That is bad.
> Assuming that grok_oct() consumes at most 3 octal digits, I think we
> can apply Karl's patch. However I do think we should recommend against
> using octal IN REGULAR EXPRESSIONS. And should note that while you CAN
> use octal to represent codepoints up to 511 it is strongly recommended
> that you don't.
> Also I have a concern that Karl's patch merely modifies the behaviour
> in the regular expression engine. It doesn't do the same for other
> strings. If it is going to be legal it should be legal everywhere.
grok_oct() itself consumes as many octal digits as there are in its 
parameter, as long as the result doesn't overflow a UV.  It is used for 
general purpose octal conversion, such as from the oct() function.
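
As a rough illustration of that many-digit behaviour, oct() can be called directly. This is only a sketch: the variable names are made up for the example, and it exercises the Perl-level oct() function, not the C-level grok_oct() itself.

```perl
use strict;
use warnings;

# oct() converts the whole digit string; it is not limited to three digits.
my $three_digits = oct("400");       # 4 * 8**2
my $four_digits  = oct("1000");      # 1 * 8**3
my $seven_digits = oct("1234567");   # seven octal digits, still one number

printf "oct(400)     = %d\n", $three_digits;
printf "oct(1000)    = %d\n", $four_digits;
printf "oct(1234567) = %d\n", $seven_digits;
```

So any three-digit limit on a \NNN escape has to come from the code that calls the conversion, not from the conversion itself.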

My patch was to bring consistency to the handling of \400-\777.  Outside 
re's, putting them into a string variable will cause the string to be 
converted to utf8, and so they will be converted into two utf8 bytes as 
part of that string.  Similarly, using any of these octal values in an 
re charclass will cause the re to be converted to utf8, and will match 
the corresponding unicode code point.  But when values in this range 
appear in an re outside a charclass there is an inconsistency.  On an 8-bit 
character machine (if there aren't 256 or so parenthetical sub 
expressions in the re) they will match a two character sequence, but not 
the same utf8 sequence matched if they had instead appeared in a 
charclass.  I'm not sure what would happen on a 9-bit machine.  It might 
very well be what Glenn suggests, the corresponding 9 bits.
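
The three contexts described above can be compared with a short script. This is a sketch under assumptions: the variable names are invented for the example, and the last result is deliberately not stated, since it depends on whether the perl being run handles \400 outside a charclass the way the patch proposes.

```perl
use strict;
use warnings;

# In a string, "\400" is one character with code point 0400 (256 decimal).
my $str   = "\400";
my $chars = length $str;
my $code  = ord $str;

# Encoded as UTF-8, that one character occupies two bytes.
my $encoded = $str;
utf8::encode($encoded);
my $bytes = length $encoded;

# Inside a character class the escape matches that same code point.
my $in_class = (chr(0400) =~ /[\400]/) ? "match" : "no match";

# Outside a character class: the case the patch is about.
my $bare = (chr(0400) =~ /\400/) ? "match" : "no match";

print "characters: $chars, code point: $code, UTF-8 bytes: $bytes\n";
print "in charclass: $in_class\n";
print "outside charclass: $bare\n";
```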

Tom has pointed out that \777 is a reserved value in some contexts.

It seems to me to be a bad idea to remove acceptance of octal numbers in 

It seems like a good idea to add something to the language so one can 
express them unambiguously.  Even I with my limited knowledge of 
regcomp.c could do it easily (fools rush in...).
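
To make the ambiguity concrete, here is a small sketch of how the same escape changes meaning once enough capture groups precede it, next to spellings that already exist and are not ambiguous (\x{...} for a code point, \g{N} for a backreference). The variable names are made up for the example, and this is not the new syntax being proposed.

```perl
use strict;
use warnings;

# Ten capture groups, each matching a single "a".
my $ten = "(a)" x 10;

# Fewer than ten groups: \10 is read as octal 10, i.e. chr(8).
my $octal_alone = ("\x08" =~ /^\10\z/) ? 1 : 0;

# After ten groups the same \10 is a backreference to group 10.
my $backref     = (("a" x 11) =~ /^$ten\10\z/) ? 1 : 0;
my $octal_after = ((("a" x 10) . "\x08") =~ /^$ten\10\z/) ? 1 : 0;

# Unambiguous spellings keep their meaning regardless of the group count.
my $hex_after = ((("a" x 10) . "\x08") =~ /^$ten\x{8}\z/) ? 1 : 0;
my $g_after   = (("a" x 11) =~ /^$ten\g{10}\z/) ? 1 : 0;

print "\\10 alone matches chr(8):        $octal_alone\n";
print "\\10 after 10 groups is backref:  $backref\n";
print "\\10 after 10 groups is chr(8):   $octal_after\n";
print "\\x{8} after 10 groups is chr(8): $hex_after\n";
print "\\g{10} after 10 groups:          $g_after\n";
```

The second and third lines are the "action at a distance" case: nothing about the \10 itself changed, only the pattern it was appended to.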

And it seems like an even better idea to handle them consistently.  I 
see two ways to do that: 1) accept my patch; or 2) forbid or warn about 
the use of those larger than a single character in the machine 
architecture in both strings and re's, including char classes.

Perhaps I've forgotten something in this thread.  If so, I'm sorry.
