develooper Front page | perl.perl5.porters | Postings from May 2010

Re: RFC: Consolidated proposal for octals like \400 in strings. Was:PATCH [perl #59342] chr(0400)

From: karl williamson
Date: May 13, 2010 22:14
Subject: Re: RFC: Consolidated proposal for octals like \400 in strings. Was:PATCH [perl #59342] chr(0400)
Message ID: 4BECDBE4.5050006@khwilliamson.com

I opened this ticket and submitted a patch for it more than a year and a 
half ago.  It was my first patch for any open source project.  It 
stimulated a significant amount of sometimes almost vituperative 
discussion on p5p (none, as far as I can tell, directed against me). 
Yesterday, in re-reading the correspondence about this, I felt like a 
little kid whose off-hand comment throws his parents into a big fight, 
while he helplessly looks on, not fully understanding what's going on. 
If I hadn't been a professional software developer who had been through 
many such battles over the years (although being paid handsomely to do 
so), I can imagine that I would have been scared off from ever 
participating again.

The end result of the discussion was a different patch than what I had 
submitted, for 5.10.1, with the expectation that further work would 
happen for 5.12.  I did not get around to looking at it in 5.12, but am now.

I also am much more familiar with the Perl source code, or at least 
applicable corners of it, so can more fully understand the arguments 
than I did then.  So, I set out to finish up with this bug, thinking I 
knew what should be done.  But in re-examining things, I realized I'm 
not sure how to proceed.

I ended up undertaking an inventory of the current situation.  I 
examined the source code and ran experiments on blead to confirm. 
Here's what I found:  There were four areas mentioned in the discussion 
that use octal to specify characters, and I know of no others.  All of 
them accept 1-3 digits, and anything beyond three is considered to be 
part of a new token, as is anything in [^0-7].  I also looked at what 
happens if the first digit is not an octal digit, i.e., is an 8 or 9.

1) On the command line, -0ddd specifies $/, the record separator.  If 
ddd is above 0377 in octal, it sets the record separator to undef, which 
causes the whole file to be slurped at once.  Even though all values 
above 0377 have this behavior, only the single value 0777 is documented 
as doing so.  '-0' alone implies the NUL character as the record separator, which is 
documented.  Anything outside of [0-7] is considered the next option, so 
-08 is the same thing as -0 -8, and there is no legal -8 option, so an 
error is generated.
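
As a rough illustration of item 1, the sketch below runs the current perl ($^X) over a three-line temporary file with different -0 settings and counts the records read. The helper name count_records and the file contents are made up for this example, and it assumes a platform where list-form pipe opens work; a perl other than the one examined here may behave differently.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical three-line input file, created just for this example.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "a\nb\nc\n";
close $fh or die "close: $!";

# Run a child perl with the given -0 switch and return how many
# records it read (the value of $. once input is exhausted).
sub count_records {
    my ($switch) = @_;
    open my $pipe, '-|', $^X, $switch, '-ne', 'END { print $. }', $file
        or die "cannot run $^X: $!";
    my $count = <$pipe>;
    close $pipe;
    return $count;
}

# -012 is a newline separator; -0777 and -0400 are both above 0377.
my %records = map { $_ => count_records($_) } qw(-012 -0777 -0400);
print "$_ read $records{$_} record(s)\n" for sort keys %records;
```

If the behavior is as described above, -012 should report three records, and -0777 and -0400 should each report one, since both slurp the whole file.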

2) In a double-quoted string, "\ddd" will generate the character whose 
ordinal is ddd in octal.  If the value is above 0377, it will generate 
the corresponding Unicode character, and the string will be in utf8.  No 
warning is given.  [^0-7] starts a new character.  \8 or \9 is the same 
as the characters 8 or 9 (without the backslash), and an "Unrecognized 
escape \d passed through" message is given under warnings.  Only one set 
of tests in the test suite, t/uni/latin2.t, has a value above 0377.
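
A small sketch of item 2, assuming an ASCII platform (so that octal 101 is "A"). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $wide  = "\400";    # octal 400 = decimal 256, does not fit in a byte
my $split = "\1018";   # at most three digits are consumed: "\101", then "8"

# The string holding the wide character is flagged as utf8 internally.
printf "ord=%d utf8=%s\n", ord($wide), utf8::is_utf8($wide) ? "yes" : "no";
print  "split=$split\n";
```

The second string shows the three-digit limit: the fourth digit starts a new character rather than extending the octal number.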

3) In a substitution replacement, the behavior is similar to that of a 
double-quoted string: "\ddd" will generate the character whose ordinal is ddd 
in octal.  If the value is above 0377, it will generate the 
corresponding Unicode character, and the string will be in utf8.  No 
warning is given.  [^0-7] starts a new character.  However, single 
digits, like \1 are transformed into $1, for \1 to \9, with a warning 
generated.  The comments say that these are being deprecated, but the 
warning is not a deprecation warning.  This transformation is done 
without regard to how many capture groups were actually saved, so things 
like s/(foo)/\2/ will generate a "Use of uninitialized value $2 in 
substitution iterator" warning.
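
The \1 to $1 transformation in item 3 can be observed with something like the following sketch. It uses a string eval only so that the compile-time warning reaches a handler installed at run time; the variable names are made up for the example, and the exact warning text is whatever the running perl emits.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# String eval defers compilation to run time, so a compile-time
# warning about \1 is seen by the handler installed above.
my $result = eval q{
    my $text = "foo bar";
    $text =~ s/(foo)/<\1>/;   # \1 in the replacement is treated as $1
    $text;
};
die $@ if $@;

print "result: $result\n";
print "warning: $_" for @warnings;
```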

4) The final area is regex patterns, and there are actually several 
possibilities for this.

In a bracketed character class, it behaves similarly to what happens in 
a double quoted string: qr/[\ddd]/ will match the character whose 
ordinal is ddd in octal.  If the value is above 0377, it will match the 
corresponding Unicode character, and the pattern will be in utf8.  No 
warning is given.  [^0-7] starts a new character.  However, if there are 
fewer than three octal digits followed by an 8 or 9 (including [\8] or 
[\9]), an erroneous warning message is generated (twice!) that the 8 or 
9 is ignored.  This message is false: the 8 or 9 is considered a 
separate character to go in the class.  Using -Dr on the command line 
for /[\8]/ or /[\9]/ gives the wrong result; I used the debugger to 
verify things.  Devel::Peek is also wrong, saying the regex is 
"(?-xism:[\\8a])".  Until 5.12 it was not really documented that octal 
constants could be used in a character class.
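
A minimal sketch of the character class case, checking only what is matched and not the warnings or dump output discussed above. The variable names are illustrative.

```perl
use strict;
use warnings;

# A value above 0377 inside a class matches the corresponding code point.
my $wide_matches = chr(0400) =~ /\A[\400]\z/ ? 1 : 0;

# Fewer than three octal digits followed by an 8: "\1" is octal 1,
# and the 8 is a separate member of the class.
my $ctrl_a_in_set = chr(1) =~ /\A[\18]\z/ ? 1 : 0;
my $eight_in_set  = "8"    =~ /\A[\18]\z/ ? 1 : 0;

print "wide=$wide_matches ctrl_a=$ctrl_a_in_set eight=$eight_in_set\n";
```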

Outside a character class, a backslash followed by a single decimal 
digit [1-9] is considered a decimal backreference to a capture group. 
An error is generated if the number of groups is less than that digit. 
Otherwise, two or more decimal digits in a row (the first of which is 
not a 0) whose decimal value doesn't exceed the number of such groups is 
considered a decimal backreference to one of those groups.  Otherwise, 
the construct is parsed as an octal-specified character.  That is, if 
the first digit after the backslash is a 0, or if the entire run of 
decimal digits, interpreted as a decimal number, exceeds the number of 
capture groups, perl tries to interpret the number as an 
octal-specified character, looking at no more than the 3 characters 
following the backslash.  If one of those three characters is an 8 or 9, 
an erroneous warning message is given (twice!) that it is ignored, as in 
the character class case, and the dumped value appears incorrect if 
the first digit is an 8 or 9.  (I didn't look to see if 
the optimizer got rid of the initial NULL, so it might be the case that 
the dumped value is correct; but I doubt it.)  If \ddd exceeds 0377, the 
behavior differs depending on whether or not the pattern is UTF-8.  If 
it is UTF-8, the Unicode character corresponding to ddd interpreted 
octally is generated with no warning given.  If it isn't UTF-8, the 
native character set character corresponding to ddd modulo 256 is 
generated.  Starting in 5.10.1 the last case generates a 
deprecated-class warning message.
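
The backreference-versus-octal rules outside a character class can be sketched as follows, assuming an ASCII platform. Only the multi-digit and leading-zero cases are shown; the variable names are made up for the example.

```perl
use strict;
use warnings;

# One capture group, one digit: \1 is a backreference to group 1.
my $backref = "aa" =~ /\A(a)\1\z/ ? 1 : 0;

# Leading zero: always octal. \012 is a newline on ASCII platforms.
my $leading_zero = "\n" =~ /\A\012\z/ ? 1 : 0;

# Two digits but fewer than ten groups: \10 is octal 010 (backspace).
my $octal_10 = "x\cH" =~ /\Ax\10\z/ ? 1 : 0;

# Ten groups: the same \10 is now a backreference to group 10 ("j").
my $backref_10 =
    "abcdefghijj" =~ /\A(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\10\z/ ? 1 : 0;

print "backref=$backref zero=$leading_zero ",
      "octal_10=$octal_10 backref_10=$backref_10\n";
```

The last two patterns contain the identical escape \10, and its meaning changes purely with the number of capture groups, which is the ambiguity at issue.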

The behavior of taking the character modulo 256 is clearly wrong.  The 
patch that got applied in 5.10.1, and is still in 5.12, deprecates using 
a construct that would cause this, though the modulo is still done.  I 
eventually came up with a proposal that met mostly with approval: to 
extend the 'use legacy' pragma that we were intending to put into 5.12 
to handle this case.  But that has been overtaken by events and needs to 
be revisited, because 'use legacy' was killed before 5.12 shipped.

Let me summarize a little of the results of the discussion.  Until this 
email, no one had figured out how things really work currently, so some 
of the opinions were based on false suppositions.

It was proposed that character specifications in octal in general (maybe 
even all octal numbers) be deprecated with an eye to eventually removing 
them from the language.  That generated most of the heat.  And so, octal 
numbers are forever retained.

It was proposed that a new construct to mean an octal number be put in: 
0odddd, to correspond with 0xdddd.  I myself have no problems with this, 
but it is not directly related to this issue, and I myself would 
probably never get around to expending the effort to implement it.

It was proposed to deprecate use of octally-specified characters that 
don't fit into a byte, unless 'use legacy' was specified.  A test was 
added to Configure in 5.12 to warn against use of Perl in architectures 
which have > 8 bit bytes (as various shifts and array accesses implicitly 
assume an 8-bit byte).  The C standard effectively requires a U8 to be 
at least 8 bits long; the Configure test makes sure it is exactly 8 bits 
long.  Therefore this proposal effectively would deprecate using 0400 
and above to specify a character.

It was proposed to add the syntax \o{ddddd...} to specify characters 
using octal.  This would be a way to use octal notation to specify 
characters that require utf8.  Several people gave grudging assent to 
this, and no one was dead-set against it.  I, for one, think it's a good 
idea, and plan to do the work.  Yves pointed out that this would be a 
way for someone to unambiguously specify in a regex pattern that they 
meant an octal constant, and not a backreference.
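
For a sense of how the proposed syntax would read, here is a sketch. It assumes \o{...} behaves analogously to \x{...} and needs a perl in which the construct has been implemented (5.14 and later); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $from_braces = "\o{400}";   # braces form, no three-digit limit
my $from_chr    = chr(0400);   # same code point via an octal literal

# In a pattern, \o{1} can only be an octal character, never a
# backreference, even though the pattern contains a capture group.
my $unambiguous = "a\001" =~ /\A(a)\o{1}\z/ ? 1 : 0;

printf "same=%s unambiguous=%d\n",
    ($from_braces eq $from_chr ? "yes" : "no"), $unambiguous;
```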

And finally, here's what I think should be done.  First, I've 
encountered a number of minor problems, such as erroneous error 
messages, and what appears to be improper dumping, and those can be 
fixed non-controversially.

I think the -0 command line behavior is adequate, and I've already 
prepared a patch to document it.

I think that \o{...} is worthwhile, and will submit a patch to do it.

The main issue is what to do about deprecating 0400-0777 to specify 
characters octally when the braces form isn't used.  If we do this, I 
think it would have to be in conjunction with a new 'use feature'; and 
I'm ambivalent about it.  After all that's been said, I don't see the 
harm in them.  It may not be common to use such large numbers, and it 
may not have been the intent of the perl developers that they be used, 
but it isn't a misuse of the construct, and I don't see these reasons as 
sufficient to deprecate them.  Another option would be to warn but not 
deprecate.  I could go with that, but I'm doubtful.

The improper behavior in taking the character modulo 256 has been 
deprecated in 5.10.1 and 5.12, so we should feel free to change it.  If 
we agree on the paragraph just above, I propose to just take the 
character not modulo 256, which would bring this behavior into line with 
all the other cases.  It turns out that this is what the original patch 
I submitted does.
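
The difference between the two behaviors comes down to simple arithmetic, sketched here outside of any regex. This does not reproduce what the regex engine does; it only shows which character each rule would pick for one sample value, assuming an ASCII platform.

```perl
use strict;
use warnings;

my $ordinal = 0501;                  # octal 501 = decimal 321, above 0377

my $wrapped = chr($ordinal % 256);   # modulo behavior: 321 % 256 = 65
my $literal = chr($ordinal);         # proposed behavior: keep the full value

printf "wrapped: %s (ord %d)\n", $wrapped, ord $wrapped;
printf "literal: ord %d (hex %X)\n", ord $literal, ord $literal;
```

Under the modulo rule a pattern author who wrote \501 would silently get "A", which is why that behavior is hard to defend.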

And, a question that came up from my investigation: Should the message 
when someone uses \1 in a substitution replacement actually be changed 
to have deprecation-class?
