perl.perl5.porters | Postings from October 2008

Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

Glenn Linderman
October 25, 2008 10:48
On approximately 10/25/2008 7:32 AM, came the following characters from 
the keyboard of karl williamson:
> Tom Christiansen wrote:
>> Karl,
>> Thanks very much for the patch.
>> I confess I never *expected* "\400" to be "\x{100}", but rather 
>> "\x{20}0".
> I'm not wedded to the interpretation I submitted.  It came out of testing 
>  against the documentation that a backslash in a regular expression can 
> be followed by 1-3 octal digits (the first one not needing to be a 0).  
> And the interpretation I submitted is the one that the pre-existing code 
> meant.  Witness
>  perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
>  perl -le 'print "\400\x{100}" =~ /\400\x{100}/'
> both print 1.
> That is, if any other character anywhere in the pattern causes regcomp 
> to think that it should store the re as utf8, then \400 matches \400 and 
> so on up to the maximum 3 digit octal: \777 matches \777.  Failing that, 
> /\400/ will match \01\00.  It never matches "\x{20}0".
> So we have an existing bug.  Sometimes \400 matches \400, and sometimes 
> it matches \01\00, depending on what I would call spooky action at a 
> distance.  (This means that \777 sometimes already matches \777 now.) 
> I'm trying to get rid of these inconsistencies.  I think something should 
> be done here, but perhaps it's not what I thought it should be.  My patch 
> follows what the code was intending to do, but perhaps we should change 
> that intention.  Please guide me.

I understand where Tom is coming from, but he has no grounds for 
expecting "\400" to be the same as " 0".

I pulled out my old K&R, which is likely one of the earliest published 
books documenting the octal escape notation, and it explicitly says 
(section 2.3) "an arbitrary byte-sized bit pattern" can be created with 
an octal escape, and it also specifies a "maximum of 3 octal digits".  Section 2.2 
talks about both 8 and 9 bit characters on example architectures.

So it would be wrong to limit the values to 8-bits.  It is probably 
"platform specific" what interpretation should be applied to octal 
escapes that exceed the platform specific size of a byte, but it is not 
correct to assume that the octal escape "\400" ends after 2 digits 
simply because its numeric value exceeds \377.  Writing such code 
is probably non-portable, because of the possible variation in byte sizes.
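
To make the range question concrete: three octal digits carry at most 
9 bits (\777 is 511), so whether a given escape "fits" depends on the 
byte width assumed.  The following is only a sketch of that arithmetic; 
fits_in_byte is a hypothetical helper, not anything in Perl itself, and 
the byte widths 8 and 9 are just the ones K&R mentions.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a Perl builtin):
# returns 1 if the octal escape \$digits fits in a byte of $bits bits,
# and 0 otherwise.
sub fits_in_byte {
    my ($digits, $bits) = @_;
    return oct($digits) < 2 ** $bits ? 1 : 0;
}

# Compare a few three-digit escapes against 8-bit and 9-bit bytes.
for my $bits (8, 9) {
    for my $digits ('377', '400', '777') {
        printf "\\%s in a %d-bit byte: %s\n", $digits, $bits,
            fits_in_byte($digits, $bits) ? 'fits' : 'out of range';
    }
}
```

With 8-bit bytes only \377 fits; with 9-bit bytes all three do.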

In systems with larger character values, it seems that:

1) numbers greater than "\377" could be interpreted as larger character 
values, as Karl proposes, but doing so is likely to cause confusion. 
Also, it should be pointed out that the escape was intended to fill a 
"byte", so it is my belief that octal escapes producing values that 
exceed the value of a platform-specific byte size should be rejected 
with an error.  I'm not sure if Perl supports systems with byte sizes 
other than 8 bits, but if it does, this would be a platform specific 
check.  Note that limiting the octal escape to 3 digits prevents the 
octal escape from being used to create all possible bit patterns for 
bytes larger than 9 bits (but I am unaware of any computer platform 
ever defining a byte larger than 10 bits).

2) Unicode values can clearly exceed 9 bits, so it seems that the octal 
escape is somewhat useless for creating all the possible values, so 
extending them to deal with values greater than the value of a 
platform-specific byte seems inappropriate, given all the documentation 

3) It seems much more likely, in my opinion, that an octal escape that 
exceeds the value of a platform specific byte is an error rather than an 
extension feature.

4) It is easy to convert octal escapes into hex escapes, should any 
existing programs that misuse octal escapes exceeding the value of a 
platform-specific byte encounter versions of Perl that suddenly reject 
such values.  In fact, a clever error message might be crafted that 
says:

sprintf 'Octal escape sequence "%o" is invalid.  You probably meant 
"\x{%02.2x}\x{%02.2x}" or "\x{%04.4x}"', ender, ender / 8, ender % 8, ender

to help the programmer quickly fix the problem.
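
A runnable sketch of such a message, written in plain Perl rather than 
inside regcomp, might look like the following.  It assumes 8-bit bytes, 
and it assumes the first suggestion should spell out Tom's reading (two 
octal digits followed by a literal digit) and the second Karl's reading 
(a single code point).  suggest_fix is a hypothetical name, not part of 
Perl or of the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): given the three digits of an
# octal escape, return a diagnostic if the value exceeds an 8-bit byte,
# or nothing if the escape is in range or malformed.
sub suggest_fix {
    my ($digits) = @_;
    return unless $digits =~ /\A[0-7]{3}\z/;   # exactly three octal digits
    my $value = oct($digits);
    return if $value <= 0377;                  # fits in 8 bits: no complaint

    # Tom's reading: first two digits are the escape, the third is literal.
    my $prefix = oct(substr($digits, 0, 2));
    my $last   = substr($digits, 2, 1);

    return sprintf(
        'Octal escape sequence "\\%s" is invalid.  '
      . 'You probably meant "\\x{%02x}%s" or "\\x{%04x}"',
        $digits, $prefix, $last, $value
    );
}

print suggest_fix('400'), "\n";
```

For '400' this suggests "\x{20}0" and "\x{0100}", the two readings 
discussed in this thread.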

On the other hand, MS VC++ 6.0 explicitly allows the use of the full 9 
bits possible to represent in an octal escape as an initializer for a 
wchar_t constant.

So there is precedent for Karl's scheme, even if there is no precedent 
for Tom's (that I could find).

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
