Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

karl williamson
October 25, 2008 12:26
I forgot to mention that currently (in both 5.8 and 5.10),
  perl -le 'print "\400" =~ /[\400]/'

prints 1.  So again, the current implementation is inconsistent.

Glenn Linderman wrote:
> On approximately 10/25/2008 7:32 AM, came the following characters from 
> the keyboard of karl williamson:
>> Tom Christiansen wrote:
>>> Karl,
>>> Thanks very much for the patch.
>>> I confess I never *expected* "\400" to be "\x{100}", but rather 
>>> "\x{20}0".   
>> I'm not wedded to the interpretation I submitted.  It came out of 
>> testing against the documentation that a backslash in a regular 
>> expression can be followed by 1-3 octal digits (the first one not 
>> needing to be a 0).  And the interpretation I submitted is the one 
>> that the pre-existing code meant.  Witness
>>  perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
>>  perl -le 'print "\400\x{100}" =~ /\400\x{100}/'
>> both print 1.
>> That is, if any other character anywhere in the pattern causes regcomp 
>> to think that it should store the re as utf8, then \400 matches \400 
>> and so on up to the maximum 3 digit octal: \777 matches \777.  Failing 
>> that, /\400/ will match \01\00.  It never matches "\x{20}0".
>> So we have an existing bug.  Sometimes \400 matches \400, and 
>> sometimes it matches \01\00, depending on what I would call spooky 
>> action at a distance.  (This means that \777 sometimes already matches 
>> \777 now.) I'm trying to get rid of these inconsistencies.  I think 
>> something should be done here, but perhaps it's not what I thought it 
>> should be.  My patch follows what the code was intending to do, but 
>> perhaps we should change that intention.  Please guide me.
> I understand where Tom is coming from, but he has no grounds for 
> expecting "\400" to be the same as " 0".
> I pulled out my old K&R, which is likely one of the earliest published 
> books documenting the octal escape notation, and it explicitly says 
> (section 2.3) "an arbitrary byte-sized bit pattern" can be created with 
> an octal escape, and also gives a "maximum of 3 octal digits".  Section 2.2 
> talks about both 8 and 9 bit characters on example architectures.
> So it would be wrong to limit the values to 8 bits.  It is probably 
> "platform specific" what interpretation should be applied to octal 
> escapes that exceed the platform specific size of a byte, but it is not 
> correct to assume that the octal escape "\400" ends after 2 digits 
> simply because the numeric value of it exceeds \377.  Writing such code 
> is probably non-portable, because of the possible variation in byte sizes.
> In systems with larger character values, it seems that:
> 1) numbers greater than "\377" could be interpreted as larger character 
> values, as Karl proposes, but doing so is likely to cause confusion. 
> Also, it should be pointed out that the escape was intended to fill a 
> "byte", so it is my belief that octal escapes producing values that 
> exceed the value of a platform-specific byte size should be rejected 
> with an error.  I'm not sure if Perl supports systems with byte sizes 
> other than 8 bits, but if it does, this would be a platform specific 
> check.  Note that limiting the octal escape to 3 digits prevents the 
> octal escape from being used to create all possible bit patterns for 
> bytes larger than 12 bits (but I am unaware of any computer platform 
> ever defining a byte larger than 10 bits).
> 2) Unicode values can clearly exceed 12 bits, so it seems that the octal 
> escape is somewhat useless for creating all the possible values, so 
> extending them to deal with values greater than the value of a 
> platform-specific byte seems inappropriate, given all the documentation 
> that
> 3) It seems much more likely, in my opinion, that an octal escape that 
> exceeds the value of a platform specific byte is an error rather than an 
> extension feature.
> 4) It is easy to convert octal escapes into hex escapes if any existing 
> programs that presently misuse octal escapes exceeding the value of a 
> platform-specific byte encounter versions of Perl that suddenly 
> reject such values.  In fact, a clever error message might be crafted 
> that says:
> sprintf 'Octal escape sequence "%o" is invalid.  You probably meant 
> "\x{%02.2x}\x{%02.2x}" or "\x{%04.4x}"', ender, ender / 8, ender % 8, ender
> to help the programmer quickly fix the problem.
> On the other hand MS VC++ 6.0 explicitly allows the use of the full 12 
> bits possible to represent in an octal escape as an initializer for a 
> wchar_t constant.
> So there is precedent for Karl's scheme, even if there is no precedent 
> for Tom's (that I could find).
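
For reference, the three readings of "\400" that come up in this thread can be worked out from the digits alone. What follows is only a sketch: octal_readings is a hypothetical helper written for illustration, not part of the patch or of regcomp, and it does plain arithmetic on the digit string rather than exercising the regex engine (whose behavior is exactly what is in dispute).

```perl
use strict;
use warnings;

# Hypothetical helper: given three octal digits (e.g. "400"), return the
# three interpretations discussed in the thread, as printable strings.
sub octal_readings {
    my ($digits) = @_;
    my $value = oct($digits);    # numeric value of all three digits

    return {
        # One character whose code point is the full 3-digit value.
        code_point => sprintf('\x{%x}', $value),
        # A 2-digit escape followed by the third digit as a literal.
        split      => sprintf('\x{%x}%s',
                              oct(substr($digits, 0, 2)),
                              substr($digits, 2, 1)),
        # The value broken into a high byte and a low byte.
        bytes      => sprintf('\%02o\%02o', $value >> 8, $value & 0xff),
    };
}

my $r = octal_readings("400");
print "$_: $r->{$_}\n" for qw(code_point split bytes);
# code_point: \x{100}
# split: \x{20}0
# bytes: \01\00
```

The code_point line is the reading the patch implements, split is the one Tom expected, and bytes is what /\400/ falls back to when nothing else forces the pattern to utf8.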
