Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
From:
Glenn Linderman
Date:
October 25, 2008 10:48
Subject:
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID:
49035BE3.3060502@NevCal.com
On approximately 10/25/2008 7:32 AM, came the following characters from
the keyboard of karl williamson:
> Tom Christiansen wrote:
>> Karl,
>>
>> Thanks very much for the patch.
>>
>> I confess I never *expected* "\400" to be "\x{100}", but rather
>> "\x{20}0".
> I'm not wedded to the interpretation I submitted. It came out of testing
> against the documentation that a backslash in a regular expression can
> be followed by 1-3 octal digits (the first one not needing to be a 0).
> And the interpretation I submitted is the one that the pre-existing code
> meant. Witness
>
> perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
> perl -le 'print "\400\x{100}" =~ /\400\x{100}/'
> both print 1.
>
> That is, if any other character anywhere in the pattern causes regcomp
> to think that it should store the re as utf8, then \400 matches \400 and
> so on up to the maximum 3 digit octal: \777 matches \777. Failing that,
> /\400/ will match \01\00. It never matches "\x{20}0".
>
> So we have an existing bug. Sometimes \400 matches \400, and sometimes
> it matches \01\00, depending on what I would call spooky action at a
> distance. (This means that \777 sometimes already matches \777 now.)
> I'm trying to get rid of these inconsistencies. I think something should
> be done here, but perhaps it's not what I thought it should be. My patch
> follows what the code was intending to do, but perhaps we should change
> that intention. Please guide me.
I understand where Tom is coming from, but he has no grounds for
expecting "\400" to be the same as " 0".
I pulled out my old K&R, which is likely one of the earliest published
books documenting the octal escape notation, and it explicitly says
(section 2.3) "an arbitrary byte-sized bit pattern" can be created with
an octal escape, and also gives a "maximum of 3 octal digits". Section 2.2
talks about both 8 and 9 bit characters on example architectures.
So it would be wrong to limit the values to 8-bits. It is probably
"platform specific" what interpretation should be applied to octal
escapes that exceed the platform specific size of a byte, but it is not
correct to assume that the octal escape "\400" ends after 2 digits
simply because its numeric value exceeds \377. Writing such code
is probably non-portable, because of the possible variation in byte sizes.
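As a minimal sketch of the reading described above (not Perl's actual regcomp code), the C fragment below consumes up to three octal digits regardless of the resulting value, and only afterwards asks whether that value fits in a platform byte. The names parse_octal_escape and fits_in_byte are hypothetical, introduced here for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: s points just past the backslash. Consumes up to
 * three octal digits, stores their value in *value, and returns how many
 * digits were consumed. The digit count never depends on the value, so
 * "400" is read as one escape (0400), not as "40" followed by "0". */
size_t parse_octal_escape(const char *s, unsigned *value)
{
    size_t n = 0;
    unsigned v = 0;

    assert(s != NULL && value != NULL);
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        v = v * 8 + (unsigned)(s[n] - '0');
        n++;
    }
    *value = v;
    return n;
}

/* Nonzero when the value fits in one platform byte (CHAR_BIT bits wide). */
int fits_in_byte(unsigned value)
{
    return value <= UCHAR_MAX;
}
```

With 8-bit bytes, "377" parses to 255 and fits, while "400" parses to 256 and does not; what to do in the second case is the policy question under discussion.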
In systems with larger character values, it seems that:
1) numbers greater than "\377" could be interpreted as larger character
values, as Karl proposes, but doing so is likely to cause confusion.
Also, it should be pointed out that the escape was intended to fill a
"byte", so it is my belief that octal escapes producing values that
exceed the value of a platform-specific byte size should be rejected
with an error. I'm not sure if Perl supports systems with byte sizes
other than 8 bits, but if it does, this would be a platform specific
check. Note that limiting the octal escape to 3 digits prevents the
octal escape from being used to create all possible bit patterns for
bytes larger than 9 bits, since the largest 3-digit value is \777 (but I am
unaware of any computer platform ever defining a byte larger than 10 bits).
2) Unicode values can clearly exceed 12 bits, so it seems that the octal
escape is somewhat useless for creating all the possible values, so
extending them to deal with values greater than the value of a
platform-specific byte seems inappropriate, given all the documentation
that
3) It seems much more likely, in my opinion, that an octal escape that
exceeds the value of a platform specific byte is an error rather than an
extension feature.
4) It is easy to convert octal escapes into hex escapes should any existing
programs that presently misuse octal escapes exceeding the value of a
platform-specific byte encounter versions of Perl that suddenly
reject such values. In fact, a clever error message might be crafted
that says:
sprintf 'Octal escape sequence "%o" is invalid. You probably meant
"\x{%02.2x}\x{%02.2x}" or "\x{%04.4x}"', ender, ender / 8, ender % 8, ender
to help the programmer quickly fix the problem.
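For illustration, the message format suggested above could be produced in C roughly as follows. This is only a sketch: format_octal_error is a hypothetical name, `ender` is assumed to hold the numeric value of the escape, and "%.2x" / "%.4x" are used as equivalents of the "%02.2x" / "%04.4x" conversions in the message as written.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the suggested diagnostic for an octal escape
 * whose value is `ender` into buf, and returns snprintf's result. */
int format_octal_error(char *buf, size_t size, unsigned ender)
{
    assert(buf != NULL);
    return snprintf(buf, size,
        "Octal escape sequence \"%o\" is invalid. You probably meant "
        "\"\\x{%.2x}\\x{%.2x}\" or \"\\x{%.4x}\"",
        ender,       /* the escape as written, in octal */
        ender / 8,   /* value of the leading digits, e.g. 040 = 0x20 */
        ender % 8,   /* numeric value of the final digit */
        ender);      /* the whole value as a single code point */
}
```

For an `ender` of 0400 this yields the suggestions "\x{20}\x{00}" and "\x{0100}". Because `ender % 8` is the numeric value of the last digit rather than the digit character, a variant that prints that digit literally may be closer to the "\x{20}0" reading; that variant is my assumption, not part of the proposal.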
On the other hand, MS VC++ 6.0 explicitly allows the use of the full 9
bits possible to represent in a 3-digit octal escape as an initializer for a
wchar_t constant.
So there is precedent for Karl's scheme, even if there is no precedent
for Tom's (that I could find).
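A small sketch of what that precedent looks like in C, assuming a compiler whose wchar_t is at least 16 bits wide. I have not checked this against MS VC++ 6.0 itself, and the names WIDE_400, WIDE_777 and wide_octal_matches_hex are hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Three-digit octal escapes above \377 used as wide character constants. */
static const wchar_t WIDE_400 = L'\400';  /* same value as 0x100 */
static const wchar_t WIDE_777 = L'\777';  /* same value as 0x1FF, the largest
                                             three-digit octal escape */

/* Hypothetical helper: nonzero when the octal and hex spellings agree. */
int wide_octal_matches_hex(void)
{
    assert(WCHAR_MAX >= 0777);  /* assumption: wchar_t can hold 0777 */
    return WIDE_400 == L'\x100' && WIDE_777 == L'\x1FF';
}
```

Under that assumption the octal and hex spellings name the same wide characters, which is the behavior Karl's interpretation would give "\400" in Perl.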
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking