Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
From: Glenn Linderman
Date: October 27, 2008 14:22
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 49063069.8090001@NevCal.com
On approximately 10/25/2008 7:47 PM, came the following characters from
the keyboard of karl williamson:
> And, I forgot to mention that setting $v = "\400" is the same thing as
> setting it to 256 or \x{100}.
>
> Thus the only inconsistency I am aware of is in the area I patched. But,
> as it's becoming increasingly obvious as I subscribe to this list, my
> knowledge of the Perl language is minuscule.
>
> I don't understand Glenn's point about using octal to fill whatever a
> byte is on a machine, but no more. Suppose there were a nine-bit-byte
> machine (which one could argue is the maximum the C language
> specification allows, given that it limits an octal escape to the 3
> digits needed to fill such a byte). What would those extra high-bit-set
> patterns represent if not characters? And what could one do with them if not?
Surely on a 9-bit-byte machine, if there were any character-wise benefit
to creating such characters, the users of those machines would
understand that, and would use the octal escapes necessary to create the
appropriate ordinals for those characters.
I rather expect that the primary use of 9-bit byte values would have
been to initialize bytes to binary values rather than to deal in
characters: I had never seen any character encoding that speaks of 9-bit
character values until I just found http://tools.ietf.org/html/rfc4042
via a Google search. As that RFC states:
By comparison, UTF-9 uses one to two nonets to represent codepoints
in the BMP, three nonets to represent [UNICODE] codepoints outside
the BMP, and three or four nonets to represent non-[UNICODE]
codepoints. There are no wasted bits, and as the examples in this
document demonstrate, the computational processing is minimal.
> So, it seems to me that one either limits an octal constant to \377, or
> one allows it up to \777 with them all potentially mapping into the
> characters (or code points if you prefer) whose ordinal number
> corresponds to the constant. If we limit them, there is the possibility
> that existing code will break, as in many places now they aren't
> limited. I don't know where all those places are. If my patch is
> accepted, then it gets rid of one place where there is an inconsistency;
> and I know of no others.
Tom did point out two inconsistencies with octal values of 0777 below,
and I suspect there will still be people with latent bugs who, thinking
as Tom did, expect \400 to be two characters, or who code \400 - \777
accidentally and then get unexpected behavior, not understanding why
their string suddenly has the internal UTF-8 flag set!
But it is also true that there could be some code depending on this
behavior, using it to generate 9-bit values, since \400 is shorter than
\x{100} and fits better on one line!
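To make that surprise concrete, here is a minimal sketch (using the core
Devel::Peek module, and assuming the string-literal behavior Karl
describes above):

    use Devel::Peek;                 # core module; Dump() shows a scalar's internal flags
    my $s = "\400";                  # per the above, the same character as chr(256) / "\x{100}"
    print length($s), "\n";          # 1 -- a single character, not two
    printf "U+%04X\n", ord($s);      # U+0100
    Dump($s);                        # FLAGS includes UTF8 on an 8-bit platform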
It might perhaps be handy to enhance the documentation to point out that
octal escapes in the range \400 through \777 do not fit within a
single byte on 8-bit platforms, and that the escapes \0 through \777 can
generate characters with Unicode codepoints anywhere in U+0000 through
U+01FF; but also that, because Unicode codepoints are expressed in
hexadecimal throughout the Unicode documentation, octal notation for
Unicode characters is unlikely to be quickly understood by the average
programmer.
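A documentation example might look something like this sketch, showing
why the hexadecimal spelling is the recognizable one for these codepoints:

    printf "\\377 -> U+%04X\n", ord("\377");   # U+00FF: still fits in one byte
    printf "\\400 -> U+%04X\n", ord("\400");   # U+0100: no longer fits in a byte
    printf "\\777 -> U+%04X\n", ord("\777");   # U+01FF: top of the three-digit octal range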
I still think it would be better to treat octal escapes greater than
\377 as errors, or to deprecate octal escape syntax completely.
> Maybe we should let some others weigh in on the matter.
I waited two days, and no one else has weighed in... it would be nice if
several additional opinions could be obtained.
> Tom Christiansen wrote:
>> Dear Glenn and Karl,
>> This would somewhat follow how "\x123" stops at \x12 (?)
>> rather than (ever(?)) generate a single string of length
>> one containing a char >8bits in length, giving instead a
>> two-char string "\x{12}3", which is different than the
>> longer string \x{123} would produce after encoding/decoding
>> for UTF-8 etc output.
I agree that a Perl-self-consistent behaviour could be achieved by
limiting octal escapes to values in the range \0 - \377 and treating
\400 as two characters; however, that would never be consistent with
what K&R defined... and there are still today a fair number of
programmers alive for whom K&R was their first programming book, or at
least the first to introduce them to the octal escape. Whether Larry
considered that when designing the unbraced \x escape, I couldn't say...
but the unbraced \x escape clearly is limited to values in the range \0
- \377: does that mean that Larry defined Perl for an 8-bit-byte
architecture? Are there other examples, documentation, or history that
prove that Perl has never run on a 9-bit architecture?
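The asymmetry Tom describes can be shown in a couple of lines; this is
just a sketch of the behavior under discussion:

    print length("\x123"), "\n";     # 2: unbraced \x stops after two hex digits, giving chr(0x12) . "3"
    print length("\x{123}"), "\n";   # 1: the braced form takes the whole number
    print length("\123"), "\n";      # 1: an octal escape consumes up to three digits; "\123" is chr(83), "S"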
>> I think the font of folly is found in the way that an
>> *unbraced* \x takes TWO AND EXACTLY TWO CHARS following
>> it...
>> ...whereas \<digit> takes 1, 2, *or* 3 (and what's this
>> about MS-pain about it taking more, anyway?) octal digits,
>> and that therein lay the rubbish that afflicted me.
I'm not enamored of having either Perl or MS-VC++ 6.0 (and probably
later versions, and maybe earlier too) define a wide-character meaning
for octal escapes in the range \400 - \777. I think that, except on an
architecture with 9-bit bytes (or larger), their value in writing
obfuscated Unicode codepoints is far exceeded by the confusion that
results when they are used accidentally. Hence my suggestion to make
them an error.
>> There's no way {say, braces} to delimit the octal escape's
>> characters from what follows it, which seems to be the crux of
>> the problem here. We can't put Z<> or \& strings, per POD or
>> troff respectively; we have to break them up.
>>
>> So you can't say "\{40}0" or "\0{40}0" or whatnot as you can
>> with "\x{20}" and "\x{20}0".
While it might be possible to invent a \{ooo} syntax, I think it would be
better to enhance the documentation for octal escapes to recommend using
hex escapes instead, pointing out the deficiencies and ambiguities of
octal escapes.
Deprecating octal escapes might be an even better solution... leaving
the \<digit> syntax to be used unambiguously for backreferences in
substitutions.
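As a sketch of the kind of ambiguity such documentation could warn
about, compare an octal escape followed by a literal digit with the
braced hex form:

    my $octal = "\1234";       # parsed as chr(0123) . "4", i.e. "S4" -- easy to misread
    my $hex   = "\x{53}4";     # explicitly chr(0x53) . "4", also "S4"
    print $octal eq $hex ? "same\n" : "different\n";   # prints "same"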
>> Probably this naïveté derives from having no direct experience with
>> "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits.
>> Even on the shorter end of that stick, I've only a wee, ancient bit
>> of experience with bytes <8 bits. That is, long ago and far away,
>> we used to pack six, 6-bit "RAD-50" chars into a 36-bit word under
>> EXEC-8, and sometimes used them even from DEC systems.
My understanding of 36-bit architectures was that characters were
generally stored either as 9-bit bytes (using only 7 of the bits, since
that is all ASCII needed) or as a "useful" subset of ASCII squeezed into
6 bits, which was more efficient on systems with limited RAM. (Looking
back, it seems that RAM (core, then) was always limited!) However, that
comes just from documentation I once read for a CDC (I think) machine,
not from personal experience with one.
>> (See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow
>> from there; I guess we get pack()'s BER w* whatzitzes instead.)
>>
>> Karl has clearly identified an area crying out for improvement
>> (read: an indisputable bug), and even better, he's sacrificed
>> his own mental elbow-grease to address the problem for the
>> greater good of us all.
>>
>> I can't see how to ask more--and so I strongly applaud his
>> generous contribution to the greater good of making the
>> world a better place.
>>
>> I'm still a little skittish though, because as far as I noticed,
>> perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/
>> scenarios alone: ie, regex m/atche/s.
>>
>> I meant only to say that addressing patterns alone while leaving
>> both strings and the CLI for -0 setting $/ out of the changes
>> risked introducing a petty inconsistency: a conceptual break
>> between `perl -0777` as an unimplementable "octal" byte spec
>> that therefore means undef()ing $/.
>>
>> Plus, there's how the docs equate $/ = "\0777" to undef $/.
Note that
$/ = "\0777"
would produce a two-character string...
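That is, it parses as the three-digit escape \077 followed by a literal
7; a quick sketch:

    my $s = "\0777";                      # parsed as "\077" . "7"
    print length($s), "\n";               # 2
    printf "U+%04X then '%s'\n", ord($s), substr($s, 1);   # U+003F ('?') then '7'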
These uses of -0777 and \777 seem to be special cases, but they do add
inconsistency to the otherwise consistent world that results from Karl's
patch, eh? It seems clear that octal 777 was not expected to be usable as
a character value at the time it was given these other meanings; yet
accepting octal escapes greater than \377 as characters has produced the
expectation, and for Unicode even the reality, that \777 can sometimes
be interpreted that way. This is, perhaps, a still-remaining
inconsistency in the syntax. Should \777 be rejected as a character, to
eliminate this inconsistency? Or why not reject the whole range from
\400 - \777?
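For contrast, the command-line special case Tom mentions looks like this
(the file name is only illustrative):

    perl -0777 -ne 'print length($_), " characters slurped\n"' somefile.txt
        # -0777 cannot name a single byte, so it undefs $/ and the whole file arrives at once
    perl -e '$/ = "\0777"; print length($/), "\n"'
        # prints 2: in code, "\0777" is the two-character string "\077" . "7"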
>> This seems troublesome, and I'd wonder how it worked if that *were*
>> a legal sequence. And I wonder the ramifications of "breaking" it;
>> it's really quite possible they're even worth doing, but I don't know.
>>
>> Again, I've never been somewhere a character or byte's ever
>> been more than 8 bits *NOR LESS THAN* 6, so I don't know what
>> expectations or experience in such hypothetical places might be.
>>
>> I'm sure some out there have broader experience than mine, and
>> hope to hear from them.
>>
>> ^-----------------------------^
>> | SUMMARY of Exposition Above |
>> +=============================+
>>
>> * I agree there's a bug.
>>
>> * I believe Karl has produced a reasonable patch to fix it.
>>
>> * I wonder what *else* might/should also change in tandem with
>> estimable amendment so as to:
>>
>> ? avoid evoking or astonishing any hobgoblins of
>> foolish inconsistency (ie: breaking expectations)
>>
>> ? what (if any?) backwards-compat story might need spinning
>> (ie: breaking code, albeit cum credible Apologia)
>>
>> Hope this makes some sense now. :(
>>
>> --tom
>>
>> PS: And what about *Perl VI* in this treatment of "\0ctals", eh?!
Not sure what Perl VI is? An implementation of VI in Perl? An
interface between Perl and VI? I'm not a VI user, nor likely to become one.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking