
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

From: Glenn Linderman
Date: October 27, 2008 14:22
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 49063069.8090001@NevCal.com
On approximately 10/25/2008 7:47 PM, came the following characters from 
the keyboard of karl williamson:
> And, I forgot to mention that setting $v = "\400" is the same thing as 
> setting it to 256 or \x{100}.
> 
> Thus the only inconsistency I am aware of is in the area I patched. But, 
> as it's becoming increasingly obvious as I subscribe to this list, my 
> knowledge of the Perl language is minuscule.
> 
> I don't understand Glenn's point about using octal to fill whatever a
> byte is on a machine, but no more.  Suppose there were a nine-bit-byte
> machine (which one could argue is the maximum the C language
> specification allows, given that it limits an octal escape to three
> octal digits).  What would those extra high-bit-set patterns represent
> if not characters?  And what could one do with them if they weren't
> characters?


Surely on a 9-bit-byte machine, if there were any character-wise
benefit to creating such characters, the users of those machines would
understand that, and would use whatever octal escape was necessary to
create the appropriate ordinals for those characters.

I rather expect that the primary use of 9-bit byte values would have
been to initialize bytes to binary values rather than to deal in
characters: I had never seen any character encoding that speaks of
9-bit character values until I just found
http://tools.ietf.org/html/rfc4042 via a Google search.  As that RFC
states:

    By comparison, UTF-9 uses one to two nonets to represent codepoints
    in the BMP, three nonets to represent [UNICODE] codepoints outside
    the BMP, and three or four nonets to represent non-[UNICODE]
    codepoints.  There are no wasted bits, and as the examples in this
    document demonstrate, the computational processing is minimal.


> So, it seems to me that one either limits an octal constant to \377, or 
> one allows it up to \777 with them all potentially mapping into the 
> characters (or code points if you prefer) whose ordinal number 
> corresponds to the constant.  If we limit them, there is the possibility 
> that existing code will break, as in many places now they aren't 
> limited.  I don't know where all those places are.  If my patch is 
> accepted, then it gets rid of one place where there is an inconsistency; 
> and I know of no others.


Tom did point out two inconsistencies with the octal value 0777 below,
and I suspect there will still be people with latent bugs from
thinking, as Tom did, that \400 should perhaps be two characters, or
who will hit unexpected behaviors when coding \400 - \777 accidentally
and not understand why their string suddenly has the UTF-8 flag set!
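
To make that concrete, here is a quick illustration of the
string-literal behaviour as I understand it (results from the perls I
have handy; utf8::is_utf8 just reports the internal flag):

    my $s = "\400";                  # octal escape above \377
    print length($s), "\n";          # 1 -- a single character...
    print ord($s), "\n";             # 256 -- ...with ordinal 0400
    print utf8::is_utf8($s) ? "UTF-8 flag set\n" : "plain bytes\n";

    my $b = "\377";                  # still fits in one byte
    print ord($b), "\n";             # 255
    print utf8::is_utf8($b) ? "UTF-8 flag set\n" : "plain bytes\n";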

But it is also true that there could be some code depending on this
behavior, using it to generate 9-bit values, since \400 is shorter
than \x{100} and fits better on one line!

It might be handy to enhance the documentation to point out that
octal escapes in the range \400 through \777 do not fit within a
single byte on 8-bit platforms, and that they can be used to generate
characters with Unicode codepoints anywhere in the range \0 through
\777.  It should probably also note that, because Unicode codepoints
are expressed in hexadecimal throughout the Unicode documentation,
octal notation for Unicode characters is unlikely to be quickly
understood by the average programmer.
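
If such a note were added, a small snippet along these lines might
help readers make the octal-to-hex translation (my sketch, not
existing documentation text):

    # How the octal forms map onto the hexadecimal codepoints that
    # the Unicode documentation uses:
    printf "\\%o is U+%04X\n", $_, $_ for 0400, 0500, 0777;
    # \400 is U+0100
    # \500 is U+0140
    # \777 is U+01FF
    print "same\n" if "\400" eq "\x{100}";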

I still think it would be better to treat octal escapes greater than 
\377 as errors, or to deprecate octal escape syntax completely.


> Maybe we should let some others weigh in on the matter.


I waited two days, and no one else has weighed in... it would be nice
to hear several additional opinions.


> Tom Christiansen wrote:
>> Dear Glenn and Karl,

>> This would somewhat follow how "\x123" stops at \x12 (?)
>> rather than (ever(?)) generate a single string of length
>> one containing a char >8bits in length, giving instead a
>> two-char string "\x{12}3", which is different than the
>> longer string \x{123} would produce after encoding/decoding
>> for UTF-8 etc output.


I agree that a Perl-self-consistent behaviour could be achieved by
limiting octal escapes to values in the range \0 - \377 and treating
\400 as two characters; however, that would never be consistent with
what K&R defined... and there are still today a fair number of
programmers for whom K&R was their first programming book, or at
least the first to introduce them to the octal escape.  Whether Larry
considered that when designing the unbraced \x escape, I couldn't
say... but the unbraced \x escape clearly is limited to values in the
range \0 - \377: does that mean that Larry defined Perl to be an
8-bit-byte architecture?  Are there other examples, documentation, or
history that prove that Perl has never run on a 9-bit architecture?
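
For anyone following along, this is the unbraced-\x behaviour I mean
(as it behaves on the perls I've tried):

    my $s = "\x123";                    # unbraced \x uses at most two hex digits
    print length($s), "\n";             # 2
    print "ok\n" if $s eq "\x{12}3";    # chr(0x12) followed by "3"
    print length("\x{123}"), "\n";      # 1 -- the braced form takes the whole value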


>> I think the font of folly is found in the way that an
>> *unbraced* \x takes TWO AND EXACTLY TWO CHARS following
>> it...
>>   ...whereas \<digit> takes 1, 2, *or* 3 (and what's this
>> about MS-pain about it taking more, anyway?) octal digits,
>> and that therein lay the rubbish that afflicted me.


I'm not enamored of having either Perl or MS-VC++ 6.0 (and probably
later versions, and maybe earlier too) define a wide-character meaning
for octal escapes in the range \400 - \777.  I think that, except on
an architecture with 9-bit (or larger) bytes, their value for creating
obfuscated Unicode codepoints is far outweighed by the confusion that
would result when they are used accidentally.  Hence my suggestion for
an error.


>> There's no way {say, braces} to delimit the octal escape's
>> characters from what follows it, which seems to be the crux of
>> the problem here.  We can't put Z<> or \& strings, per POD or
>> troff respectively; we have to break them up.
>>
>> So you can't say "\{40}0" or "\0{40}0" or whatnot as you can
>> with "\x{20}" and "\x{20}0".


While it might be possible to invent \{ooo} syntax, I think it would be 
better to enhance the documentation for octal escapes to recommend using 
hex escapes instead, pointing out the deficiencies and ambiguities in 
using octal escapes.

Deprecating octal escapes might be an even better solution... leaving
the \<digit> syntax free to be used unambiguously for backreferences
in substitutions.
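
A rough illustration of the ambiguity I have in mind (behaviour as I
understand it; the exact warning text may vary by version):

    print ord("\1"), "\n";              # 1 -- in a string, \1 is the octal escape chr(1)
    print "matched\n"
        if "abab" =~ /(ab)\1/;          # in a pattern, \1 is a backreference
    # and in a substitution replacement, \1 draws a "better written
    # as $1" warning and is treated as $1.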


>> Probably this naïveté derives from having no direct experience with
>> "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits.
>> Even on the shorter end of that stick, I've only a wee, ancient bit
>> of experience with bytes <8 bits.  That is, long ago and far away,
>> we used to pack six, 6-bit "RAD-50" chars into a 36-bit word under
>> EXEC-8, and sometimes used them even from DEC systems.


My understanding of 36-bit architectures was that characters were
generally stored either as 9-bit bytes (using only 7 of the bits,
since that is all ASCII needed) or as a "useful" 6-bit subset of
ASCII, which was more efficient on systems with limited RAM.  (Looking
back, it seems that RAM (core, then) was always limited!)  However,
that comes only from documentation I once read for a CDC (I think)
machine, not from personal experience with one.


>> (See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow
>> from there; I guess we get pack()'s BER w* whatzitzes instead.)
>>
>> Karl has clearly identified an area crying out for improvement
>> (read: an indisputable bug), and even better, he's sacrificed
>> his own mental elbow-grease to address the problem for the
>> greater good of us all.
>>
>> I can't see how to ask more--and so I strongly applaud his
>> generous contribution to the greater good of making the
>> world a better place.
>>
>> I'm still a little skittish though, because as far as I noticed,
>> perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/
>> scenarios alone: ie, regex m/atche/s.
>>
>> I meant only to say that addressing patterns alone while leaving
>> both strings and the CLI for -0 setting $/ out of the changes
>> risked introducing a petty inconsistency: a conceptual break
>> between `perl -0777` as an unimplementable "octal" byte spec
>> that therefore means undef()ing $/.
>>
>> Plus, there's how the docs equate $/ = "\0777" to undef $/.


Note that

    $/ = "\0777"

would produce a two-character string...
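
(Because the escape consumes at most three octal digits, something
like this is what I'd expect to see:)

    my $x = "\0777";                    # \077 followed by "7"
    print length($x), "\n";             # 2
    print ord(substr $x, 0, 1), "\n";   # 63, i.e. "?"
    print $x, "\n";                     # ?7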

These uses of -0777 and \777 seem to be special cases, but they do add
inconsistency to the otherwise consistent world that results from
Karl's patch, eh?  It seems clear that octal 777 was not expected to
be usable as a character value at the time it was given these other
meanings; yet accepting octal escapes greater than \377 as characters
has produced the expectation, and with Unicode now even the reality,
that it can sometimes be interpreted that way.  This is, perhaps, a
still-remaining inconsistency in the syntax.  Should \777 be rejected
as a character, to eliminate this inconsistency?  Or why not reject
the whole range from \400 - \777?
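
For reference, this is the command-line special case Tom mentions, as
I understand perlrun (the file name is just a placeholder):

    # -0777 cannot be a real character value, so it slurps whole
    # files, as if $/ had been undef'd:
    perl -0777 -ne 'print length' somefile.txt
    # roughly equivalent to:
    perl -ne 'BEGIN { undef $/ } print length' somefile.txt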


>> This seems troublesome, and I'd wonder how it worked if that *were*
>> a legal sequence.  And I wonder the ramifications of "breaking" it;
>> it's really quite possible they're even worth doing, but I don't know.
>>
>> Again, I've never been somewhere a character or byte's ever
>> been more than 8 bits *NOR LESS THAN* 6, so I don't know what
>> expectations or experience in such hypothetical places might be.
>>
>> I'm sure some out there have broader experience than mine, and
>> hope to hear from them.
>>
>>     ^-----------------------------^
>>     | SUMMARY of Exposition Above |
>>     +=============================+
>>
>>  *  I agree there's a bug.
>>
>>  *  I believe Karl has produced a reasonable patch to fix it.
>>
>>  *  I wonder what *else* might/should also change in tandem with
>>     this estimable amendment so as to:
>>
>>     ?  avoid evoking or astonishing any hobgoblins of
>>        foolish inconsistency (ie: breaking expectations)
>>
>>     ?  what (if any?) backwards-compat story might need spinning
>>        (ie: breaking code, albeit cum credible Apologia)
>>
>> Hope this makes some sense now. :(
>>
>> --tom
>>
>> PS:  And what about *Perl VI* in this treatment of "\0ctals", eh?!

Not sure what Perl VI is?  An implementation of VI in Perl?  An 
interface between Perl and VI?  I'm not a VI user, nor likely to become one.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


