
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

karl williamson
October 25, 2008 19:48
And I forgot to mention that setting $v = "\400" is the same thing as
setting it to chr(256) or "\x{100}".
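
To make that concrete, here is a minimal sketch (the variable names are
only for illustration) comparing the three spellings in an ordinary
double-quoted string. It prints the ordinal and length rather than the
character itself, to avoid wide-character output warnings.

```perl
use strict;
use warnings;

# Three spellings that should all yield the one-character string
# whose code point is 256 (U+0100).
my $octal = "\400";
my $hex   = "\x{100}";
my $chr   = chr(256);

printf "ord=%d length=%d\n", ord($octal), length($octal);
print "all three are equal\n" if $octal eq $hex && $hex eq $chr;
```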

Thus the only inconsistency I am aware of is in the area I patched. 
But, as it's becoming increasingly obvious as I subscribe to this list, 
my knowledge of the Perl language is minuscule.

I don't understand Glenn's point about using octal to fill whatever a 
byte is on a machine, but no more.  Suppose there were a nine bit byte 
machine (which one could argue is the maximum the C language 
specification allows from their limiting a byte to be filled by 3 octal 
digits).  What would those extra high-bit-set patterns represent if not 
characters?  And what could one do with them if not?

So, it seems to me that one either limits an octal constant to \377, or 
one allows it up to \777 with them all potentially mapping into the 
characters (or code points if you prefer) whose ordinal number 
corresponds to the constant.  If we limit them, there is the possibility 
that existing code will break, as in many places now they aren't 
limited.  I don't know where all those places are.  If my patch is 
accepted, then it gets rid of one place where there is an inconsistency; 
and I know of no others.

Maybe we should let some others weigh in on the matter.

Tom Christiansen wrote:
> Dear Glenn and Karl,
>   +=============================+
>   | SUMMARY of Exposition Below |
>   v-----------------------------v
>   *  I fully agree there's a bug.
>   *  I believe Karl has produced a reasonable patch to fix it.
>   *  I wonder what *else* might/should also change in tandem
>      with this estimable amendment so as to:
>      ?  avoid evoking or astonishing any hobgoblins of
>         foolish inconsistency (ie: breaking bad expectations)
>      ?  what (if any?) backwards-compat story might need
>         spinning (ie, breaking code, albeit cum credible Apologia)
> tom++  [-: Congrats, you've hit 10% read; only 90% below! :-]
> /*
>  */
> On Saturday, 25 October 2008 at 13:20:20 PDT,
> Glenn Linderman wrote:
>> But not by interpreting them as a two-character octal escape,
>> followed by an ASCII 0-7 character (sorry Tom, I just can't
>> find a precedent for that!).
> Sorry?  Thank you, Glenn, for your courtesy.  Truth told,
> being human I do appreciate it. Yet at my hacker heart,
> I remain a meritocrat, or close to it.
> Thus apologies, let alone "deference"(?), are never obligatory
> when someone has something reasoned to contribute, even if it
> be to gently contradict someone--OR ANYONE.
> Technical arguments can, do, and must stand on their own, and no
> science-minded person should take the least offence for ever being
> shown he's been wrong in his calculations.  Indeed, he should be
> thankful for the enlightenment.
> And so I am; still, I appreciate your courtesy--and research.
> I may've been a bit loose in "shooting from the hip" by writing:
>>> I confess[/]guess I never *expected* "\400" to be "\x{100}",
>>> but rather "\x{20}0".
> That may come from having been myself recently bitten/burnt
> by all of "\1", "\11", and "\111" sometimes--and sometimes
> not--meaning octal specs for characters, even when I didn't
> intend this.
> How so?
> Because in true string-interpolation of the full qq!! variety,
> they all fit tidily into an 8-bit octet, wherein they *ALWAYS*
> mean chr(01), chr(011), and chr(0111).
> However, in "faux" qq!!ish processing, per m// and qr//, 
> whether they mean that in the regex *depends on how
> many captures regcomp()'s seen so far*.
>   /*
>    * Er, I *believe*.  See, another of my hunches is that
>    * (??{...}) *may* monkey-wrench these matters.  I've
>    * not looked into that, and rather prefer not to. :(
>    */
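
A small sketch of the difference described above, as I understand the
rule: in a pattern, \10 and higher is a backreference only if that many
capture groups have already been seen, and otherwise it is octal. The
test strings and variable names are made up for illustration, and
(??{...}) is not covered.

```perl
use strict;
use warnings;

# In a double-quoted string, "\11" is always chr(011), i.e. a tab.
my $string_is_tab = ("\11" eq "\t") ? 1 : 0;

# In a pattern with fewer than 11 capture groups, \11 is read as octal,
# so it matches a tab.
my $octal_in_regex = ("\t" =~ /^\11$/) ? 1 : 0;

# Once 11 groups have been seen, the same \11 is a backreference to
# group 11, so it matches a second "k" rather than a tab.
my $eleven_groups = qr/^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11$/;
my $backref_match = ("abcdefghijkk"  =~ $eleven_groups) ? 1 : 0;
my $tab_match     = ("abcdefghijk\t" =~ $eleven_groups) ? 1 : 0;

print "string=$string_is_tab octal=$octal_in_regex ",
      "backref=$backref_match tab=$tab_match\n";
```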
> So, when I saw "\40", knowing that *any* other digit would
> exceed the 0000 .. 0377 range, I hunched it would stop there.
> Wrongly.
> And whether "would" or "should" are the more operative, or
> at least more desired, modality is the heart of this entire
> discussion we're having.
> This would somewhat follow how "\x123" stops at \x12 (?)
> rather than (ever(?)) generate a single string of length
> one containing a char >8bits in length, giving instead a
> two-char string "\x{12}3", which is different than the
> longer string \x{123} would produce after encoding/decoding
> for UTF-8 etc output.
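
The \x case can be checked directly. This is only a sketch with
illustrative variable names; it compares the unbraced and braced forms
by length and ordinal.

```perl
use strict;
use warnings;

# Unbraced \x consumes at most two hex digits, so the "3" is literal.
my $unbraced = "\x123";
# Braced \x{...} takes all the digits inside the braces.
my $braced   = "\x{123}";

printf "unbraced: length=%d first=0x%02X rest=%s\n",
    length($unbraced), ord($unbraced), substr($unbraced, 1);
printf "braced:   length=%d ord=0x%X\n",
    length($braced), ord($braced);
```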
> I think the font of folly is found in the way that an
> *unbraced* \x takes TWO AND EXACTLY TWO CHARS following
> it...
>   ...whereas \<digit> takes 1, 2, *or* 3 (and what's this
> about MS-pain about it taking more, anyway?) octal digits,
> and that therein lay the rubbish that afflicted me.
> There's no way {say, braces} to delimit the octal escape's
> characters from what follows it, which seems to be the crux of
> the problem here.  We can't put Z<> or \& strings, per POD or
> troff respectively; we have to break them up.
> So you can't say "\{40}0" or "\0{40}0" or whatnot as you
> can with "\x{20}" and "\x{20}0".
> Now 5.10 gives us m/(stuff)\g{1}1/ to save one from the trouble
> that m/(stuff)\11/ would otherwise give you had you meant stuff,
> followed by the stuff in cap-1, then a literal digit-1, neither
> capture #(decimal-)11, ie \g{11}; nor chr(011), either.
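
A sketch of that \g{1}1 case, assuming perl 5.10 or later; the sample
strings are invented. With only one capture group in the pattern, the
unbraced \11 falls back to octal (a tab), which is the trap being
described.

```perl
use 5.010;
use strict;
use warnings;

my $text = "abab1";

# \g{1} followed by a literal "1": group 1, then the digit 1.
my $braced = ($text =~ /^(ab)\g{1}1$/) ? 1 : 0;

# Unbraced \11 with a single group is not "group 1, then 1";
# it is treated as octal 011, so it does not match $text.
my $unbraced = ($text =~ /^(ab)\11$/) ? 1 : 0;

# The same unbraced pattern does match "ab" followed by a tab.
my $unbraced_tab = ("ab\t" =~ /^(ab)\11$/) ? 1 : 0;

say "braced=$braced unbraced=$unbraced unbraced_tab=$unbraced_tab";
```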
> Because unbraced \x stops parsing after two hex digits, due both
> to compatibility with pre-Unicode days, when it would otherwise
> have generated a character whose code point would exceed
> U+00FF, and also because long strings really need \x{BADBEEF}ish
> delimiters for safety and clarity, I, putting no thought into it,
> cavalierly imagined that \0 might in this behave analogously to
> how \x does.
> Not that that's how it works now, nor how I should DESIRE it
> to work.  I'm just explaining my (lack-of-)thought-process;
> foolish hobgoblins of little minds and all, you know.
> *PLEASE* misconstrue none of my chatterish kibitzing on this
> thread as somehow disapproving of more reliable, more
> predictable, more understandable, and more explicable behavior.
> Those are all admirable goals, and I support them--full stop.
> Probably this naïveté derives from having no direct experience with
> "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits.
> Even on the shorter end of that stick, I've only a wee, ancient bit
> of experience with bytes <8 bits.  That is, long ago and far away,
> we used to pack six, 6-bit "RAD-50" chars into a 36-bit word under
> EXEC-8, and sometimes used them even from DEC systems.
> (See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow
> from there; I guess we get pack()'s BER w* whatzitzes instead.)
> Karl has clearly identified an area crying out for improvement
> (read: an indisputable bug), and even better, he's sacrificed
> his own mental elbow-grease to address the problem for the
> greater good of us all.
> I can't see how to ask more--and so I strongly applaud his
> generous contribution to the greater good of making the
> world a better place.
> I'm still a little skittish though, because as far as I noticed,
> perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/
> scenarios alone: ie, regex m/atche/s.
> I meant only to say that addressing patterns alone while leaving
> both strings and the CLI for -0 setting $/ out of the changes
> risked introducing a petty inconsistency: a conceptual break
> between `perl -0777` as an unimplementable "octal" byte spec
> that therefore means undef()ing $/.
> Plus, there's how the docs equate $/ = "\0777" to undef $/.
> This seems troublesome, and I'd wonder how it worked if that *were*
> a legal sequence.  And I wonder the ramifications of "breaking" it;
> it's really quite possible they're even worth doing, but I don't know.
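
For what it's worth, here is a sketch that looks at both halves of that
question. It runs a child perl with -0777 to see whether $/ is defined
there, and separately inspects the string literal "\0777". It assumes
$^X points at a runnable perl and that list-form pipe open is available
on the platform; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Ask a child perl started with -0777 whether its $/ is defined.
open(my $child, '-|', $^X, '-0777', '-e',
     'print defined($/) ? "defined" : "undef"')
    or die "cannot run $^X: $!";
my $state = <$child>;
close $child;

# In a string literal, an octal escape consumes at most three digits,
# so "\0777" is "\077" followed by a literal "7".
my $literal = "\0777";

print "\$/ under -0777 is $state\n";
printf "\"\\0777\" has length %d, first ord %d\n",
    length($literal), ord($literal);
```

So the command-line switch and the string literal appear to be two
separate matters, at least on the perls I have tried.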
> Again, I've never been somewhere a character or byte's ever
> been more than 8 bits *NOR LESS THAN* 6, so I don't know what
> expectations or experience in such hypothetical places might be.
> I'm sure some out there have broader experience than mine, and
> hope to hear from them.
>     ^-----------------------------^
>     | SUMMARY of Exposition Above |
>     +=============================+
>  *  I agree there's a bug.
>  *  I believe Karl has produced a reasonable patch to fix it.
>  *  I wonder what *else* might/should also change in tandem with
>     this estimable amendment so as to:
>     ?  avoid evoking or astonishing any hobgoblins of
>        foolish inconsistency (ie: breaking expectations)
>     ?  what (if any?) backwards-compat story might need spinning
>        (ie: breaking code, albeit cum credible Apologia)
> Hope this makes some sense now. :(
> --tom
> PS:  And what about *Perl VI* in this treatment of "\0ctals", eh?!
