develooper Front page | perl.perl5.porters | Postings from April 2007

Re: Simple things should be simple (was: Re: Smack!)

Juerd Waalboer
April 20, 2007 08:44
Tom Christiansen skribis 2007-04-20  9:23 (-0600):
> > Juerd wrote:
> > > Thou shalt not use
> But it's also the only simple thing that works simply.  See below.

Well, for some definition of "works" that I like to avoid in production
code ;)

> I don't really think it's "fair" that uc/lc/lcfirst/ucfirst/regexp
> classes work fine on all characters but those 128..255.  That is,
> 0..127 work fine, and 256..inf work fine, but not the middle ones.

It works on 128..255 iff the internal encoding happens to be UTF8. But
that means that the Perl user has to make sure the data is upgraded,
while upgrading data and setting UTF8 flags and such was supposed to be
transparent, and the user was not supposed to care about these things.
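
To make that concrete, here is a minimal sketch of the behaviour, assuming a
perl where neither "use locale" nor the later unicode_strings feature is in
effect. The variable names are mine, purely for illustration.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later, so the legacy rules discussed in this thread stay in effect.

my $s = "\xE9";    # one character, code point 0xE9 (e with acute)

# Internal encoding is still plain bytes here.
my $uc_before   = uc $s;
my $word_before = $s =~ /^\w$/ ? 1 : 0;

utf8::upgrade($s); # same character, now stored as UTF-8 internally

my $uc_after   = uc $s;
my $word_after = $s =~ /^\w$/ ? 1 : 0;

printf "before upgrade: uc gives %02X, \\w matches: %d\n",
    ord $uc_before, $word_before;
printf "after upgrade:  uc gives %02X, \\w matches: %d\n",
    ord $uc_after, $word_after;
```

The string holds the same character both times; only the internal
representation differs, and yet uc and \w give different answers.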

> It's to my mind gratuitously onerous to the programmer that he should 
> have to invoke a weird, non-loaded, built-in utf8::upgrade() on 
> each and every datum in his program that he wants to work right 
> with respect to Unicode semantics all because of this whole.  That is
> not simple.  He should not have to know to do something weird 
> to certain code points.  He especially doesn't want a per-datum fix;
> he wants something he can tell his script.

Yes! However, this thing to tell the script should not change EVERY
string, and that's where goes wrong. It should remain
possible to use byte strings and byte operations, because we need that
for communication with things outside our Perly sources.
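
A rough sketch of what I mean by keeping both kinds of string around,
assuming the outside world speaks UTF-8. Encode's decode and encode are real;
the sample data and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a socket or file: "caf" plus a 2-byte e-acute.
my $bytes_in = "caf\xC3\xA9";

# Boundary in: turn the byte string into a text string.
my $text = decode('UTF-8', $bytes_in);

# Character operations happen on the text string.
my $shout = uc $text;

# Boundary out: turn the text string back into bytes.
my $bytes_out = encode('UTF-8', $shout);

printf "in: %d bytes, text: %d characters, out: %d bytes\n",
    length $bytes_in, length $text, length $bytes_out;
```

The byte strings at either end have to stay byte strings, which is why a
switch that silently changes every string is the wrong tool.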

> See, I can't even read it in and have it behave!  Blech!  It's true that we
> do have a utf8 pragma, but that doesn't do any good either, of course,
> simply to say I want: granting 128..255 proper character class semantics.

Agreed, but we don't have a way to say that yet. Unfortunately. :(

The utf8 pragma only says: my source code is written in UTF-8.
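
A small sketch of that, assuming the script file itself is saved as UTF-8
(that is an assumption about your editor, not something perl checks):

```perl
use strict;
use warnings;

my ($len_with, $len_without);

{
    use utf8;                    # literals in this block are decoded from UTF-8
    $len_with = length "é";      # counted as characters
}

{
    no utf8;                     # literals in this block are taken byte by byte
    $len_without = length "é";   # counted as the bytes in the source file
}

print "with use utf8: $len_with, without: $len_without\n";
```

The pragma is lexically scoped and only affects how the source text is read.
It does nothing for data read at run time, and nothing for the semantics of
128..255.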

> In the meantime, since 
>     "\xDF"
>     "\x{00DF}"  # THIS IS UNICODE, DARN IT!
>     "\337"

These three are syntactic variants of exactly the same thing. \x{} is
not unicode specific -- it's just a variable length version of \x.
\x{ff} is a perfectly normal way to get the *byte* 0xff (which when used
as a character is assumed to be ÿ (yuml)).
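
A quick check you can run yourself (a sketch; the variable names are only for
illustration, and this is without "use encoding"):

```perl
use strict;
use warnings;

# The three spellings from the quoted message.
my @variants = ("\xDF", "\x{00DF}", "\337");

# All three compare equal as strings.
my $all_equal =
    ($variants[0] eq $variants[1] && $variants[1] eq $variants[2]) ? 1 : 0;

# None of them carries the internal UTF8 flag.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } @variants;

# Only a code point above 255 forces the internal UTF-8 representation.
my $wide_flag = utf8::is_utf8("\x{100}") ? 1 : 0;

print "all equal: $all_equal, flags: @flags, flag for \\x{100}: $wide_flag\n";
```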

(Except under "use encoding" of course, which changes the meaning of

> I really, really think chr() should produce a character, for else its name
> evilly belies its function, but since it doesn't always live up to its
> name, I guess we in the meanwhile need

If we separate the meaning of "character" from "byte" fully, 100%, a lot
of legacy code starts breaking, even though the programmers never did
anything wrong. We need some form in between, and that's latin1, where
bytes and codepoints happily map to the same values.
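
To illustrate the latin1 middle ground, a sketch using the Encode module (the
choice of 0xE9 as sample code point is arbitrary):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(0xE9);                       # code point 0xE9

my $latin1 = encode('ISO-8859-1', $char);   # one byte
my $utf8   = encode('UTF-8', $char);        # two bytes

# In latin1 the byte value equals the code point, so old code that treats
# this string as a byte keeps working.
my $same_value = (ord($latin1) == ord($char)) ? 1 : 0;

printf "code point %02X, latin1 byte %02X, utf8 bytes %s\n",
    ord $char, ord $latin1, join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

With any other encoding in between, the byte value and the code point differ,
and chr(0xE9) can no longer be read both ways.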

> PS: I wonder whether the guy trying to use 
> 	use encoding "utf8"
>     Should have just said
> 	use utf8;
>     Is *that* correctly lexically scoped?

korajn salutojn,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy
