develooper Front page | perl.perl5.porters | Postings from April 2007

Simple things should be simple (was: Re: Smack!)

Tom Christiansen
April 20, 2007 08:24
>>Thou shalt not use

>>No, really, it's broken in so many ways, with so many side-effects and
>>actions at a distance.

But it's also the only simple thing that works simply.  See below.

>>    my $foo = chr 196;  # Auml
>>    utf8::upgrade($foo);
>>    print lc $foo;      # auml

>I could *swear* I tried that.  But yes, you're right.

I was taking uc(utf8::upgrade($foo)), which doesn't work because it doesn't
return the new string but rather changes its argument.
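To make that mistake concrete, here is a minimal sketch (the variable names are mine). It relies on utf8::upgrade() changing its argument in place and, as the utf8 documentation describes it, returning an octet count rather than the string, so uc() ends up uppercasing a number.

```perl
use strict;
use warnings;

my $foo = chr 196;                      # A-umlaut, stored as a single byte

# Wrong: uc() is applied to upgrade()'s return value, not to $foo.
my $wrong = uc(utf8::upgrade($foo));

# Right: $foo itself was upgraded in place, so lc() now sees a character.
my $right = lc $foo;

printf "wrong=%s right=U+%04X\n", $wrong, ord $right;
```

The two-step form (upgrade first, then call the case function on the variable) is the only one that works.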

I don't really think it's "fair" that uc/lc/lcfirst/ucfirst/regexp
classes work fine on all characters but those 128..255.  That is,
0..127 work fine, and 256..inf work fine, but not the middle ones.
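A small sketch of the hole, assuming a plain script with no locale or encoding pragma in effect. The helper name lc_changes is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if lc() changes the character at this code point.
sub lc_changes { my $c = chr $_[0]; return lc($c) ne $c }

my $ascii  = lc_changes(0x41);    # "A", in 0..127
my $latin1 = lc_changes(0xC4);    # A-umlaut, in 128..255
my $wide   = lc_changes(0x100);   # A-macron, in 256..inf

# Same hole in regexp classes: \w rejects the byte-stored A-umlaut
# but accepts the very same character after an in-place upgrade.
my $auml     = chr 0xC4;
my $w_before = $auml =~ /\w/ ? 1 : 0;
utf8::upgrade($auml);
my $w_after  = $auml =~ /\w/ ? 1 : 0;

printf "ascii=%d latin1=%d wide=%d w_before=%d w_after=%d\n",
    ($ascii ? 1 : 0), ($latin1 ? 1 : 0), ($wide ? 1 : 0), $w_before, $w_after;
```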

It's to my mind gratuitously onerous to the programmer that he should
have to invoke a weird, non-loaded, built-in utf8::upgrade() on
each and every datum in his program that he wants to work right
with respect to Unicode semantics, all because of this hole.  That is
not simple.  He should not have to know to do something weird
to certain code points.  He especially doesn't want a per-datum fix;
he wants something he can tell his script.

I have tried many combinations of environment variables, locales,
and pragmas, and I just can't get this to happen any other way.
Not even *this* works:

    perl -le 'binmode(STDOUT, "latin1"); print STDOUT chr 223' |
    perl -e  'binmode(STDIN, "latin1");  print uc scalar <STDIN>'

See, I can't even read it in and have it behave!  Blech!  It's true that we
do have a utf8 pragma, but of course that doesn't do any good either for
simply saying what I want: granting 128..255 proper character class
semantics.  And yet, and yet, *this* works:

    perl -Mencoding=latin1 -le 'print uc chr 223'

But you tell me encoding is broken and wrong, and it very much is, because
of its global scope and strange action at a distance.  But I have little
recourse.

So I'm very much in favor of *SOME* sort of pragma, locale setting,
envariable, command switch, or other setting that allows me finally and
forever simply to be able to uc(chr 223) and get back SS without
having to jump through hoops, calling a weird internal function every
time I turn around.
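The perls under discussion here have no such switch.  Purely as a sketch, and assuming a perl of 5.12 or later (which postdates this message), the lexically scoped unicode_strings feature looks like the kind of setting being asked for.  The variable names are illustrative only.

```perl
use strict;
use warnings;

my $sharp_s = chr 223;                 # sharp s, stored as a single byte

my $plain = uc $sharp_s;               # outside the feature's scope

my $scoped;
{
    use feature 'unicode_strings';     # lexically scoped; assumes perl >= 5.12
    $scoped = uc $sharp_s;             # no per-datum upgrade needed here
}

printf "plain: %d char(s), scoped: %d char(s)\n", length $plain, length $scoped;
```

Because the effect ends at the closing brace, this avoids the action at a distance that makes encoding objectionable.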

In the meantime, since 

    chr 223
    "\x{00DF}"  # THIS IS UNICODE, DARN IT!

all fail to specify Unicode codepoint U+00DF, for strings in 
their programs I would much favor telling people

    pack("U", 223)

and even that over what you suggest:

    do { my $chr = chr 223; utf8::upgrade($chr); $chr }

Consider the nonfunctional

    my_charfunc( chr 223 )

What can we replace that with so it works?

    my_charfunc( pack("U", 223) )

or

    my_charfunc( v223 )

or

    my_charfunc( do { my $chr = chr 223; utf8::upgrade($chr); $chr } )
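A quick way to compare these, as a sketch only.  The body of my_charfunc below is a stand-in I made up (it just uppercases its argument), and the v223 form is left out because I have not verified its behavior.

```perl
use strict;
use warnings;

# Hypothetical stand-in for my_charfunc(): just uppercases its argument.
sub my_charfunc { return uc $_[0] }

my $from_chr  = my_charfunc( chr 223 );
my $from_pack = my_charfunc( pack("U", 223) );
my $from_do   = my_charfunc( do { my $chr = chr 223; utf8::upgrade($chr); $chr } );

# Print lengths rather than the strings, to sidestep output-encoding issues.
print join(" ", map { length $_ } $from_chr, $from_pack, $from_do), "\n";
```

The pack("U", ...) and do-block forms both hand the function a string with character semantics, but the pack form is far less to type at every call site.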

I really, really think chr() should produce a character, for else its name
evilly belies its function, but since it doesn't always live up to its
name, I guess we in the meanwhile need

    # can't use chr() because it doesn't return
    # a unicode chr for 128..255!!!!
    sub unichr($) {  # expect integer
        return pack("U", $_[0]);
    }

But that doesn't work for someone who binmode()s an input stream to be
ISO8859-1, currently a very counterintuitively useless no-op.  For that
I guess we need this

    # can't use binmode(FH, "latin1") because it doesn't
    # correctly read in codepoints 128..255 in a way that
    # grants them unicode semantics, so must pass all input
    # from such a stream through
    sub unistr($) {
        my $str = $_[0];        # copy, so the caller's scalar is untouched
        utf8::upgrade($str);    # upgrades in place; returns no string
        return $str;
    }
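Here is how that per-line filtering might look in use, as a sketch.  The helper is named upgraded_copy to keep the example self-contained, and an in-memory filehandle stands in for a real Latin-1 input stream; both are my assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an upgraded copy, leaving the original alone.
sub upgraded_copy {
    my $str = $_[0];
    utf8::upgrade($str);
    return $str;
}

# Simulate a Latin-1 input stream with an in-memory filehandle.
my $raw = "stra\xDFe\n";
open(my $fh, "<", \$raw) or die "cannot open in-memory handle: $!";
my $line = <$fh>;
close $fh;

my $naive = uc $line;                   # sharp s passes through unchanged
my $fixed = uc(upgraded_copy($line));   # sharp s becomes SS

print length($naive), " ", length($fixed), "\n";
```

Every line read has to go through the helper before any case or character-class operation touches it, which is exactly the per-datum burden complained about above.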

This just isn't simple, so something seems broken.


PS: I wonder whether the guy trying to use
        use encoding "utf8"
    should have just said
        use utf8;
    Is *that* correctly lexically scoped?
