perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Juerd Waalboer
Date: March 31, 2007 09:27
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070331162711.GD31277@c4.convolution.nl

Ben Carter wrote 2007-03-31  4:08 (-0600):
> Unicode does not even HAVE characters, it has codepoints.  

Very good point, but Perl's documentation refers to codepoints as
"characters", and does that rather consistently.

I'm considering sweeping through the docs and changing it all, but it
would be a lot of work and a huge patch. I wonder if it's worth that.

> Now consider the case of
>   $y = chr(1000);
> Clearly whatever is in $y cannot be a single octet.  The way Perl
> currently works is that now $y is considered to be a string of Unicode
> codepoints. 

Yes.

But to go into a bit more detail for the more interesting case of
chr(233): this is either a byte string with only one byte, or a text
string with only one cha^Wcodepoint. Perl doesn't know, or care, so the
programmer has to.
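
A minimal sketch of that ambiguity, assuming a reasonably modern Perl
with the core Encode module (the variable names are only for
illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# One value, two possible readings: the single byte 0xE9, or the
# single codepoint U+00E9. Perl keeps no record of which was meant.
my $s   = chr(233);
my $len = length $s;                      # one element either way

# Changing the internal representation does not change the string.
my $t = $s;
utf8::upgrade($t);
my $same = ($s eq $t) ? 1 : 0;

# Reading 1: $s is a byte string, so its single byte is used as-is.
my $hex_as_bytes = unpack("H*", $s);

# Reading 2: $s is a text string, so it has to be encoded before it
# leaves the program; as UTF-8 that one codepoint becomes two bytes.
my $hex_as_text = unpack("H*", encode("UTF-8", $s));

print "length:      $len\n";
print "same string: $same\n";
print "as bytes:    $hex_as_bytes\n";
print "as text:     $hex_as_text\n";
```

Which of the two readings is right depends on where the value came
from and where it is going, which is exactly what the programmer has
to keep track of.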

> So $y contains a single codepoint, U+03E8.  The internal flag is used
> to indicate that the internal data pointer points to something that is
> a "Unicode codepoint string".

No, see Abigail's response for clarification.

>   print unpack("H*", pack("C", 1000));

Feeding 1000 to C has undefined behaviour: the C type can only handle
values 0..255, and there's no documentation defining what happens if you
feed it something <0 or >255. A similar thing occurs with floating point
numbers, like 64.5. The current implementation truncates that to 64,
without warning.
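
A small sketch for trying this out. What it prints for 1000 depends on
the implementation; perldiag describes recent perls as behaving as if
you had written $x & 255, with a warning in the "pack" category. The
pack_octet helper is hypothetical, not part of Perl, and shows one way
to reject out-of-range input instead of relying on that behaviour:

```perl
use strict;
use warnings;
no warnings 'pack';    # keep this demo quiet about out-of-range values

# Out-of-range and fractional input to "C": no documented guarantee.
my $wrapped   = unpack("H*", pack("C", 1000));
my $truncated = unpack("H*", pack("C", 64.5));
print "pack C, 1000 -> $wrapped\n";
print "pack C, 64.5 -> $truncated\n";

# Hypothetical helper: accept only plain integers in 0..255.
sub pack_octet {
    my ($n) = @_;
    die "not an integer in 0..255: $n\n"
        unless $n =~ /\A[0-9]+\z/ && $n <= 255;
    return pack("C", $n);
}

my $ok = unpack("H*", pack_octet(233));
print "pack_octet(233) -> $ok\n";

# Both of these die instead of silently producing a byte.
for my $bad (1000, 64.5) {
    my $res = eval { pack_octet($bad); 1 } ? "accepted" : "rejected";
    print "pack_octet($bad): $res\n";
}
```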

> If you expect values over 255, then you should not use "C".

Indeed!

> Of course if you have values over 255 you have to use "U" in unpack,
> that only makes sense!  

If these values are codepoints, yes. But if they're just numbers, other
unpack templates, like perhaps N or V are better.
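
To illustrate the difference, a short sketch that packs the same value
1000 once as a codepoint and twice as a plain 32-bit number (N is
big-endian, V is little-endian; the variable names are illustrative):

```perl
use strict;
use warnings;

my $n = 1000;

# As a codepoint: the result is a one-character text string.
my $as_codepoint = pack("U", $n);
my $cp_len = length $as_codepoint;
my $cp_ord = ord $as_codepoint;

# As a number: the result is a fixed-width, four-byte string.
my $be_hex = unpack("H*", pack("N", $n));   # big-endian
my $le_hex = unpack("H*", pack("V", $n));   # little-endian

# Numbers round-trip through the matching unpack template.
my ($back) = unpack("N", pack("N", $n));

print "U: length $cp_len, ord $cp_ord\n";
print "N: $be_hex\n";
print "V: $le_hex\n";
print "N round trip: $back\n";
```

N and V also make the width and byte order explicit, which matters as
soon as the packed data is exchanged with another program.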

> [1] I am deliberately ignoring the box in the corner labeled "EBCDIC".

Oh, so am I. In fact, I've probably never even seen such a box in my
short life so far.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
