
Re: [perl #119239] started out as doc clarification needed in 'eval'...but...

From: Ricardo Signes
Date: August 16, 2013 14:54
Subject: Re: [perl #119239] started out as doc clarification needed in 'eval'...but...
Message ID: 20130816145358.GB17471@cancer.codesimply.com
* Linda Walsh via RT <perlbug-followup@perl.org> [2013-08-16T03:13:14]
> If perl had taken that input as latin1, then why wouldn't I have seen
> the wide character warning on output?

Because Latin-1 has no wide characters.

Really, I don't like to bring up Latin-1.  It just confuses things.

These days, Perl is pretty good about acting like it only knows about code
points.  (Let's put aside what is stored in the octets used internally for the
scalar in memory.)  A string is a sequence of codepoints, which are just
non-negative integers.

"use utf8" says "while you read this source code in, decode it as UTF-8 and
use THOSE codepoints for everything, rather than the octets encoding it."

When you print, this happens:

  codepoints-in-your-string => fh layers => output destination

One common layer is encoding, which will encode your codepoints into UTF-8 (or
whatever) so that the output destination gets only octets, since UTF-8 encoding
results in a sequence of 8-bit values.
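As a sketch (assuming a UTF-8 terminal), pushing an :encoding layer onto
STDOUT makes that encode step explicit:

```perl
use strict;
use warnings;

# Push an :encoding layer onto STDOUT; from here on, the codepoints in
# strings are encoded to UTF-8 octets on their way out.
binmode STDOUT, ':encoding(UTF-8)';

my $str = "\x{72AC}\x{591C}\x{53C9}";   # 犬夜叉 as three codepoints
print $str, "\n";                       # nine UTF-8 octets leave the process
```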

If you leave out an encoding layer, and your codepoints include things >255,
then there will be a warning, because you can't send 0x0100 to a bytestream.
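A minimal demonstration of that warning, catching it with a __WARN__ hook
instead of letting it reach the terminal (the temp file is just a convenient
layer-less bytestream):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $warning = '';
local $SIG{__WARN__} = sub { $warning = $_[0] };

my ($fh) = tempfile();      # a plain bytestream: no :encoding layer
print {$fh} "\x{0100}\n";   # a codepoint > 0xFF can't be one octet

print STDOUT $warning;      # "Wide character in print at ..."
```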

Consider this program:

  use 5.18.0;
  {
    my $str = "“犬夜叉”";
    my @codepoints = split '', $str;
    say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;
    say $str;
    say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
  }

  say '-' x 78;

  {
    use utf8;
    my $str = "“犬夜叉”";
    my @codepoints = split '', $str;
    say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;
    say $str;
    say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
  }

In both cases, we're "say"-ing to STDOUT, which has no encoding layer applied.

The first block succeeds at sending the "right" thing to the terminal (assuming
the terminal is in UTF-8).  The regexp fails, though, because none of the
*fifteen* codepoints in the string has the InCJK property — and it is *right*
to fail.  The string is clearly *binary* data, not a text string... but it's
only clear to a human.  Perl doesn't, and can't, know.  It treats all strings
like text when you do texty stuff like matching.

The second block also "succeeds" at sending the right thing, but it's really a
guess.  perl sees that you're trying to fit U+201C into a byte-wide output
stream.  It sighs, emits a warning, then sends U+00E2 U+0080 U+009C in its
place.  The sigh and the warning are because you should have explicitly
encoded.  Meanwhile, the regexp match in the second block *does* match, because
most of those (> 0xFF) codepoints *do* match InCJK.

So, how do you know which string contains raw octets from files or terminal
reads (like the string in the first block) versus strings that contain Unicode
codepoints (like the string in the second block)?

** The only answer is: strict discipline

You have to keep track of what you've read in, whether from the source code, a
filehandle, the terminal (which is a filehandle), or anywhere else, and then
never forget.  The common practice is (or should be) to decode all input
immediately upon reading it from a bytestream, and to encode it immediately
before writing it to a bytestream.
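That discipline can be sketched with the core Encode module: decode octets at
the input boundary, work in codepoints everywhere inside, encode at the output
boundary.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets arrive from some bytestream; decode them immediately.
my $octets = "\xE7\x8A\xAC";             # the three UTF-8 octets for 犬
my $text   = decode('UTF-8', $octets);   # now one codepoint, U+72AC

printf "codepoints: %d\n", length $text; # 1, not 3

# Just before writing to a bytestream, encode back to octets.
my $out = encode('UTF-8', $text);
printf "octets: %d\n", length $out;      # 3 again
```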

> use 5.6.16;

^--- I think you meant something else, but it's irrelevant. :-)

> use utf8;
> #use P;
> use warnings;
> my $name= [qw (犬夜叉)];
> my $band={band => "Queensrÿche"};
> printf "string=%s, len=%s\n", $name->[0], length($name->[0]);
> printf "band=%s\n", $band->{band};
> ----
> I get corrupted output:
> /tmp/s.pl
> Wide character in printf at /tmp/s.pl line 8.
> string=犬夜叉, len=3
> band=Queensr
> 
> ----
> Isn't this sort of the opposite of what one would expect?

This is exactly right.

You have failed, like many, to grasp what makes Queensrÿche so great.  It isn't
Geoff Tate or their cool logo.  It's that the Unicode codepoint for ÿ is
U+00FF.  That means that it fits into a byte, so when you try to print it out,
Perl doesn't realize it needs to switch to emitting UTF-8.

This affects any codepoint between 0x80 and 0xFF inclusive, because they're too
big to be in the part where codepoints UTF-8-encode to their own value in one
octet, but not big enough to alert perl that the codepoint needs to be emitted
as its (happily very-very-close-to-UTF-8) internal representation.

If you're curious as to what all these codepoints are:

  perl -Mcharnames -E 'printf "U+%04X: %s\n", $_, charnames::viacode($_) for (0x80 .. 0xFF)'

This program suggests that Mötley Crüe will also trigger The Heavy Metal
Unicode Problem.
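The cure for both bands is the explicit encoding described above; a sketch,
assuming a UTF-8 terminal:

```perl
use strict;
use warnings;
use utf8;                             # decode this source code as UTF-8
use 5.010;

binmode STDOUT, ':encoding(UTF-8)';   # encode codepoints on the way out

say "Queensrÿche";                    # U+00FF now reaches the terminal
say "Mötley Crüe";                    # as proper UTF-8 octets, no warning
```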

So, in conclusion:  the solution to the heavy metal problem plaguing Perl
programs is strict discipline.  I guess Pastor Mangielo was right after all.

-- 
rjbs
