On read and unicode

From: perl5-porters
Date: December 24, 2004 18:24
Subject: On read and unicode
Message ID: cqiiuu$tdn$1@post.home.lunix

Currently if you do:

my $in = "f\xfezz";
utf8::upgrade($in);
print "Old in='$in'\n";
print unpack("H*", $in), "\n";
read(STDIN, $in, 3, 2);
print "Now in='$in'\n";
print unpack("H*", $in), "\n";
print "is utf8:", utf8::is_utf8($in) ? 1 : 0, "\n";

and type "abc" as input, you get:

Old in='fþzz'
66c3be7a7a
abc
Now in='fþabc'
66c3be616263
is utf8:0

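So read counted 2 (full utf8) chars forward in $in, then dropped the utf8
flag and appended the 3 new bytes. The old þ is left behind as its two
encoding bytes (c3 be), which no longer represent a single character.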
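
To make concrete why dropping the flag matters, here is a minimal sketch
(not from the original report) that uses utf8::encode and utf8::decode to
show the same three octets read once as bytes and once as characters. The
variable names are purely illustrative.

```perl
use strict;
use warnings;

my $chars = "f\xfe";      # two characters: 'f' and U+00FE (þ)

my $bytes = $chars;
utf8::encode($bytes);     # three octets with the flag off: 66 c3 be
printf "flag off: length=%d hex=%s\n", length($bytes), unpack("H*", $bytes);

my $again = $bytes;
utf8::decode($again);     # same octets, now interpreted as UTF-8 characters
printf "flag on:  length=%d second=0x%02x\n",
    length($again), ord(substr($again, 1, 1));
```

With the flag off the string has length 3 and the þ has become two
unrelated bytes; with the flag on it has length 2 and the second character
is 0xfe again.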

I think this is an unnecessary exposure of the internal format of the old
$in string, and that

     $rc = read($fh, $buf, $len, $off)

should basically behave like:

     # Keep the first $off characters of the old buffer:
     $buf = substr($buf, 0, $off);
     # With the current semantics for the raw/unicode-ness of the filehandle:
     $rc = read($fh, my $tmp, $len);
     $buf .= $tmp if $rc;


In the above case the internal byte sequence would be the same, but the
utf8 flag would stay on, so the second char would keep its value (0xfe)
instead of being expanded to its utf-8 encoding. That gives $in the
(to my mind) more logical value "fþabc".
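
The proposed behaviour can be approximated today with a small wrapper. The
sketch below is only an illustration: read_at_offset is a hypothetical
helper name, not a Perl builtin, and an in-memory filehandle stands in for
STDIN with "abc" typed. It assumes $off does not exceed the current length
of the buffer (the real read pads with "\0" in that case, which this sketch
does not attempt).

```perl
use strict;
use warnings;

# Hypothetical helper: emulates the proposed semantics of
# read($fh, $buf, $len, $off) using substr and a plain 3-argument read.
sub read_at_offset {
    my ($fh, $buf_ref, $len, $off) = @_;
    # Keep the first $off characters (not bytes) of the existing buffer.
    $$buf_ref = substr($$buf_ref, 0, $off);
    # Read into a scratch variable with the handle's own layer semantics.
    my $rc = read($fh, my $tmp, $len);
    # Ordinary concatenation preserves the existing character values.
    $$buf_ref .= $tmp if $rc;
    return $rc;
}

my $in = "f\xfezz";
utf8::upgrade($in);

# In-memory handle standing in for STDIN.
my $data = "abc";
open(my $fh, '<', \$data) or die "cannot open in-memory handle: $!";
my $rc = read_at_offset($fh, \$in, 3, 2);
close($fh);

printf "rc=%d length=%d second char=0x%02x\n",
    $rc, length($in), ord(substr($in, 1, 1));
```

Here $in ends up with 5 characters and the second one is still 0xfe, which
is the "fþabc" result argued for above.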
