perl.perl5.porters | Postings from December 2004

On read and unicode

December 24, 2004 18:24
Currently if you do:

my $in = "f\xfezz";
print "Old in='$in'\n";
print unpack("H*", $in), "\n";
read(STDIN, $in, 3, 2);
print "Now in='$in'\n";
print unpack("H*", $in), "\n";
print "is utf8:", utf8::is_utf8($in) ? 1 : 0, "\n";

and type "abc" as input, you get:

Old in='fþzz'
Now in='fþabc'
is utf8:0

So it counted 2 (full UTF-8) characters forward in $in, then dropped the
utf8 flag and appended the 3 new bytes.

I think this is an unnecessary exposure of the internal format of the old
$in string, and that

     $rc = read($fh, $buf, $len, $off)

should basically behave like:

     # Keep only the first $off characters of the old buffer:
     $buf = substr($buf, 0, $off);
     # With the current semantics for the raw/unicode-ness of the filehandle:
     $rc = read($fh, my $tmp, $len);
     $buf .= $tmp if $rc;

In the above case that would return the same (internal) byte sequence,
but the utf8 flag would be on, and the second char value would be preserved
instead of being expanded to its utf-8 encoding, giving $in the
(to my mind) more logical value "fþabc".
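
To make the proposal concrete, here is a minimal sketch of the character-based behaviour as a plain Perl sub. The name read_at_offset is hypothetical (it is not part of Perl), and the sketch uses an in-memory filehandle in place of STDIN and utf8::upgrade to force the internal UTF-8 representation of the old buffer; both are assumptions made only so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Perl builtin: emulates the proposed
# semantics of read($fh, $buf, $len, $off) in terms of characters.
# The buffer is modified in place through the $_[1] alias.
sub read_at_offset {
    my ($fh, undef, $len, $off) = @_;
    # Keep the first $off *characters*, whatever the internal encoding.
    $_[1] = substr($_[1], 0, $off);
    # Plain read into a temporary, using the filehandle's own layers.
    my $rc = read($fh, my $tmp, $len);
    # Append only if something was actually read.
    $_[1] .= $tmp if $rc;
    return $rc;
}

my $in = "f\xfezz";
utf8::upgrade($in);    # assumption: old buffer is stored as UTF-8 internally

# In-memory filehandle standing in for STDIN with "abc" typed as input.
open(my $fh, '<', \"abc") or die "Cannot open in-memory handle: $!";

my $rc = read_at_offset($fh, $in, 3, 2);

printf "rc=%d length=%d second=0x%02x\n",
    $rc, length($in), ord(substr($in, 1, 1));
print "is utf8:", utf8::is_utf8($in) ? 1 : 0, "\n";
```

With this emulation the second character keeps its value 0xfe and the result is five characters long, independent of how the old buffer happened to be stored.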
