Front page | perl.perl5.porters |
Postings from August 2021
Re: "use v5.36.0" should imply UTF-8 encoded source
From:
shmem
Date:
August 1, 2021 08:01
Subject:
Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID:
alpine.DEB.2.21.2108010826130.4388@mgm-net.de
From the keyboard of Felipe Gasper [31.07.21,20:53]:
[..]
> Another way to look at it: the content of the parsed strings actually differs between the two:
>
> my $x = do { no utf8; "éé" };
> my $y = do { use utf8; "éé" };
>
> In the above, $x is a sequence of 4 code points (195, 169, 195, 169), whereas $y is a sequence of 2 code points (233, 233). That’s it; there is no other difference between $x and $y. Perl doesn’t know that $x is a “byte string” and $y is a “character string”; it just knows the code points.
This actually depends on the utf8-awareness of the editor used to input
that program text. Entered on a terminal with LANG=en_GB.utf8 via vi, both
$x and $y are a sequence of 4 code points, the latter with the UTF8 flag
set which condenses two code points into chr(233). Why? See explanation
below, and please correct me if I am wrong.
Program written with LANG=en_GB.utf8 and its output piped to less(1):
#!/usr/bin/perl
use 5.10.0;
use Devel::Peek;
my $x = do { no utf8; "éé" };
my $y = do { use utf8; "éé" };
my $z = chr(233) x 2;
$| = 1;
say "\$x: ",$x; Dump $x;
say "\$y: ",$y; Dump $y;
say "\$z: ",$z; Dump $z;
__END__
$x: éé
SV = PV(0x8debc0) at 0x904530
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK)
PV = 0x908630 "\303\251\303\251"\0
CUR = 4
LEN = 10
COW_REFCNT = 1
$y: <E9><E9>
SV = PV(0x8dec40) at 0x904548
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
PV = 0x9085d0 "\303\251\303\251"\0 [UTF8 "\x{e9}\x{e9}"]
CUR = 4
LEN = 10
COW_REFCNT = 1
$z: <E9><E9>
SV = PV(0x8dea90) at 0x904b18
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x8f5de0 "\351\351"\0
CUR = 2
LEN = 10
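The difference between the stored bytes and the code points perl reports can be checked without relying on any editor or terminal setting. The following is a minimal sketch, not part of the original program: it spells the four UTF-8 bytes out as escapes and assumes that utf8::decode does to them what "use utf8" does to the literal. The helper code_points is a name made up for this illustration.

```perl
use strict;
use warnings;

# The four bytes a UTF-8 editor stores for the two-character literal.
my $x = "\xC3\xA9\xC3\xA9";

# Assumption: decoding in place mimics the effect of "use utf8" on the literal.
my $y = $x;
utf8::decode($y) or die "not valid UTF-8";

# Same construction as $z in the program above.
my $z = chr(233) x 2;

# Hypothetical helper: list the code points perl sees in a string.
sub code_points { join ",", map { ord($_) } split //, $_[0] }

printf "x: length %d, code points %s\n", length($x), code_points($x);
printf "y: length %d, code points %s\n", length($y), code_points($y);
printf "z: length %d, code points %s\n", length($z), code_points($z);
printf "y eq z: %s\n", $y eq $z ? "yes" : "no";
printf "UTF8 flag: y=%d z=%d\n",
    utf8::is_utf8($y) ? 1 : 0, utf8::is_utf8($z) ? 1 : 0;
```

At the Perl level $y has length 2 even though Dump shows CUR = 4, and it compares equal to $z although only one of the two carries the UTF8 flag.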
This had me confused all the time: why does a utf8 literal with the UTF8
flag set result in an ISO-8859 sequence? That's because the utf8 pragma
was introduced in times when terminals defaulted to some latin-1 variant
and allowed use of UTF-8, which resulted in the appropriate latin-1 string
representation. Now that terminals, editors and such pretty much always
default to using UTF-8, the utf8 pragma is meaningless except for weird
cases in which you want your literals to be treated as latin-1.
Entering the above program text in a terminal with LANG=en_GB.ISO-8859-1
produces the following:
Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xe9) at utf8-iso.pl line 5.
Malformed UTF-8 character (1 byte, need 3, after start byte 0xe9) at utf8-iso.pl line 5.
$x: éé
SV = PV(0x203dbc0) at 0x2063630
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK)
PV = 0x20678b0 "\351\351"\0
CUR = 2
LEN = 10
COW_REFCNT = 1
$y: ^@^@
SV = PV(0x203dc40) at 0x2063648
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
PV = 0x2067850 "\0\0"\0 [UTF8 "\x{0}\x{0}"]
CUR = 2
LEN = 10
COW_REFCNT = 1
$z: éé
SV = PV(0x203da90) at 0x2063cf0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2054e90 "\351\351"\0
CUR = 2
LEN = 10
$s: éé
SV = PV(0x203dc50) at 0x2063ca8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2042360 "\351\351"\0
CUR = 2
LEN = 10
So, to produce "éé" in $y, line 5 should - in an ISO-8859 or latin-1
environment - properly be written as
my $y = do { use utf8; "Ã©Ã©" };
because then the literal is valid UTF-8 expressed in latin-1.
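The byte-level reasoning behind the two "Malformed UTF-8" warnings and the rewritten line can be sketched as follows. This is an illustration under assumptions, not the code that was run above: the latin-1 source bytes are written as escapes, and hex_list is a helper name invented here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two bytes a latin-1 editor writes for the literal.
my $latin1_bytes = "\xE9\xE9";

# 0xE9 announces a three-byte UTF-8 sequence, so this decode attempt fails.
my $attempt  = $latin1_bytes;
my $is_valid = utf8::decode($attempt) ? 1 : 0;

# The bytes the source file would have to contain instead.
my $needed = encode('UTF-8', "\x{E9}\x{E9}");

# What those bytes look like when a latin-1 editor displays them.
my $shown = decode('ISO-8859-1', $needed);

# Hypothetical helper: hex dump of the code points in a string.
sub hex_list { join " ", map { sprintf "%02X", ord($_) } split //, $_[0] }

printf "latin-1 bytes valid as UTF-8: %s\n", $is_valid ? "yes" : "no";
printf "bytes needed in the file:     %s\n", hex_list($needed);
printf "seen in a latin-1 editor:     %d characters (%s)\n",
    length($shown), hex_list($shown);
```

A latin-1 editor therefore has to show four characters (U+00C3, U+00A9, twice) for the file to hold the bytes that "use utf8" will accept as two characters.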
> This would, I think, easily be the most disruptive, potentially “surprising” change yet introduced to a feature bundle.
>
> -FG
I agree. And as said above, the utf8 pragma is useless almost all of the
time and people get its effect backwards, since nowadays most people work
in a UTF-8 aware environment.
If you write your programs in a UTF-8 environment and send your output to
the same, perl already does the right thing, no matter whether you output
bytes or characters, because those bytes actually resemble valid UTF-8.
In a UTF-8 environment perl already does the right thing reading your
program. Characters vs. bytes gets interesting in regexes, but that's a
well-covered area, and then there are substr and chop/chomp, for which
detection or explicit declaration of bytes vs. chars makes sense.
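How chop, substr and a regex dot behave on the two kinds of string can be seen in a short sketch. It again assumes that utf8::decode stands in for "use utf8" on the literal; all variable names are chosen for this example only.

```perl
use strict;
use warnings;

# Byte string and its decoded character counterpart.
my $bytes = "\xC3\xA9\xC3\xA9";
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";

# chop removes one byte from the first, one whole character from the second.
my $chopped_bytes = $bytes;  chop $chopped_bytes;
my $chopped_chars = $chars;  chop $chopped_chars;

# A regex dot matches once per byte or once per character.
my $dots_bytes = () = $bytes =~ /./gs;
my $dots_chars = () = $chars =~ /./gs;

printf "after chop: %d bytes left vs. %d character left\n",
    length($chopped_bytes), length($chopped_chars);
printf "matches of /./: %d vs. %d\n", $dots_bytes, $dots_chars;
printf "substr(.., 0, 1): code point %d vs. %d\n",
    ord(substr($bytes, 0, 1)), ord(substr($chars, 0, 1));
```

On the byte string chop leaves a dangling 0xC3 at the end, which is no longer valid UTF-8 when printed to a UTF-8 terminal.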
0--gg-
--
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}