
Re: "use v5.36.0" should imply UTF-8 encoded source

August 1, 2021 08:24
From the keyboard of shmem [01.08.21,10:01]:

> From the keyboard of Felipe Gasper [31.07.21,20:53]:
> [..]
>> Another way to look at it: the content of the parsed strings actually 
>> differs between the two:
>> my $x = do { no utf8; "éé" };
>> my $y = do { use utf8; "éé" };
>> In the above, $x is a sequence of 4 code points (195, 169, 195, 169), 
>> whereas $y is a sequence of 2 code points (233, 233). That’s it; there is 
>> no other difference between $x and $y. Perl doesn’t know that $x is a “byte 
>> string” and $y is a “character string”; it just knows the code points.
> This actually depends on the utf8-awareness of the editor used to input
> that program text. Entered on a terminal with LANG=en_GB.utf8 via vi, both
> $x and $y are a sequence of 4 code points, the latter with the UTF8 flag
> set which condenses two code points into chr(233). Why? See explanation
> below, and please correct me if I am wrong.

Correcting myself: $y *is* two code points; its internal representation
is 4 bytes. Without the UTF8 flag, the internal representation is
identical to the code points (one byte per code point).

>  PV = 0x9085d0 "\303\251\303\251"\0 [UTF8 "\x{e9}\x{e9}"]
code points ---------------------------------^^^^^^^^^^^^
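The same can be checked without depending on the editor or the terminal
locale. What follows is only a minimal sketch and not code from this thread:
the octets are written as escapes, and utf8::decode() stands in for what
"use utf8" does to the literal, which is an approximation on my part. The
helpers code_points() and internal_bytes() are made-up names for this
example.

```perl
use strict;
use warnings;
use bytes ();   # load only, so bytes::length() is available

# The four octets a UTF-8 editor writes for "éé", spelled as escapes
# so the result does not depend on how this file is saved.
my $x = "\xC3\xA9\xC3\xA9";   # like the "no utf8" literal: 4 code points

# Assumption: decoding in place approximates what "use utf8" does
# to the same literal.
my $y = $x;
utf8::decode($y);             # now 2 code points, UTF8 flag on

# Hypothetical helper: the code points of a string, comma separated.
sub code_points {
    my ($s) = @_;
    return join ',', map { ord($_) } split //, $s;
}

# Hypothetical helper: size of the internal buffer in octets.
sub internal_bytes {
    my ($s) = @_;
    return bytes::length($s);
}

for my $pair ( [ x => $x ], [ y => $y ] ) {
    my ($name, $s) = @$pair;
    printf "%s: code points (%s), UTF8 flag %s, internal bytes %d\n",
        $name, code_points($s),
        utf8::is_utf8($s) ? 'on' : 'off',
        internal_bytes($s);
}
```

Both strings occupy 4 bytes internally; only the UTF8 flag decides whether
Perl reads them as 4 code points or 2. Devel::Peek's Dump($y) prints the
PV line quoted above.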

Sorry for my confusion :-P


_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                               /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}