develooper Front page | perl.perl5.porters | Postings from August 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

Thread Previous | Thread Next
From:
Dan Book
Date:
August 20, 2021 17:17
Subject:
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
CABMkAVXdJB8+aR5as0p=st41Oi-S0d-GwgFt3HadU3eM8=8D8A@mail.gmail.com
On Fri, Aug 20, 2021 at 1:06 PM demerphq <demerphq@gmail.com> wrote:

> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>> Per recent IRC discussion …
>>
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>
>> The problem here is the naming. For example, consider `perl -e'my $foo =
>> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
>> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
>> encode “é” in UTF-8.
>
>
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should
> only be set when the rectangle has been proved conclusively to be a square.
> That the SQUARE flag is off does not mean the rectangle is not a square,
> merely that the square has not been proved to be such.
>
>
> The “UTF-8 flag”, however, is likely *not* set on this string. By
>> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
>> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
>> (in this case, only 1) aren’t valid UTF-8.
>>
>
> Except it is valid UTF-8: (at least in my utf8 terminal).
>
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
>   REFCNT = 1
>   FLAGS = (POK,IsCOW,pPOK,UTF8)
>   PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
>   CUR = 2
>   LEN = 10
>   COW_REFCNT = 1
>
> So the string is UTF-8.
>

The premise of this email seems to be about the internals of the string.
That is not the contents of the string (which is "\x{e9}" in this example).
Please re-evaluate in that context.

-Dan

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About