
Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Dan Book
Date: August 18, 2021 20:32
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CABMkAVVf+eXNSuhK_kBj7bVo1dvhL8ef0LsaiWFzCFZWB89tCA@mail.gmail.com

On Wed, Aug 18, 2021 at 4:24 PM Leon Timmermans <fawaka@gmail.com> wrote:

> On Wed, Aug 18, 2021 at 7:17 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>> Per recent IRC discussion …
>>
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>
>> The problem here is the naming. For example, consider `perl -e'my $foo =
>> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
>> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
>> encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this
>> string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo
>> has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code
>> points (in this case, only 1) aren’t valid UTF-8.
>>
>> The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a
>> “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little
>> sense except to the “highly initiated”.
>>
>> Another problem is “UTF-8” doesn’t really describe the “upgraded” format.
>> This format is what Perl historically called “lax UTF-8” and is now widely
>> called “generalized UTF-8”, which includes unpaired surrogates and code
>> points above Unicode’s maximum.
>>
>> PROPOSAL: Rename the following identifiers in code and documentation,
>> leaving macros for the old ones as aliases:
>> - SVf_UTF8        -> SVf_PVUPGRADED
>> - SvUTF8          -> Sv_PVUPGRADED
>> - SvUTF8_on       -> Sv_PVUPGRADED_on
>> - SvUTF8_off      -> Sv_PVUPGRADED_off
>> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>>
>> Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because
>> these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.
>>
>> BENEFITS: Over time, this rename will minimize the confusion between
>> Perl’s upgraded-PV storage format and UTF-8. The rename may also compel
>> current users of the language who hold mistaken mental models of the flag’s
>> purpose to reexamine their understanding, hopefully for the better.
>>
>> POTENTIAL COMPLICATIONS: The mismatch between amended documentation and
>> existing documentation may cause confusion; it should, though, be an
>> auspicious confusion that eventually clarifies rather than misleads.
>
>
> I would disagree. Perl code should not have to care/see what the internal
> encoding is (it's breaking the encapsulation, really), but perl's internals
> very much do and should care about the internal encoding.
>
> So to me this logic only makes sense for the perl-visible side of things
> (e.g. utf8::upgrade), not on the C-side.
>

I would agree, except that people who don't work on the internals also have
to use these functions (when writing XS code), and they misuse them because
they think the flag is related to the logical contents of the string.
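
To illustrate the distinction from the Perl side, here is a minimal sketch
(the variable names are made up for the example). It assumes only the
built-in utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions, and
shows that the flag tracks the storage format, not the logical contents:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, code point U+00E9.
# Written with an escape so the result does not depend on source encoding.
my $downgraded = "\x{e9}";
my $upgraded   = "\x{e9}";

utf8::downgrade($downgraded);   # store as one byte per code point, flag off
utf8::upgrade($upgraded);       # store in Perl's upgraded format, flag on

# The flag differs between the two copies...
printf "flag (downgraded): %d\n", utf8::is_utf8($downgraded) ? 1 : 0;
printf "flag (upgraded):   %d\n", utf8::is_utf8($upgraded)   ? 1 : 0;

# ...but the logical contents are identical.
printf "equal:   %d\n", $downgraded eq $upgraded ? 1 : 0;
printf "lengths: %d %d\n", length($downgraded), length($upgraded);

# By contrast, the two bytes that encode U+00E9 in UTF-8 form a
# two-character string that does not carry the flag.
my $encoded = "\xc3\xa9";
printf "encoded: length %d, flag %d\n",
    length($encoded), utf8::is_utf8($encoded) ? 1 : 0;
```

XS code that branches on SvUTF8 to decide whether a string "is text" or
"is UTF-8" makes the same mistake as treating the two copies above as
different strings.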

-Dan
