develooper Front page | perl.perl5.porters | Postings from August 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

From:
Felipe Gasper
Date:
August 18, 2021 20:35
Subject:
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
1A5E0A86-9A17-4C8B-BB5C-7D4CA83319E4@felipegasper.com


> On Aug 18, 2021, at 4:24 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> 
> On Wed, Aug 18, 2021 at 7:17 PM Felipe Gasper <felipe@felipegasper.com> wrote:
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
> >
> > The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little sense except to the “highly initiated”.
> >
> > Another problem is “UTF-8” doesn’t really describe the “upgraded” format. This format is what Perl historically called “lax UTF-8” and is now widely called “generalized UTF-8”, which includes unpaired surrogates and code points above Unicode’s maximum.
> >
> > PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> > - SVf_UTF8        -> SVf_PVUPGRADED
> > - SvUTF8          -> Sv_PVUPGRADED
> > - SvUTF8_on       -> Sv_PVUPGRADED_on
> > - SvUTF8_off      -> Sv_PVUPGRADED_off
> > - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
> >
> > Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.
> >
> > BENEFITS: Over time, this rename will minimize the confusion between Perl’s upgraded-PV storage format versus UTF-8. The rename may also compel current users of the language who hold mistaken mental models of the flag’s purpose to reexamine their understanding, hopefully for the better.
> >
> > POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.
> 
> I would disagree. Perl code should not have to care/see what the internal encoding is (it's breaking the encapsulation, really), but perl's internals very much do and should care about the internal encoding.

This isn’t really true, though. Pure Perl code also frequently has to care about the internal encoding due to the many instances where Perl itself leaks it.

Example:
-----
perl -Mutf8 -MJSON::PP -e'my $foo = JSON::PP::decode_json( JSON::PP::encode_json(["é"]) )->[0]; exec "echo", $foo'
-----
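This *should* print mojibake: $foo is a character string holding the single code point U+00E9 and is never encoded, so exec ought to receive the lone byte 0xE9, which is not valid UTF-8. It happens to print “é” on a UTF-8 terminal only because of the leak: the decoded string’s PV is stored upgraded, and exec passes those internal bytes (0xC3 0xA9) through as-is.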
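
For anyone who wants to see the leak without JSON::PP or a terminal in the way, here is a minimal sketch. It does not call exec; it assumes that inspecting the stored bytes directly is a fair stand-in for what exec would hand to the OS. internal_bytes() is a hypothetical helper written for this illustration, not anything from core:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $down = "\xe9";
my $up   = "\xe9";
utf8::downgrade($down);   # PV holds the single byte E9
utf8::upgrade($up);       # PV holds the two bytes C3 A9; flag is set

# Hypothetical helper: hex dump of the bytes the PV stores.
# For an upgraded string, utf8::encode() on a copy exposes those bytes.
sub internal_bytes {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $s;
}

# At the Perl level the two strings are indistinguishable ...
printf "equal:   %s\n", $down eq $up ? "yes" : "no";          # yes
printf "lengths: %d %d\n", length $down, length $up;          # 1 1

# ... but their storage differs, and that is what leaks.
printf "flags:   %d %d\n",
    utf8::is_utf8($down) ? 1 : 0, utf8::is_utf8($up) ? 1 : 0; # 0 1
printf "down PV: %s\n", internal_bytes($down);                # E9
printf "up   PV: %s\n", internal_bytes($up);                  # C3 A9
```

Two strings that compare equal can thus produce different output once their bytes cross a boundary that ignores the flag, which is the sense in which pure-Perl code ends up caring about the internal encoding.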

When/if that leaky behaviour gets fixed -- 5.36 feature bundle, maybe? -- then it’ll make more sense to consider the PV encoding a wholly internal matter.

-FG

