
Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Graham Knop
Date: August 20, 2021 08:42
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CAM=m89F9Ru6sUtJ_LmEd1bzYVm_n65t4ZcmcmxdxkDuAp1f4mg@mail.gmail.com

On Fri, Aug 20, 2021 at 9:05 AM Sergey Aleynikov
<sergey.aleynikov@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string
>
> There's no "likely" about it. For string literals, there is a
> deterministic set of rules (though they may not be documented).
>
> > Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> Maybe I don't understand you, but perl can't have invalid UTF8 in
> literals under 'use utf8'.

But the contents of the string are not "UTF-8". UTF-8 is a byte
encoding for Unicode codepoints. From a language perspective (not
considering perl's implementation), the string contains a single
codepoint. It is not a UTF-8 byte sequence.
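
A minimal sketch of that distinction, in case it helps. The variable
names are illustrative only, and the codepoint is written as an escape
so the result does not depend on the source file or terminal encoding:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One codepoint, U+00E9 (é), as seen at the language level.
my $char = "\x{e9}";

# The UTF-8 encoding of that codepoint: the two bytes 0xC3 0xA9.
my $bytes = encode('UTF-8', $char);

printf "char:  length %d, codepoints: %s\n",
    length($char), join(' ', map { sprintf 'U+%04X', ord } split //, $char);
printf "bytes: length %d, bytes: %s\n",
    length($bytes), join(' ', map { sprintf '%02X', ord } split //, $bytes);

# The two strings are different values; decoding the byte sequence
# gives back the one-codepoint string.
print "char and bytes differ\n" if $char ne $bytes;
print "round trip ok\n"         if decode('UTF-8', $bytes) eq $char;
```

The first string has length 1 and the second has length 2. Only the
second one is a UTF-8 byte sequence, whatever perl does internally to
store either of them.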

>
> > PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
>
> Which will only bring more confusion going forward. If you want to
> fight SVf_UTF8 confusion, the problem lies not in its name, but in
> the logic behind it. You're trying to sweep this issue under the rug,
> but what really makes things this messy is this flag's mere existence
> (and it still might be better than Python's choice for their Unicode
> strings). -1 from me.
>
> Best regards,
> Sergey Aleynikov
>
> >
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
> >
> > The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little sense except to the “highly initiated”.
> >
> > Another problem is “UTF-8” doesn’t really describe the “upgraded” format. This format is what Perl historically called “lax UTF-8” and is now widely called “generalized UTF-8”, which includes unpaired surrogates and code points above Unicode’s maximum.
> >
> > PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> > - SVf_UTF8        -> SVf_PVUPGRADED
> > - SvUTF8          -> Sv_PVUPGRADED
> > - SvUTF8_on       -> Sv_PVUPGRADED_on
> > - SvUTF8_off      -> Sv_PVUPGRADED_off
> > - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
> >
> > Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.
> >
> > BENEFITS: Over time, this rename will minimize the confusion between Perl’s upgraded-PV storage format versus UTF-8. The rename may also compel current users of the language who hold mistaken mental models of the flag’s purpose to reexamine their understanding, hopefully for the better.
> >
> > POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.
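
For reference, a small sketch of the point the quoted proposal makes
about the "upgraded" format: the flag describes how the PV is stored,
not what the string contains. It uses only the built-in utf8::
functions; the variable names are illustrative, and the
`no warnings 'utf8'` line is there only as a precaution against
warnings about the non-Unicode codepoint:

```perl
use strict;
use warnings;
no warnings 'utf8';   # precaution: we create a codepoint above U+10FFFF

# The same one-codepoint string, stored two ways.
my $downgraded = "\x{e9}";
utf8::downgrade($downgraded);   # make sure the flag is off

my $upgraded = $downgraded;
utf8::upgrade($upgraded);       # switch storage format, flag is now on

printf "flag: downgraded=%d upgraded=%d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0);

# The language-level value is unchanged by the storage format.
print "same string\n"
    if $downgraded eq $upgraded && length($upgraded) == 1;

# The upgraded format can also hold codepoints above Unicode's maximum,
# which strict UTF-8 does not allow.
my $beyond = chr(0x110000);
printf "beyond: length %d, ord 0x%X, flag %d\n",
    length($beyond), ord($beyond), (utf8::is_utf8($beyond) ? 1 : 0);
```

The two copies compare equal and both have length 1 even though only
one of them carries the flag, and the last string is flagged despite
not being representable in standard UTF-8.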
