develooper Front page | perl.perl5.porters | Postings from August 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

Thread Previous | Thread Next
From:
Sergey Aleynikov
Date:
August 20, 2021 07:05
Subject:
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
CAKNj8S2DSz1q7RuRRZPp3M34whkkb27vp_YCzv+xGN=GPmKt4w@mail.gmail.com
ср, 18 авг. 2021 г. в 20:17, Felipe Gasper <felipe@felipegasper.com>:
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string

There's no likeness. For literal string, there're deterministic rules
set (though they may not be documented).

>Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.

Maybe I don't understand you, but perl can't have invalid UTF8 in
literals under 'use utf8'.

> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:

Which will only bring more confusion going forward. If you want to
fight SVf_UTF8 confusion, the problem lies not in it's name, but in
the logic behind it. You're trying to shove this issue under the rug,
but what really makes things this messy is this flag's mere existence
(and it still might be better than Python's choice for theirs Unicode
strings). -1 from me.

Best regards,
Sergey Aleynikov

>
> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little sense except to the “highly initiated”.
>
> Another problem is “UTF-8” doesn’t really describe the “upgraded” format. This format is what Perl historically called “lax UTF-8” and is now widely called “generalized UTF-8”, which includes unpaired surrogates and code points above Unicode’s maximum.
>
> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> - SVf_UTF8        -> SVf_PVUPGRADED
> - SvUTF8          -> Sv_PVUPGRADED
> - SvUTF8_on       -> Sv_PVUPGRADED_on
> - SvUTF8_off      -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.
>
> BENEFITS: Over time, this rename will minimize the confusion between Perl’s upgraded-PV storage format versus UTF-8. The rename may also compel current users of the language who hold mistaken mental models of the flag’s purpose to reexamine their understanding, hopefully for the better.
>
> POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About