On Wed, 18 Aug 2021 13:18:34 -0400 Felipe Gasper <felipe@felipegasper.com> wrote:
> Per recent IRC discussion
>
> PROBLEM: The naming of Perl's UTF-8 flag is a continual source of confusion regarding the flag's significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a UTF-8 string by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode é in UTF-8. The UTF-8 flag, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the UTF-8 flag set, but $foo is NOT a UTF-8 string because its code points (in this case, only 1) aren't valid UTF-8.
>
> The fact that quite often a UTF-8 string lacks the UTF-8 flag, and a UTF-8-flagged string is (usually) *not* a UTF-8 string, makes little sense except to the highly initiated.
>
> Another problem is UTF-8 doesn't really describe the upgraded format. This format is what Perl historically called "lax UTF-8" and is now widely called "generalized UTF-8", which includes unpaired surrogates and code points above Unicode's maximum.
>
> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.
>
> BENEFITS: Over time, this rename will minimize the confusion between Perl's upgraded-PV storage format versus UTF-8.
> The rename may also compel current users of the language who hold mistaken mental models of the flag's purpose to reexamine their understanding, hopefully for the better.
>
> POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.

utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
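
For illustration only, the "leave the old macros as aliases" part of the proposal might look roughly like the sketch below. This is not taken from perl's sv.h: `mock_sv`, its `flags` field, the bit value, and `aliases_agree` are all hypothetical stand-ins, and only four of the five proposed renames are shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an SV; real SVs are defined in perl's sv.h. */
typedef struct { uint32_t flags; } mock_sv;

/* New names from the proposal. The bit value is illustrative only. */
#define SVf_PVUPGRADED        0x20000000u
#define Sv_PVUPGRADED(sv)     ((sv)->flags & SVf_PVUPGRADED)
#define Sv_PVUPGRADED_on(sv)  ((sv)->flags |= SVf_PVUPGRADED)
#define Sv_PVUPGRADED_off(sv) ((sv)->flags &= ~SVf_PVUPGRADED)

/* Old names kept as aliases so that existing code keeps compiling. */
#define SVf_UTF8       SVf_PVUPGRADED
#define SvUTF8(sv)     Sv_PVUPGRADED(sv)
#define SvUTF8_on(sv)  Sv_PVUPGRADED_on(sv)
#define SvUTF8_off(sv) Sv_PVUPGRADED_off(sv)

/* Returns 1 if the old and new spellings observe the same flag bit. */
static int aliases_agree(void)
{
    mock_sv sv = { 0 };

    assert(SVf_UTF8 == SVf_PVUPGRADED);   /* same bit under both names */

    SvUTF8_on(&sv);                        /* set via the old name */
    int seen_new = Sv_PVUPGRADED(&sv) != 0;

    Sv_PVUPGRADED_off(&sv);                /* clear via the new name */
    int seen_old = SvUTF8(&sv) != 0;

    return seen_new && !seen_old;
}
```

Because the old spellings expand to the new ones, code written against either name reads and writes the same bit; only the documentation and new code would need to switch.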