On 8/18/21 2:08 PM, Dan Book wrote:
> On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:
>
> > On Wed, 18 Aug 2021 13:18:34 -0400
> > Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> > > Per recent IRC discussion …
> > >
> > > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source
> > > of confusion regarding the flag’s significance. Some think it
> > > indicates whether a given PV stores text versus binary. Some think
> > > it means that the PV is valid UTF-8. Still others likely hold other
> > > inaccurate views.
> > >
> > > The problem here is the naming. For example, consider
> > > `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by
> > > virtue of the fact that its code points (assuming use of a UTF-8
> > > terminal) correspond to the bytes that encode “é” in UTF-8. The
> > > “UTF-8 flag”, however, is likely *not* set on this string. By
> > > contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has
> > > the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its
> > > code points (in this case, only 1) aren’t valid UTF-8.
> > >
> > > The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”,
> > > and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”,
> > > makes little sense except to the “highly initiated”.
> > >
> > > Another problem is “UTF-8” doesn’t really describe the “upgraded”
> > > format. This format is what Perl historically called “lax UTF-8”
> > > and is now widely called “generalized UTF-8”, which includes
> > > unpaired surrogates and code points above Unicode’s maximum.
> > >
> > > PROPOSAL: Rename the following identifiers in code and
> > > documentation, leaving macros for the old ones as aliases:
> > > - SVf_UTF8 -> SVf_PVUPGRADED
> > > - SvUTF8 -> Sv_PVUPGRADED
> > > - SvUTF8_on -> Sv_PVUPGRADED_on
> > > - SvUTF8_off -> Sv_PVUPGRADED_off
> > > - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
> > >
> > > Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename
> > > because these indicate an actual (if incomplete/invalidated) UTF-8
> > > decoding step.
> > >
> > > BENEFITS: Over time, this rename will minimize the confusion
> > > between Perl’s upgraded-PV storage format versus UTF-8. The rename
> > > may also compel current users of the language who hold mistaken
> > > mental models of the flag’s purpose to reexamine their
> > > understanding, hopefully for the better.
> > >
> > > POTENTIAL COMPLICATIONS: The mismatch between amended
> > > documentation and existing documentation may cause confusion; it
> > > should, though, be an auspicious confusion that eventually
> > > clarifies rather than misleads.
> >
> > utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
>
> Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> namespace, it's named that for internal reasons not interface reasons.
>
> -Dan

Upgrade and downgrade tell me nothing. I don't object to renaming, but
something better than these needs to be found.