
Re: RFC: Rename the “UTF8” flag

Felipe Gasper
February 1, 2022 21:11

> On Feb 1, 2022, at 14:58, Karl Williamson wrote:
> I hate the word "upgraded" for our uses
> From
> "upgraded  adjective
> improved by the addition or replacement of components; raised to a higher standard."
> Just what is it about a UTF8-encoded string that makes it better than a non- one?  What is it about an SVt_PV that makes it better than an SVt_IV?

^^ Case in point. “UTF8-encoded string” here does not mean what Perl users would legitimately call a “UTF8-encoded string”. Anyone who doesn’t maintain Perl’s internals may well think Karl is talking here about a string that utf8::encode() or similar gave … which likely would *not* have its UTF8 flag set.
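
To make the two meanings concrete, here is a minimal sketch using only the built-in utf8:: functions. The variable names are illustrative, and the comments describe the behavior documented for utf8::encode() and utf8::upgrade(), not anything specific to this proposal.

```perl
use strict;
use warnings;

# A 4-character string containing U+00E9 (written as an escape so the
# source file itself stays ASCII).
my $chars = "caf\x{e9}";

# What a Perl user calls "UTF-8 encoded": the characters are replaced
# by their UTF-8 octets. The result is 5 bytes long and the internal
# UTF8 flag is off.
my $bytes = $chars;
utf8::encode($bytes);
printf "encoded:  length=%d flag=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;

# What the internals call "UTF8": same 4 characters, same value under
# eq, but stored differently and with the flag on.
my $upgraded = $chars;
utf8::upgrade($upgraded);
printf "upgraded: length=%d flag=%d same_text=%d\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0,
    $upgraded eq $chars ? 1 : 0;
```

The string that came out of utf8::encode() is the one without the flag, and the flagged string is not "encoded" in any sense a caller can observe.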

For folks like myself, who mostly do Perl/CPAN/XS but dabble in Perl internals, it’s pretty jarring.

There’s much less chance of misunderstanding if the term to describe internal-UTF8 does not overlap with Perl users’ (and XS modules’) legitimate understanding of “UTF-8”. I think this is so even for Perl maintainers, since they do need to interact with CPAN et al. and will thus encounter the phrase “UTF8-encoded” in that context.

> We use this word to mean something different than its standard usage. That lowers efficiency of maintenance.  It still causes me pause whenever I see these non-English uses.  We may be stuck with the poor choices of wording that were made earlier in the project; but we shouldn't add more misery either.
> "Heavy" in my experience has been used to mean something that is complicated and/or slow that we strive to avoid when possible.  So, before it was removed, was for the heavy lifting of going out to disk to gather the necessary data, and we played games to defer it until absolutely necessary.

Perl does try to avoid “heavy” internal storage when possible: sv_utf8_decode() only adds the UTF8 flag if there are UTF-8-variant characters. "\xe9" in source code will be stored non-heavy. Perl avoids heavy/upgraded/UTF8 when possible for good reason: it eats more memory and is slower and more complicated to parse than an equivalent Latin-1 PV. It makes sv_dump()’s output more complex. It is “heavier”.
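
A rough way to observe this from Perl space, as a sketch: utf8::decode() is the Perl-level counterpart of sv_utf8_decode(), and bytes::length() is used here only as a peek at the internal byte count. The flagged() helper and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length() without enabling the pragma

# Hypothetical helper: 1 if the scalar uses the flagged internal form.
sub flagged { return utf8::is_utf8($_[0]) ? 1 : 0 }

# "\xe9" in source code: one character, one byte, flag off.
my $latin1 = "\xe9";

# Decoding pure ASCII: nothing is UTF-8-variant, so the flag stays off.
my $ascii = "abc";
utf8::decode($ascii);

# Decoding the UTF-8 octets for U+00E9: the flag is turned on.
my $variant = "\xc3\xa9";
utf8::decode($variant);

# Forcing the flagged form for the Latin-1 string: same single
# character, but it now occupies two bytes internally.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

for my $pair ( [ latin1 => $latin1 ], [ ascii => $ascii ],
               [ variant => $variant ], [ upgraded => $upgraded ] ) {
    my ( $name, $str ) = @$pair;
    printf "%-8s flag=%d chars=%d internal_bytes=%d\n",
        $name, flagged($str), length($str), bytes::length($str);
}
```

The last two rows hold the same character as the first, at twice the storage.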

> I also don't buy the argument that adding synonyms doubles the cognitive load.  A new good word that the core converts to drives out use of the old worse one.  Newcomers may never come across the old word, and have the advantage of something that isn't misleading.  The catch is the new word must be clearly better.

The fact that any change will likely remove the ambiguity makes that, IMO, a fairly low bar. And per your reflection on the word “heavy” above, the rename I propose seems pretty apt, though TBH I wouldn’t object to “upgraded” (as Dan suggests), either.

