develooper Front page | perl.perl5.porters | Postings from February 2022

Re: RFC: Rename the “UTF8” flag

Felipe Gasper
February 4, 2022 03:05

> On Feb 3, 2022, at 17:13, Tony Cook wrote:
> On Fri, Jan 28, 2022 at 10:26:31AM -0500, Felipe Gasper wrote:
>> Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
>> maintainers. This RFC proposes to rename it in source code and
>> documentation, retaining old names as aliases to avoid breaking callers
>> of Perl’s C API (XS modules & embedders).
> The UTF8 flag does what it says on the box - indicates the PV is
> encoded using (something like) UTF-8.

Oof. If a pure-Perl user read that description, do you see how that person would reasonably reach for utf8::is_utf8()?

When the PV is encoded in UTF-8 but *doesn’t* have the flag set, what then? Do we say that the PV, which contains UTF-8, is not “encoded using UTF-8”? So, the likely result of utf8::encode() is something that’s not “encoded using UTF-8”??

If you have different terms for different things, then it’s easy to describe this: the flag indicates whether the PV stores a heavy-encoded string, and utf8::encode produces a UTF-8 encoded string. Only the Jedi masters who maintain Perl need worry about what “heavy” really means.
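To make the distinction concrete, here is a minimal sketch of the two meanings side by side. The variable names are made up for illustration, and it assumes a source file without "use utf8", so the string literal starts out in Perl's non-upgraded (flag-off) storage:

```perl
use strict;
use warnings;

# Illustrative names only; "\x{e9}" is the single character e-acute.
my $chars = "caf\x{e9}";        # 4 characters, stored non-upgraded: flag off

my $upgraded = $chars;
utf8::upgrade($upgraded);       # same 4 characters, internal storage changed: flag on

my $encoded = $chars;
utf8::encode($encoded);         # 5 octets of UTF-8, one per "character": flag off

for my $pair ( [ chars => $chars ], [ upgraded => $upgraded ], [ encoded => $encoded ] ) {
    my ($name, $str) = @$pair;
    printf "%-8s length=%d flag=%s\n",
        $name, length($str), utf8::is_utf8($str) ? "on" : "off";
}

# The flag says nothing about the string's value: these compare equal.
print "chars eq upgraded\n" if $chars eq $upgraded;
```

The string that actually holds UTF-8 output ($encoded) is the one without the flag, while the flagged string ($upgraded) is still just the original four characters. That is the trap a name like "UTF8" sets for anyone reaching for utf8::is_utf8().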

Aside: Do people overestimate the degree of leakage in that abstraction? It seems pretty tight now; if something like Sys::Binmode were in a feature bundle, the documentation could potentially even drop the mention of internal UTF-8 (or relegate it to some special “arcana” section).

> If the documentation is fine and users are ignoring that documentation
> renaming the flag isn't going to help.

Users may not see documentation: they might ignore it, might not know how to find it, etc. When they see a familiar moniker like “UTF8” they’re likely to think, “Hey, I know what UTF-8 means … guess that means this is a UTF-8 encoded string!”

If they see “HEAVY”, though, they won’t make that mistake since, like so much else in the lexicon of C-level Perl stuff, its significance is pretty opaque to the uninitiated.

If I may: Tony, Dave, Leon, Yves, and others who have so firmly “downvoted” this idea … when was the last time you explained character encoding in Perl to someone? I wonder how frequently those who dislike this proposal encounter the other meaning of “UTF-8 encoded string”--which normally *won’t* have the UTF8 flag.

