develooper Front page | perl.perl5.porters | Postings from January 2022

Re: RFC: Rename the “UTF8” flag

Felipe Gasper
January 29, 2022 13:14

> On Jan 28, 2022, at 21:53, Leon Timmermans wrote:
> A change like this would involve a large amount of change in our code (and more functions/macros than just the ones you mention), as well as complicating our documentation (because suddenly all these things have two names). A change like that would need significant benefit to be worth all that headache.

Only one name, I propose. The other will remain as an internal alias, probably mentioned somewhere in the documentation, but not prominently.

> And I don't see that benefit. As far as I can tell, the argument is really more about ideological purity than any practical advantages.

The practical advantages are clarity and correctness. Around $work I’m “the guy who understands encodings”, and it takes a *long* time to explain this stuff. Having terminology that properly differentiates Perl-internal encoding from Perl-caller-visible “UTF-8” will greatly simplify those discussions.

Even for language maintainers, though, I think it’ll help. Note that interfaces like “SvPVutf8” will *not* be renamed. There is no need: that interface respects the code point storage abstraction and so is fine as-is. Likewise sv_utf8_decode() and sv_utf8_encode().

It will thus be easier for everyone--API maintainers as well as callers--to distinguish the external-facing stuff (“utf-?8”) from the internal-facing (“heavy”). Part of this could actually include appending to the documentation of SvHEAVY a disclaimer about the abstraction leak that it entails.

> One of the two internal encodings of Perl is (non-strict) UTF-8, and almost any code dealing with this distinction will have to have knowledge of what encoding is being used. Calling it "wide" or "heavy" might have made sense if we could hide that, but that's exactly what we can't do.
> This sort of thing makes sense in Perl-land, because we very much do want to hide internal encoding there. But in C-land we can't, and we shouldn't pretend otherwise.

Would you mind explaining further where/when it is necessary to probe the string-storage abstraction?

In my own experience, FWIW, it is entirely possible when using Perl’s C API to avoid assumptions about how an SV stores its code points internally. Yes, you can call SvUTF8* macros and such, but the safer approach of doing actual encode/decode operations and SvPVbyte/SvPVutf8 preserves the abstraction.
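To illustrate what I mean by preserving the abstraction, here is a toy model in plain C. It is only a sketch and does not use Perl's real API: toy_sv, toy_pv_utf8, toy_same_external_bytes, and the "heavy" field are made-up names standing in for an SV's string slot, an SvPVutf8-style accessor, and the internal flag. The only facts assumed about Perl are that a string may be stored either one byte per code point or as UTF-8, and that an accessor like SvPVutf8 hands back UTF-8 either way.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an SV's string slot. This is NOT Perl's real struct. */
typedef struct {
    unsigned char buf[16]; /* internal bytes */
    size_t len;            /* number of internal bytes */
    int heavy;             /* 0: one byte per code point (0..255); 1: UTF-8 */
} toy_sv;

/* Models the idea behind an SvPVutf8-style accessor: write the string's
 * code points to out as UTF-8 and return the byte count, or (size_t)-1
 * if out is too small. The flag is consulted here and nowhere else. */
size_t toy_pv_utf8(const toy_sv *sv, unsigned char *out, size_t cap) {
    size_t n = 0;
    if (sv->heavy) {
        /* Already UTF-8 internally: copy through unchanged. */
        if (sv->len > cap) return (size_t)-1;
        memcpy(out, sv->buf, sv->len);
        return sv->len;
    }
    for (size_t i = 0; i < sv->len; i++) {
        unsigned char c = sv->buf[i];
        if (c < 0x80) {
            /* Code points below 0x80 are one byte in UTF-8. */
            if (n + 1 > cap) return (size_t)-1;
            out[n++] = c;
        } else {
            /* Code points 0x80..0xFF become two bytes in UTF-8. */
            if (n + 2 > cap) return (size_t)-1;
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}

/* Returns 1 when both storage forms of U+00E9 yield identical UTF-8. */
int toy_same_external_bytes(void) {
    toy_sv light = { {0xE9}, 1, 0 };       /* stored as a single byte */
    toy_sv heavy = { {0xC3, 0xA9}, 2, 1 }; /* stored as UTF-8 */
    unsigned char a[8], b[8];
    size_t na = toy_pv_utf8(&light, a, sizeof a);
    size_t nb = toy_pv_utf8(&heavy, b, sizeof b);
    return na == 2 && nb == 2 && memcmp(a, b, 2) == 0;
}
```

The caller in toy_same_external_bytes never reads the flag; it only asks for UTF-8. That is the sense in which I find real XS code can stay ignorant of the internal storage.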

Assuming my experience is “legitimate”, and that it is in fact entirely possible in C-land to ignore Perl’s internal encoding, does your opinion change?

