perl.perl5.porters | Postings from February 2022

Re: RFC: Rename the “UTF8” flag

Felipe Gasper
February 1, 2022 13:58

> On Jan 31, 2022, at 16:47, Dave Mitchell <> wrote:
> On Sat, Jan 29, 2022 at 08:14:04AM -0500, Felipe Gasper wrote:
>> The practical advantages are clarity and correctness.
> No, it doesn't do either of those - it just *doubles* the problem for the
> next 20 years or so. Everything which confused XS authors about how perl
> stores large characters *remains*, but now in addition they have to
> remember two sets of nomenclature, and understand that they mean the same
> thing.

With due respect:

They *don’t* mean the same thing. That’s the point.

Under the proposal, “UTF-8 string” can refer solely to things *outside* Perl’s internals. Whether the code points do, in fact, correlate to valid UTF-8 is the application’s concern, not Perl’s. Meanwhile, “heavy string” can refer solely to how Perl stores those code points. The fact that that internal storage happens to resemble UTF-8 is coincidental, and applications don’t need to care about it (Perl’s PV-leak bugs notwithstanding).

It bears repeating: XS authors, like Perl authors, do NOT need to care about how Perl stores strings. SvPV, SvUTF8, and such are the C analogues to abstraction-leaking interfaces that well-written code should avoid. SvPVutf8/SvPVbyte, sv_utf8_decode, and the like achieve the same goals while preserving the abstraction. In fact, since XS authors don’t need Perl’s PV-leaking built-ins (exec, mkdir, &c.), there’s *less* ground for confusion when writing XS than when writing Perl.

We have utf8::upgrade(), utf8::downgrade(), and all the various encode/decode functions--only some of which are legit for Perl applications to use. Then there’s utf8::is_utf8(), which, for pure-Perl code, usually means the *opposite* of what it looks like it means. THIS. IS. MADNESS. No one groks it all without investing *significant* effort. Every time I demonstrate Perl’s PV-leak bugs at work (oftentimes after having diagnosed a breakage), I have to explain all this again, even to folks who’ve written Perl for over 10 years.
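
To make the is_utf8() point concrete, here is a minimal sketch using only the core utf8:: functions. The variable names are mine, purely for illustration. It shows that upgrading changes only the internal storage, not the string's value, and that the flag is *off* for a string that actually holds UTF-8-encoded bytes:

```perl
use strict;
use warnings;

# Two copies of the same one-character string (code point 0xE9).
my $light = "\xe9";
my $heavy = "\xe9";

# Changes only how Perl stores $heavy internally; the value is untouched.
utf8::upgrade($heavy);

my $same_value  = ( $light eq $heavy )                 ? 1 : 0;
my $same_length = ( length($light) == length($heavy) ) ? 1 : 0;

# The flag differs even though the two strings are equal.
my $light_flag = utf8::is_utf8($light) ? 1 : 0;
my $heavy_flag = utf8::is_utf8($heavy) ? 1 : 0;

# A string that really does contain UTF-8: encode the character in place.
my $encoded = "\xe9";
utf8::encode($encoded);    # now the two bytes 0xC3 0xA9

my $encoded_length = length($encoded);
my $encoded_flag   = utf8::is_utf8($encoded) ? 1 : 0;

print "same value: $same_value, same length: $same_length\n";
print "is_utf8(light): $light_flag, is_utf8(heavy): $heavy_flag\n";
print "encoded length: $encoded_length, is_utf8(encoded): $encoded_flag\n";
```

So the one string here that is UTF-8 in the application's sense reports a false is_utf8(), while a plain character string reports true once it has been upgraded. Well-behaved code should not be able to tell $light and $heavy apart at all.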

Those conversations will be simpler if those internal-to-Perl things can be wrapped up in a nice, specific-to-Perl term that I don’t have to explain, rather than our status quo where it’s like “so there are two kinds of UTF-8 in play, but only one of those is your concern.” Likewise, XS authors can intuitively avoid weirdly-named things like SvHEAVY_on(), preferring safer APIs with simpler-sounding names like sv_utf8_decode(). The magic space wizards who maintain Perl’s internals can then have SvHEAVY all to themselves and decide whether sv_utf8_decode() should, in fact, just set the SV’s internal flag or not.

And, FWIW, those conversations will also be simpler if the feature bundle would fix the PV leaks, which will avoid the whole matter of Perl internals. :)

