Front page | perl.perl5.porters |
Postings from February 2022
Re: RFC: Rename the “UTF8” flag
From: Felipe Gasper
Date: February 4, 2022 15:41
Subject: Re: RFC: Rename the “UTF8” flag
Message ID: 43ADF266-4C8A-4AE8-AA91-C88A29C6693A@felipegasper.com
> On Feb 4, 2022, at 09:31, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> On Thu, Feb 3, 2022, at 10:05 PM, Felipe Gasper wrote:
>> Tony wrote:
>> > The UTF8 flag does what it says on the box - indicates the PV is
>> > encoded using (something like) UTF-8.
>>
>> Oof. If a pure-Perl user read that description, do you see how that person would reasonably reach for utf8::is_utf8()?
>
> Just to call out this one point: I think that there's a distinction to be drawn between the SvUTF8 flag and utf8::is_utf8. I get that it's nice to have "is-X" match the "X" flag, but here, I think it's a bit of a complicated pain in the butt.
>
> That said, I also think that it's utf8::is_utf8 that leads to the mass of confusion. Providing an "is this string stored in internal format A or B" builtin to use _instead_ is a better idea. I would support something more like:
> • provide builtin::internal_string_format that returns 'blue' or 'green'
> • discourage using utf8::is_utf8, explaining "it's not what you think it is"
> • leave the SV flags how they are
Would Internals:: suit it better, since the idea is that Perl applications shouldn’t normally use this?
In that same vein, utf8::upgrade() and utf8::downgrade() could gainfully be renamed to, e.g., Internals::utf8_upgrade() and Internals::utf8_downgrade().
I think utf8::is_utf8() is a symptom. The root problem is the double-whammy that: a) pure-Perl applications *can* need to know Perl’s internals, and b) the same term is used for that internal encoding as for application-level encoding.
From the pure-Perl side, Perl’s internals are only relevant nowadays because exec et al. pass the raw PV to the OS. Compare these:
> perl -Mutf8 -e'print "é"' | xxd
00000000: e9
> perl -Mutf8 -e'exec echo => "-n", "é"' | xxd
00000000: c3a9
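The same split can be seen without leaving Perl. The following is a minimal sketch, not part of the original message: `pv_hex` is a hypothetical helper written for this illustration, and it reconstructs the internal bytes from the string's flag rather than reading the PV directly. It shows that two strings can compare equal while being stored differently, which is the difference that exec exposes above.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00E9).
my $down = "\xe9";
my $up   = "\xe9";
utf8::downgrade($down);   # stored internally as the single byte E9
utf8::upgrade($up);       # stored internally as the two bytes C3 A9

# Hypothetical helper (not a core function): hex dump of the bytes Perl
# stores internally for a string, reconstructed without `use bytes`.
sub pv_hex {
    my ($str) = @_;
    my $copy = $str;                               # the copy keeps the flag
    utf8::encode($copy) if utf8::is_utf8($copy);   # upgraded: expose UTF-8 bytes
    return unpack 'H*', $copy;
}

printf "equal as strings: %s\n", ($down eq $up ? 'yes' : 'no');
printf "down: is_utf8=%d pv=%s\n", (utf8::is_utf8($down) ? 1 : 0), pv_hex($down);
printf "up:   is_utf8=%d pv=%s\n", (utf8::is_utf8($up)   ? 1 : 0), pv_hex($up);
```

Both strings are the same to `eq` and `length`; only the storage differs, and `utf8::is_utf8` reports the storage, not the content.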
If we fixed *that*, via some new feature-bundle-included pragma, then Perl internals would no longer be relevant for pure-Perl devs. I have a PoC CPAN module, Sys::Binmode, that does this by utf8-downgrading all strings prior to giving them to the OS. What if Perl had something like `use syscall::encoding 'bytes'`?
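Until something like that exists, the same effect can be approximated by hand at each call site. This is only a sketch under assumptions: `downgraded` is a hypothetical helper invented for this illustration, not the Sys::Binmode implementation and not a proposed core API. It returns downgraded copies of its arguments, so that whatever is handed to the OS has one byte per character.

```perl
use strict;
use warnings;

# Hypothetical helper: return copies of the arguments with the internal
# representation downgraded, leaving the caller's strings untouched.
sub downgraded {
    my @copies = @_;
    for my $arg (@copies) {
        # Dies if the string holds a code point above U+00FF, since such
        # a string cannot be stored one byte per character.
        utf8::downgrade($arg);
    }
    return @copies;
}

# Simulate what `use utf8` plus a literal "é" produces: an upgraded string.
my $e_acute = "\xe9";
utf8::upgrade($e_acute);

my @safe = downgraded('-n', $e_acute);

# The copies can now be passed to exec/system and friends, for example:
#   exec { 'echo' } 'echo', @safe;
printf "before: is_utf8=%d\n", utf8::is_utf8($e_acute) ? 1 : 0;
printf "after:  is_utf8=%d\n", utf8::is_utf8($safe[1]) ? 1 : 0;
```

A string containing characters above U+00FF makes `utf8::downgrade` die, which at least surfaces the problem instead of silently passing UTF-8 bytes; the caller would need to encode such strings explicitly first.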
Yes, there’d still be old posts that talk about `use bytes` and whatnot, but we’d at least be able to say “modern Perl fixes all that; update, and be happy.” Right now we can’t, which makes explaining all of this stuff much trickier than IMO it should be. That complicates Perl advocacy in general, since one of Perl’s “claims to fame” is being a premier text-processing tool.
cheers,
-Felipe