perl.perl5.porters | Postings from August 2021

Re: "use v5.36.0" should imply UTF-8 encoded source

From:
Felipe Gasper
Date:
August 2, 2021 17:17
Subject:
Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID:
152CC0E6-C4D7-49F4-9E0E-3B6F4A2B392D@felipegasper.com


> On Aug 2, 2021, at 11:53 AM, Dan Book <grinnz@gmail.com> wrote:
> 
> On Mon, Aug 2, 2021 at 11:31 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> 
> 
> > On Aug 2, 2021, at 11:17 AM, Veesh Goldman <rabbiveesh@gmail.com> wrote:
> > 
> > 
> > 
> > 
> > My point is still that this:
> > 
> > -----
> > use v5.36;
> > print 'Hello, world!';
> > -----
> > 
> > … should not be “subtly wrong”.
> > 
> > -F
> > 
> > Since 5.36 is meant to turn on warnings, this will be explicitly wrong, not subtly.
> > 
> > Perhaps the "wide character" warning is too unclear, but we can always improve the text to include a doc link as such.
> 
> There’s no “wide character” warning when there happen to be no wide characters.
> 
> > 
> > What compels me more is the following example.
> > Let's say I'm looking for customers in my database named josé. Easy, I'll use DBIC:
> > 
> > $customer_rs->search({ name => 'josé' })
> > 
> > But when I run it, I get nothing. That's because the various DBDs will handle encoding and decoding for you, since Perl is meant to deal with text in userland.
> 
> Which DBDs?
> 
> - DBD::SQLite is bytes by default, but it has the SvPV bug (i.e., it sends the internal PV to SQLite).
> 
> - DBD::mysql is also bytes w/ SvPV bug by default.
> 
> (I haven’t tried DBD::Pg.)
> 
> DBD::mysql has the unicode bug due to long standing issues. DBD::MariaDB was forked for this reason.
> 
> DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode option in any modern programs. Thus they expect decoded strings.

None of these modules’ documentation says “all new code should enable this”, so if indeed “any modern programs” should be set up that way, it seems a rather cargo-cult-ish thing.

I would say, respectfully, that you yourself are “making a lot of assumptions about other people's code”, etc. etc.

>  
> > Had utf8 been turned on, then I would've started with text, not bytes, and found my customers instead of mojibake (though on the other hand, the non utf8 is a great way to find double encoded text).
> > 
> > I think this is a more realistic example than printing a string literal, where the behavior is surprising and conceptually inconsistent.
> 
> Why would you query on a string constant? More likely you’ll be accepting $name via some input, in which case you have to decode it. But if you tried it with a constant you may be confused at why you *didn’t* have to decode it there.
> 
> You are making a lot of assumptions about other people's code and thought processes based on your own experience, which is not the way many people approach these problems. And that is why we are considering this: to make the defaults match more people's assumptions.

Making defaults match assumptions is a great thing. I just think newcomers to the language would make assumptions about what `print 'Hello, world!'` does before they reason about DBI etc. Most of those newcomers will hail from JS or Python, where this stuff “just works”.

It basically seems like all the right people are on board with the notion that “Hello, world” in “modern” Perl will look thus:

-----
use v5.36;
use Encode;
print Encode::encode_utf8('Hello, world!');
-----

… and any ensuing explanation will have to discuss character encoding, and the fact that Perl can’t tell text from bytes. Right away this simple example draws attention to one of Perl’s more frustration-prone qualities.
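
To make the text-versus-bytes distinction concrete, here is a minimal sketch, assuming the script is saved as UTF-8 and using the 'josé' literal from earlier in the thread. The variable names are illustrative only; the point is that under `use utf8` the literal is a string of characters, and producing bytes for output is a separate, explicit step.

```perl
use strict;
use warnings;
use utf8;          # assumption: this source file is saved as UTF-8
use Encode ();

# Under "use utf8" the literal is decoded text: 4 characters.
my $name = 'josé';

# Encoding yields a byte string; é becomes the two bytes 0xC3 0xA9.
my $bytes = Encode::encode_utf8($name);

printf "%d characters, %d bytes\n", length($name), length($bytes);

# Print the encoded bytes rather than the character string, so the
# output does not depend on whatever layer STDOUT happens to have.
binmode(STDOUT, ':raw');
print $bytes, "\n";
```

Without `use utf8`, the same literal would be read as five separate bytes and `length($name)` would report 5, which is the kind of difference a newcomer would have to be told about.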

Respectfully, I just can’t see how this improves the language, and I’m surprised more folks aren’t voicing similar thoughts. I’d love to be wrong; I guess we’ll see.

cheers,
-Felipe