develooper Front page | perl.perl5.porters | Postings from August 2021

Re: "use v5.36.0" should imply UTF-8 encoded source

From: Dan Book
Date: August 2, 2021 21:39
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: CABMkAVVhqo05cXt2k3HxDVUZBNXJyus9mvVxiBE_F2aXtZHpdA@mail.gmail.com

On Mon, Aug 2, 2021 at 5:32 PM Dan Book <grinnz@gmail.com> wrote:

> On Mon, Aug 2, 2021 at 4:28 PM Harald Jörg <haj@posteo.de> wrote:
>
>> Dan Book <grinnz@gmail.com> writes:
>>
>> > DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode
>> > option in any modern programs. Thus they expect decoded strings.
>>
>> As far as DBD::SQLite is concerned, this is only half-true.  The
>> current version 1.70 changed how unicode handling is declared, but
>> even with DBD_SQLITE_STRING_MODE_UNICODE_STRICT you can feed it UTF-8
>> encoded byte sequences and it "just works" (but maybe shouldn't).
>>
>> You see the downside of this when you have a non-ASCII literal in an
>> iso-latin-1 encoded Perl source (e.g. "ä" or "\x{e4}").  For Perl, it is
>> the same character as "\N{LATIN SMALL LETTER A WITH DIAERESIS}", but if
>> you feed both to the database you get different results.
>>
>
> I don't think this is correct. Mojo::SQLite has many tests to ensure
> that it treats strings consistently in unicode mode.
>
>
>> Veesh could change his source (if in a latin-1 encoded file)
>>     $customer_rs->search({ name => 'josé' })
>> to
>>     $customer_rs->search({ name => decode('iso-8859-1','josé') })
>> to make it work.
>>
>
> This code makes no difference: decoding from iso-8859-1 is a no-op on Perl
> strings (aside from treating "bytes" outside the single-byte encoding
> range as errors/replacement characters).
>
>
>> It seems that the driver still inspects the infamous UTF-8-flag to
>> decide whether a literal is encoded or not.
>>
>
> This is not the case.
>
> use strict;
> use warnings;
> use DBI;
> use DBD::SQLite;
> use DBD::SQLite::Constants ':dbd_sqlite_string_mode';
>
> my %options = (RaiseError => 1, AutoInactiveDestroy => 1,
> sqlite_string_mode => DBD_SQLITE_STRING_MODE_UNICODE_FALLBACK);
> my $db = DBI->connect('dbi:SQLite:dbname=:memory:', undef, undef,
> \%options);
>
> my $str = "\xe4";
>
> utf8::downgrade $str;
> printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
> undef, $str, $str);
> # prints: E4 (length: 1)
>
> utf8::upgrade $str;
> printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
> undef, $str, $str);
> # prints: E4 (length: 1)
>

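The earlier point about iso-8859-1 decoding can be checked without a database. The following is a minimal sketch using only core modules. The helper name `show` is made up for illustration and is not part of any library. It formats a string the same way the script above does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render a string as dot-separated hex code points
# plus its length in characters.
sub show {
    my ($s) = @_;
    return sprintf '%vX (length: %d)', $s, length $s;
}

my $literal = "\xe4";                          # one character, U+00E4
my $decoded = decode('iso-8859-1', $literal);  # byte 0xE4 maps to U+00E4

print show($literal), "\n";   # E4 (length: 1)
print show($decoded), "\n";   # E4 (length: 1)
print $literal eq $decoded ? "same string\n" : "different strings\n";
```

The two values may be stored differently internally, but they compare equal and have the same length, which is all a caller working with decoded strings should rely on.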
And for completeness, if you do the same test with the UTF-8 encoded bytes
"\xc3\xa4", you get consistent results as well:  C3.A4 (length: 2)
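The same thing can be seen at the Perl level, before the driver is involved. This sketch assumes only core Perl and does not exercise DBD::SQLite itself. The helper name `describe` is hypothetical. It shows that the UTF-8 encoded bytes are two characters to Perl regardless of the internal representation, and that only an explicit decode turns them into one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: hex code points and character length of a string.
sub describe {
    my ($s) = @_;
    return sprintf '%vX (length: %d)', $s, length $s;
}

my $bytes = "\xc3\xa4";   # UTF-8 encoding of U+00E4; two characters to Perl

utf8::downgrade($bytes);
my $downgraded = describe($bytes);

utf8::upgrade($bytes);
my $upgraded = describe($bytes);

# Decoding is what turns the two bytes into the single character U+00E4.
my $chars = describe(decode('UTF-8', "\xc3\xa4"));

print "$downgraded\n";   # C3.A4 (length: 2)
print "$upgraded\n";     # C3.A4 (length: 2)
print "$chars\n";        # E4 (length: 1)
```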

-Dan
