Re: Adding utf8 support to DBD::mysql

From: Charles Jardine
Date: May 1, 2006 03:21
Subject: Re: Adding utf8 support to DBD::mysql
Message ID: 4455E14E.1000304@cam.ac.uk

Tim Bunce wrote:
> [I'm at the mysql conference and Patrick asked me about adding utf8
> support to DBD::mysql.]

  [snip]

> *** Detecting Inconsistencies
> 
> If the connection character set is _not_ utf8 but the application calls
> the driver with data (or SQL statement) that has the UTF8 flag set, then
> it could issue a warning. In practice that may be too noisy for
> people who have done their own workarounds for utf8 support. If so, the
> warnings could be changed to level 1 trace messages.
> 
> If the connection character set _is_ utf8, and the application calls
> the driver with data (or SQL statement) that does _not_ have the UTF8
> flag set but _does_ have bytes with the high bit set, then the driver
> should issue a warning. The checking for high bit set is an extra cost
> so this should only be enabled if tracing and/or an attribute is set
> (perhaps called $dbh->{mysql_charset_checks} = 1)

Tim,

You don't explicitly say what you are proposing should be done with
the anomalous data. I guess, by analogy with the behaviour of
DBD::Oracle's handling of SQL statements, that the implicit proposal
is to pass the octets of the anomalous string unchanged across the
connection. This will result in octet strings which Perl has flagged
as being utf8-encoded being passed over connections which expect
byte encoding, and vice versa.

I think that this is wrong as the default behaviour for a DBD, and
I would be sorry to see another DBD converted to behave in this way.

The default behaviour I would like to see is as follows:

If a utf8-flagged string is presented for transmission over a
byte-encoded connection, an attempt should be made to downgrade
the string to byte encoding. This will fail if the string contains
characters with codepoints > 255. Such failure should be treated
as an error.

If a string without the utf8 flag is presented for transmission
across a utf8-encoded connection, it should simply be upgraded
to utf8 encoding. This cannot fail.
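
The two rules above can be sketched in pure Perl. This is only an
illustration, not code from any existing driver: the helper name
wire_octets() and its $connection_is_utf8 argument are hypothetical,
and a real DBD would do the equivalent work in XS. It relies only on
the built-in utf8::downgrade() and utf8::encode() functions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octets that should be sent for $string
# over a connection that is either utf8-encoded or byte-encoded.
sub wire_octets {
    my ($string, $connection_is_utf8) = @_;
    my $copy = $string;    # work on a copy; leave the caller's scalar alone

    if ($connection_is_utf8) {
        # utf8::encode() yields the same octets an upgraded scalar holds
        # internally, whether or not the UTF8 flag was set on input.
        # This step cannot fail.
        utf8::encode($copy);
    }
    else {
        # With a true second argument, utf8::downgrade() returns false
        # (rather than dying) when a codepoint above 255 is present.
        utf8::downgrade($copy, 1)
            or die "string contains characters with codepoints > 255\n";
    }
    return $copy;
}
```

With this helper, "caf\xe9" produces the same octets on a given
connection whether or not its UTF8 flag is set, and a string such as
"\x{263A}" raises an error on a byte-encoded connection.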

I am aware that a DBD which does not automatically upgrade and
downgrade may provide a useful compatibility bridge for programs
originally written to cope with DBDs without Unicode support.
However, such DBDs are not compatible with the spirit of
perldoc perluniintro and perldoc perlunicode. To quote from the
former:

>      o   How Do I Know Whether My String Is In Unicode?
> 
>          You shouldn't care.  No, you really shouldn't.  No,
>          really.  If you have to care--beyond the cases described
>          above--it means that we didn't get the transparency of
>          Unicode quite right.
> 
>          Okay, if you insist: [...]

A DBD which does not handle upgrading and downgrading itself
doesn't get the transparency quite right. The writer of a program
using such a driver has to care about the utf8 flag, since
strings which compare equal in Perl, but differ in the setting
of the flag, will produce different results when processed by
the DBD. This ought not to be the way of the future.
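
To make the point concrete, here is a small sketch of two strings
that Perl treats as equal but that differ in the utf8 flag. The
variable names are mine, and bytes::length() is used only to peek at
the size of the internal buffer, which is what a driver sees if it
passes the octets through unchanged.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

my $plain    = "caf\xe9";    # four characters, UTF8 flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same four characters, UTF8 flag on

# Perl compares characters, so the two strings are equal.
print "equal in Perl\n" if $plain eq $upgraded;

# The internal buffers differ in size, so a driver that copies them
# unchanged sends different octets for "equal" strings.
printf "plain:    flag=%d, internal octets=%d\n",
    utf8::is_utf8($plain) ? 1 : 0, bytes::length($plain);
printf "upgraded: flag=%d, internal octets=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0, bytes::length($upgraded);
```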

-- 
Charles Jardine - Computing Service, University of Cambridge
cj10@cam.ac.uk    Tel: +44 1223 334506, Fax: +44 1223 334679

