
Adding utf8 support to DBD::mysql

From: Tim Bunce
Date: April 24, 2006 15:05
Subject: Adding utf8 support to DBD::mysql
Message ID: 20060424215319.GA10410@timac.local

[I'm at the mysql conference and Patrick asked me about adding utf8
support to DBD::mysql. I said I'd look at the libmysql docs and give my
thoughts. I'm posting to dbi-dev since it may be of interest to others
interested in enhancing DBD::mysql and to other driver developers.
These are just random thoughts from a quick look at the docs.]

The key mysql docs seem to be
http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html

The mysql api and client->server protocol don't support passing
character set info to the server on a per-statement / per-bind value basis.
(http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datatypes.html)
So the sane way to send utf8 to the server is by setting the 'connection
character set' to utf8 and then only sending utf8 (or its ASCII subset)
to the server on that connection.

*** Fetching data:

MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD structure.
It's the "character set number for the field".

So set the UTF8 flag based on that value. Something like:
    (field->charsetnr == ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
I couldn't see any docs for the values of the charsetnr field.

Also, it would be good to let perl code access the charsetnr values:
    $sth->{mysql_charsetnr}->[$i]

*** Fetching Metadata:

The above is a minimum. It doesn't address metadata like field names
($sth->{NAME}) that might also be in utf8. For that the driver needs to
know if the 'connection character set' is currently utf8.

(The docs mention mysql->charset but it's not clear if that's part of
the public API.)

However it's detected, the code needs to end up doing:
    (...connection charset is utf8...) ? SvUTF8_on(sv) : SvUTF8_off(sv);
on the metadata.


*** SET NAMES '...'

Intercept SET NAMES and call the mysql_set_character_set() API instead.
See http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
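
A rough sketch of the interception step, in plain C so it can be tried
outside the driver. The names parse_set_names() and skip_keyword() are
hypothetical, not part of DBD::mysql or libmysql. The idea is that the
driver would run this on the statement text and, on a match, pass the
extracted name to mysql_set_character_set() instead of sending the SQL.
It only recognises the plain "SET NAMES name" form; anything else
(for example a COLLATE clause) is reported as "no match" so it can go to
the server unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitively match the upper-case keyword kw
 * at *p. The keyword must be followed by whitespace. On success, advance
 * *p past the keyword and the whitespace and return 1; else return 0. */
static int skip_keyword(const char **p, const char *kw)
{
    const char *s = *p;

    while (*kw) {
        if (toupper((unsigned char)*s) != (unsigned char)*kw)
            return 0;
        s++;
        kw++;
    }
    if (!isspace((unsigned char)*s))
        return 0;
    while (isspace((unsigned char)*s))
        s++;
    *p = s;
    return 1;
}

/* Hypothetical helper: if sql is a plain "SET NAMES <charset>" statement
 * (name optionally quoted, optional trailing semicolon), copy the charset
 * name into out as a NUL-terminated string and return 1. Return 0 for
 * everything else, leaving the statement for the server to handle. */
int parse_set_names(const char *sql, char *out, size_t outlen)
{
    const char *p = sql;
    size_t n = 0;
    char quote = 0;

    assert(sql != NULL && out != NULL && outlen > 0);

    while (isspace((unsigned char)*p))
        p++;
    if (!skip_keyword(&p, "SET") || !skip_keyword(&p, "NAMES"))
        return 0;

    if (*p == '\'' || *p == '"')
        quote = *p++;

    /* Charset names are assumed to be alphanumerics and underscores. */
    while (isalnum((unsigned char)*p) || *p == '_') {
        if (n + 1 >= outlen)
            return 0;               /* name does not fit in out */
        out[n++] = *p++;
    }
    if (n == 0)
        return 0;
    if (quote && *p++ != quote)
        return 0;                   /* unterminated quote */

    /* Only whitespace and one optional ';' may follow. */
    while (isspace((unsigned char)*p))
        p++;
    if (*p == ';')
        p++;
    while (isspace((unsigned char)*p))
        p++;
    if (*p != '\0')
        return 0;

    out[n] = '\0';
    return 1;
}
```

Note that an unquoted "SET NAMES DEFAULT" would come back with the name
"DEFAULT", so a real driver would need to decide how to treat that case.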


*** Detecting Inconsistencies

If the connection character set is _not_ utf8 but the application calls
the driver with data (or SQL statement) that has the UTF8 flag set, then
it could issue a warning. In practice that may be too noisy for
people that have done their own workarounds for utf8 support. If so then
the warnings could be changed to level 1 trace messages.

If the connection character set _is_ utf8, and the application calls
the driver with data (or SQL statement) that does _not_ have the UTF8
flag set but _does_ have bytes with the high bit set, then the driver
should issue a warning. The checking for high bit set is an extra cost
so this should only be enabled if tracing and/or an attribute is set
(perhaps called $dbh->{mysql_charset_checks} = 1)
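
To make the two checks concrete, here is a minimal sketch in plain C.
All names (has_high_bit, check_value, the enum values) are made up for
illustration. In the driver, the sv_is_utf8 argument would presumably
come from SvUTF8(sv) and conn_is_utf8 from whatever the driver uses to
track the connection character set; here they are just plain ints so the
logic can be tested on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if any byte in buf[0..len) has its high bit set, i.e. the
 * data is not pure ASCII. This linear scan is the extra cost mentioned
 * above, so a driver would only run it when checks are enabled. */
int has_high_bit(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((unsigned char)buf[i] & 0x80)
            return 1;
    }
    return 0;
}

/* Hypothetical result codes for the consistency check. */
enum charset_check {
    CHARSET_OK = 0,
    WARN_UTF8_FLAG_ON_NON_UTF8_CONN,  /* UTF8-flagged data, non-utf8 connection */
    WARN_HIGH_BIT_ON_UTF8_CONN        /* unflagged non-ASCII bytes, utf8 connection */
};

/* Classify one value (or SQL statement) passed in by the application. */
enum charset_check check_value(int conn_is_utf8, int sv_is_utf8,
                               const char *buf, size_t len)
{
    if (!conn_is_utf8 && sv_is_utf8)
        return WARN_UTF8_FLAG_ON_NON_UTF8_CONN;
    if (conn_is_utf8 && !sv_is_utf8 && has_high_bit(buf, len))
        return WARN_HIGH_BIT_ON_UTF8_CONN;
    return CHARSET_OK;
}
```

The flag test comes first because it is cheap; the byte scan only runs
for unflagged data on a utf8 connection.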

Tim.
