develooper Front page | perl.dbi.dev | Postings from April 2006

RE: Adding utf8 support to DBD::mysql

Thread Previous | Thread Next
From:
Martin J. Evans
Date:
April 30, 2006 11:10
Subject:
RE: Adding utf8 support to DBD::mysql
Message ID:
XFMail.20060430190741.martin.evans@easysoft.com
I hope I'm not muddying the waters but dbd::mysql UTF support for returned
data (not metadata) seems to be nearly there. I need UTF8 support on
at least inserted and returned results-sets although I'm less bothered
by UTF8 table/column names. It would seem that if you define your table as
using UTF8 then insertion is not a problem but retrieval is. The
following code nearly works - it is just the setting of the utf8 flag on
the returned data that is wrong:

#!/usr/bin/perl -w
use strict;
use DBI qw(:utils);
use charnames ':full';
use Encode;

print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; # smiley
print join(" ", unpack("H*", $str)), "\n";
print "length(str) = ", length($str), "\n";
print "bytes::length(str) = ", bytes::length($str), "\n";

print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
print "data_string_desc: ", data_string_desc($str),"\n";

open OUT, ">uni.out";
binmode(OUT, ":utf8");
print OUT "$str\n";
# data written to uni.out is UTF8

my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
# there are posts on dbi-user as to whether both or either of
# the following should be set
$dbh->do("set character set utf8");
$dbh->do("set names utf8");
$dbh->do("drop table if exists utf");
$dbh->do("create table utf (a char(100)) default charset utf8");
my $sth = $dbh->prepare("insert into utf values (?)");
$sth->execute($str);
$sth = $dbh->prepare("select * from utf");
$sth->execute;

my @row = $sth->fetchrow_array;
print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";
# the following shows we'e got the right data back
# but perl does not know it is utf8
print join(" ", unpack("H*", $row[0])), "\n";
# turning on utf8 causes the rignt uf8 sequence to be output
# and hence sv_utf8_upgrade(sv) will probably work
Encode::_utf8_on($row[0]);
print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";

open OUT, ">utf.out";
binmode (OUT, ":utf8");
print OUT $row[0];
close OUT;
# data written to utf.out is not UTF8 unless is marked utf8

produces:

Is utf8::is_utf8 defined: 1
Is utf8::valid defined: 1
e298ba787878d790d8a7
length(str) = 6
bytes::length(str) = 10
utf8::is_utf8 = 1
data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters 10 bytes
e298ba787878d790d8a7

with utf.out containing:

<C3><A2><C2><98><C2><BA>xxx<C3><97><C2><90><C3><98><C2><A7>

without that Encode::_utf8_on($row[0]);

Michael Kr´┐Żll (apr-10-2006)
posted a change for DBD::db2 which seemed to sort this out for DB2
so a similar change could be added to mysql.

Hope this helps.

Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com


On 24-Apr-2006 Tim Bunce wrote:
> [I'm at the mysql conference and Patrick asked me about adding utf8
> support to DBD::mysql. I said I'd look at the libmysql docs and give my
> thoughts. I'm posting to dbi-dev since it may be of interest to others
> interested in enhancing DBD::mysql and to other driver developers.
> These are just random thoughts from a quick look at the docs.]
> 
> The keys mysql docs seem to be
> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
> 
> The mysql api and client->server protocol doesn't support passing
> characterset info to the server on a per-statement / per-bind value basis.
> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datatypes.htm
> l)
> So the sane way to send utf8 to the server is by setting the 'connection
> character set' to utf8 and then only sending utf8 (or its ASCII subset)
> to the server on that connection.
> 
> *** Fetching data:
> 
> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD structure.
> It's the "character set number for the field".
> 
> So set the UTF8 flag based on that value. Something like:
>     (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
> I couldn't see any docs for the values of the charsetnr field.
> 
> Also, would be good to enable perl code to access the charsetnr values:
>     $sth->{mysql_charsetnr}->[$i]
> 
> *** Fetching Metadata:
> 
> The above is a minimum. It doesn't address metadata like field names
> ($sth->{NAME}) that might also be in utf8. For that the driver needs to
> know if the 'connection character set' is currently utf8.
> 
> (The docs mention mysql->charset but it's not clear if that's part of
> the public API.)
> 
> However it's detected, the code needs to end up doing:
>     (...connection charset is utf8...) ? SvUTF8_on(sv) : SvUTF8_off(sv);
> on the metadata.
> 
> 
> *** SET NAMES '...'
> 
> Intercept SET NAMES and call the mysql_set_character_set() API instead.
> See http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
> 
> 
> *** Detecting Inconsistencies
> 
> If the connection character set is _not_ utf8 but the application calls
> the driver with data (or SQL statement) that has the UTF8 flag set, then
> it could issue a warning. In practice that may be to be too noisy for
> people that done their own workarounds for utf8 support. If so then
> they could be changes to level 1 trace messages.
> 
> If the connection character set _is_ utf8, and the application calls
> the driver with data (or SQL statement) that does _not_ have the UTF8
> flag set but _does_ have bytes with the high bit set, then the driver
> should issue a warning. The checking for high bit set is an extra cost
> so this should only be enabled if tracing and/or an attribute is set
> (perhaps called $dbh->{mysql_charset_checks} = 1)
> 
> Tim.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About