develooper Front page | perl.dbi.dev | Postings from May 2006

Re: Adding utf8 support to DBD::mysql

Thread Previous | Thread Next
From:
Martin J. Evans
Date:
May 15, 2006 06:55
Subject:
Re: Adding utf8 support to DBD::mysql
Message ID:
XFMail.20060515145302.martin.evans@easysoft.com
Tim, Patrick and all,

On 15-May-2006 Tim Bunce wrote:
> On Sat, May 13, 2006 at 08:41:33PM +0100, Martin J. Evans wrote:
>> Hi,
>> 
>> This is really a progress report - some of it I've already sent to Patrick.
>> 
>> Attached is a patch to DBD::mysql 3.0003_1 which:
>> 
>> o fixes a few compile issues with were caused by declarations
>>   in the middle of a block - already sent to Patrick.
>> 
>> o fixes the memory corruption I previously reported on dbi-users.
>>   Sent to Patrick but the initial fix had problems - I've resent
>>   the changes to Patrick.
>> 
>> o introduces a small change to get utf8 data back
>>   All this does is check charsetnr to see if it is 33 and
>>   turns on the utf8 flag if it is.
>>   You still need to do one or both of:
>> 
>> $dbh->do("set character set utf8");
>> $dbh->do("set names utf8");
>> 
>>   to get utf8 back and even then you only get it back if the
>>   column is defined as utf8 in mysql.
> 
> That's not ideal. It should be possible to set the 'connection charset'
> to utf8 so all charsets are auto-converted to utf8 my mysql. That would
> be *much* more useful.

Yes, appologies - a little more investigation shows that you do get
charsetnr=33 (utf8) back for ascii strings too or columns not defined with
default charset of utf8 if the above "set" statements are run. e.g.

#!/usr/bin/perl -w
use strict;
use DBI qw(:utils);
use charnames ':full';
use Encode;

print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; # smiley
print join(" ", unpack("H*", $str)), "\n";
print "length(str) = ", length($str), "\n";
print "bytes::length(str) = ", bytes::length($str), "\n";

print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
print "data_string_desc: ", data_string_desc($str),"\n";

open OUT, ">uni.out";
binmode(OUT, ":utf8");
print OUT "$str\n";
# data written to uni.out is UTF8

my $dbh = DBI->connect("dbi:mysql:test", "xxx", "yyy");
# there are posts on dbi-user as to whether both or either of
# the following should be set
$dbh->do("set character set utf8");
$dbh->do("set names utf8");
$dbh->do("drop table if exists utf");
$dbh->do("create table utf (a char(100)) default charset utf8");
my $sth = $dbh->prepare("insert into utf values (?)");
$sth->execute($str);
$sth = $dbh->prepare("select * from utf");
$sth->execute;

my @row = $sth->fetchrow_array;
print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";
print join(" ", unpack("H*", $row[0])), "\n";
print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";

open OUT, ">utf2.out";
binmode (OUT, ":utf8");
print OUT $row[0];
close OUT;

$dbh->do("drop table if exists utf2");
$dbh->do("create table utf2 (a char(100))");
$sth = $dbh->prepare("insert into utf2 values (?)");
$sth->execute("hello");
$sth = $dbh->prepare("select * from utf2");
$sth->execute;
@row = $sth->fetchrow_array;
print "data_string_desc (after fetch): ", data_string_desc($row[0]),"\n";

returns:

Is utf8::is_utf8 defined: 1
Is utf8::valid defined: 1
e298ba787878d790d8a7
length(str) = 6
bytes::length(str) = 10
utf8::is_utf8 = 1
data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
data_string_desc (after fetch): UTF8 on, non-ASCII, 6 characters 10 bytes
e298ba787878d790d8a7
data_string_desc (after fetch): UTF8 on, non-ASCII, 6 characters 10 bytes
data_string_desc (after fetch): UTF8 on, ASCII, 5 characters 5 bytes

i.e. the create table (without default charset utf8) and containing ascii data
is described as UTF8 on, ASCII (with the previous patch I posted).
 
I have not tried creating non-ascii chrset table data and fetching that back as
utf8 yet. If someone has a small example of creating non-ascii, non-utf8 table
and data in perl for mysql that I can run on linux in a normal xterm I'll try
it.

>>   I'm not clear exactly which of the above 2 settings you need to make
>>   as there is contradictory evidence on the net about this.
> 
> This needs to be cleared up and the effect of each documented.

Agree.

I located the following documentation:

=====
There are two statements that affect the connection character sets:

SET NAMES 'charset_name'
SET CHARACTER SET charset_name

SET NAMES indicates what character set the client will use to send SQL
statements to the server. Thus, SET NAMES 'cp1251' tells the server "future
incoming messages from this client are in character set cp1251." It also
specifies the character set that the server should use for sending results back
to the client. (For example, it indicates what character set to use for column
values if you use a SELECT statement.)

A SET NAMES 'x' statement is equivalent to these three statements:

SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the
default collation for x.

SET CHARACTER SET is similar to SET NAMES but sets the connection character set
and collation to be those of the default database. A SET CHARACTER SET x
statement is equivalent to these three statements:

SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;
=====

but users in mailing lists reported differing results to that described.

Personally, I don't want to have to issue one or the other (or both) of these
"set" statements when I connect. The Perl unicode documentation goes on about
Perl modules should always just do the right thing with unicode but I don't see
how you do that here unless some sort of switch (perhaps a dbd::mysql connect
attribute). i.e. I'd rather do
DBI->connect(dbd:mysql:database;character_set=utf8) (or something like that)
than have to issue a do with the charset set in each time I connect.

>>   I've tested this a little - so far as my utf8 defined tables
>>   which contain utf8 data come back fine and none uf8 columns
>>   are untouched.
>> 
>> If anyone else fancies testing this or helping with information
>> about what you need to set (character set or names) I'd be pleased
>> to hear about it. My primary interest is fairly selfish in that I
>> need utf8 support but I'm happy to share what I find/change but I
>> also don't want to step on Patrick's shoes - as such this is cc'ed to
>> Patrick (and I also appologise for the first attempt I sent
>> you for fixing the memory corruption - my mistake - I sent an
>> out of date version).
>> 
>> I should probably also point out that this patch is against
>> a development release of dbd::mysql and all that implies.
>> 
>> There was also a mail from Henri on dbi-users about using TYPE=>SQL_INTEGER
>> in bind_param call - I don't do this and can't (currently from the existing
>> code) see how it would work.
>> 
>> As an aside, if anyone knows a definite list of charsetnr values
>> in mysql I would be interested in it as my patch is based on
>> what I've seen rather than what I've read anywhere.

A quick bit of C code calling get_charset with all values up to 1024 produced:

3 = dec8
4 = cp850
5 = latin1
6 = hp8
7 = koi8r
8 = latin1
9 = latin2
10 = swe7
11 = ascii
14 = cp1251
15 = latin1
16 = hebrew
20 = latin7
21 = latin2
22 = koi8u
23 = cp1251
25 = greek
26 = cp1250
27 = latin2
29 = cp1257
30 = latin5
31 = latin1
32 = armscii8
33 = utf8
36 = cp866
37 = keybcs2
38 = macce
39 = macroman
40 = cp852
41 = latin7
42 = latin7
43 = macce
44 = cp1250
47 = latin1
48 = latin1
49 = latin1
50 = cp1251
51 = cp1251
52 = cp1251
53 = macroman
57 = cp1256
58 = cp1257
59 = cp1257
63 = binary
64 = armscii8
65 = ascii
66 = cp1250
67 = cp1256
68 = cp866
69 = dec8
70 = greek
71 = hebrew
72 = hp8
73 = keybcs2
74 = koi8r
75 = koi8u
77 = latin2
78 = latin5
79 = latin7
80 = cp850
81 = cp852
82 = swe7
83 = utf8
92 = geostd8
93 = geostd8
94 = latin1
192 = utf8
193 = utf8
194 = utf8
195 = utf8
196 = utf8
197 = utf8
198 = utf8
199 = utf8
200 = utf8
201 = utf8
202 = utf8
203 = utf8
204 = utf8
205 = utf8
206 = utf8
207 = utf8
208 = utf8
209 = utf8

on my machine (all other numbers were undefined)

What worried me was the multiple values returning "utf8".


> Patrick (or someone), please file a doc bug with mysql to get them to
> document the values of charsetnr.
> 
> Thanks Martin.

I attempted to do that here:
http://bugs.mysql.com/bug.php?id=19814

I have not specifically looked at utf8 in the metadata yet e.g. table/column
names and probably won't until I can find out more about all those different
charsetnr numbers returning "utf8". However, I do notice you can
create a table with a utf8 string name and DBI->tables will also return table
names with utf8 on and ASCII or non-ASCII.

e.g.

my $tablename="\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}";
print "data_string_desc: ", data_string_desc($tablename),"\n";

$dbh->do("drop table if exists $tablename");
$dbh->do("create table $tablename (a char(100))");

my @t = $dbh->tables('test_bet', undef, undef, undef);
foreach my $t (@t) {
    #print data_string_desc($t),"\n";
    if (data_string_desc($t) =~ /non/) {
        print join(" ", unpack("H*", $t)), "\n";
    }
}

prints

60e298ba787878d790d8a760

correctly (the 60s are `)
(Note, MySQL GUI browser displays that table name as "|xxx||").

Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com


> Tim.
> 
>> I did try
>> adding get_charset calls to dbdimp.c for DBD::mysql to find
>> the charset name but to do this I needed to add a load of #includes
>> to dbdimp.h which caused a lot of compiler warnings. This was
>> based on code I found at http://bugs.mysql.com/bug.php?id=6911.
>> 
>> Martin
>> --
>> Martin J. Evans
>> Easysoft Ltd, UK
>> http://www.easysoft.com
>> 
>> 
>> On 04-May-2006 Martin J. Evans wrote:
>> > Tim,
>> > 
>> > On 04-May-2006 Tim Bunce wrote:
>> >> On Sun, Apr 30, 2006 at 01:36:04PM -0700, Patrick Galbraith wrote:
>> >>> Martin J. Evans wrote:
>> >>> 
>> >>> Martin,
>> >>> 
>> >>> Thanks much! This is dbdimp.c, right? I  will add this tomorrow (not 
>> >>> working today), and test it out.
>> > 
>> >  
>> >> Please don't use only is_high_bit_set() to enable UTF8.  That'll break
>> >> any code that is storing non-utf8 data that happens to have the high-bit
>> >> set.
>> >> 
>> >> Please make sure the test cases cover this situation. It's not enough
>> >> to get 'utf8 working' its also important to not break existing code.
>> >> 
>> >> Using the 'charsetnr' value (see below) looks far more correct. That way
>> >> perl will treat the values as UTF8 only if mysql was treating it as UTF8.
>> > 
>> > Sorry, I should have made it clearer it was only a demonstration that utf8
>> > can
>> > work with mysql as someone had been asking that. I had already told
>> > Patrick
>> > that off the list. I fully realised that hack would break 8 bit chrsets.
>> > 
>> > I have already started looking at charsetnr but have run into a number of
>> > issues due to the way charsetnr has changed over different versions of
>> > mysql.
>> > 
>> > Martin
>> > --
>> > Martin J. Evans
>> > Easysoft Ltd, UK
>> > http://www.easysoft.com
>> > 
>> > 
>> >>> >>>The keys mysql docs seem to be
>> >>> >>>http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
>> >>> >>>
>> >>> >>>The mysql api and client->server protocol doesn't support passing
>> >>> >>>characterset info to the server on a per-statement / per-bind value 
>> >>> >>>basis.
>> >>> >>>(http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datat
>> >>> >>>ype
>> >>> >>>s
>> >>> >>>.html)
>> >>> >>>So the sane way to send utf8 to the server is by setting the
>> >>> >>>'connection
>> >>> >>>character set' to utf8 and then only sending utf8 (or its ASCII
>> >>> >>>subset)
>> >>> >>>to the server on that connection.
>> >>> >>>
>> >>> >>>*** Fetching data:
>> >>> >>>
>> >>> >>>MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
>> >>> >>>structure.
>> >>> >>>It's the "character set number for the field".
>> >>> >>>
>> >>> >>>So set the UTF8 flag based on that value. Something like:
>> >>> >>>   (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>> >>> >>>I couldn't see any docs for the values of the charsetnr field.
>> >>> >>>
>> >>> >>>Also, would be good to enable perl code to access the charsetnr
>> >>> >>>values:
>> >>> >>>   $sth->{mysql_charsetnr}->[$i]
>> >>> >>>
>> >>> >>>*** Fetching Metadata:
>> >>> >>>
>> >>> >>>The above is a minimum. It doesn't address metadata like field names
>> >>> >>>($sth->{NAME}) that might also be in utf8. For that the driver needs
>> >>> >>>to
>> >>> >>>know if the 'connection character set' is currently utf8.
>> >>> >>>
>> >>> >>>(The docs mention mysql->charset but it's not clear if that's part of
>> >>> >>>the public API.)
>> >>> >>>
>> >>> >>>However it's detected, the code needs to end up doing:
>> >>> >>>   (...connection charset is utf8...) ? SvUTF8_on(sv) :
>> >>> >>>   SvUTF8_off(sv);
>> >>> >>>on the metadata.
>> >>> >>>
>> >>> >>>
>> >>> >>>*** SET NAMES '...'
>> >>> >>>
>> >>> >>>Intercept SET NAMES and call the mysql_set_character_set() API
>> >>> >>>instead.
>> >>> >>>See
>> >>> >>>http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
>> >>> >>>
>> >>> >>>
>> >>> >>>*** Detecting Inconsistencies
>> >>> >>>
>> >>> >>>If the connection character set is _not_ utf8 but the application
>> >>> >>>calls
>> >>> >>>the driver with data (or SQL statement) that has the UTF8 flag set,
>> >>> >>>then
>> >>> >>>it could issue a warning. In practice that may be to be too noisy for
>> >>> >>>people that done their own workarounds for utf8 support. If so then
>> >>> >>>they could be changes to level 1 trace messages.
>> >>> >>>
>> >>> >>>If the connection character set _is_ utf8, and the application calls
>> >>> >>>the driver with data (or SQL statement) that does _not_ have the UTF8
>> >>> >>>flag set but _does_ have bytes with the high bit set, then the driver
>> >>> >>>should issue a warning. The checking for high bit set is an extra
>> >>> >>>cost
>> >>> >>>so this should only be enabled if tracing and/or an attribute is set
>> >>> >>>(perhaps called $dbh->{mysql_charset_checks} = 1)
>> >>> >>>
>> >>> >>>Tim.
>> >>> >>>     
>> >>> >>>
>> >>> 

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About