develooper Front page | perl.dbi.dev | Postings from May 2006

Re: Adding utf8 support to DBD::mysql

Thread Previous | Thread Next
From:
Martin J. Evans
Date:
May 13, 2006 12:44
Subject:
Re: Adding utf8 support to DBD::mysql
Message ID:
XFMail.20060513204133.martin.evans@easysoft.com
Hi,

This is really a progress report - some of it I've already sent to Patrick.

Attached is a patch to DBD::mysql 3.0003_1 which:

o fixes a few compile issues with were caused by declarations
  in the middle of a block - already sent to Patrick.

o fixes the memory corruption I previously reported on dbi-users.
  Sent to Patrick but the initial fix had problems - I've resent
  the changes to Patrick.

o introduces a small change to get utf8 data back
  All this does is check charsetnr to see if it is 33 and
  turns on the utf8 flag if it is.
  You still need to do one or both of:

$dbh->do("set character set utf8");
$dbh->do("set names utf8");

  to get utf8 back and even then you only get it back if the
  column is defined as utf8 in mysql. I'm not clear exactly which
  of the above 2 settings you need to make as there is contradictory
  evidence on the net about this.

  I've tested this a little - so far as my utf8 defined tables
  which contain utf8 data come back fine and none uf8 columns
  are untouched.

If anyone else fancies testing this or helping with information
about what you need to set (character set or names) I'd be pleased
to hear about it. My primary interest is fairly selfish in that I
need utf8 support but I'm happy to share what I find/change but I
also don't want to step on Patrick's shoes - as such this is cc'ed to
Patrick (and I also appologise for the first attempt I sent
you for fixing the memory corruption - my mistake - I sent an
out of date version).

I should probably also point out that this patch is against
a development release of dbd::mysql and all that implies.

There was also a mail from Henri on dbi-users about using TYPE=>SQL_INTEGER
in bind_param call - I don't do this and can't (currently from the existing
code) see how it would work.

As an aside, if anyone knows a definite list of charsetnr values
in mysql I would be interested in it as my patch is based on
what I've seen rather than what I've read anywhere. I did try
adding get_charset calls to dbdimp.c for DBD::mysql to find
the charset name but to do this I needed to add a load of #includes
to dbdimp.h which caused a lot of compiler warnings. This was
based on code I found at http://bugs.mysql.com/bug.php?id=6911.

Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com


On 04-May-2006 Martin J. Evans wrote:
> Tim,
> 
> On 04-May-2006 Tim Bunce wrote:
>> On Sun, Apr 30, 2006 at 01:36:04PM -0700, Patrick Galbraith wrote:
>>> Martin J. Evans wrote:
>>> 
>>> Martin,
>>> 
>>> Thanks much! This is dbdimp.c, right? I  will add this tomorrow (not 
>>> working today), and test it out.
> 
>  
>> Please don't use only is_high_bit_set() to enable UTF8.  That'll break
>> any code that is storing non-utf8 data that happens to have the high-bit
>> set.
>> 
>> Please make sure the test cases cover this situation. It's not enough
>> to get 'utf8 working' its also important to not break existing code.
>> 
>> Using the 'charsetnr' value (see below) looks far more correct. That way
>> perl will treat the values as UTF8 only if mysql was treating it as UTF8.
> 
> Sorry, I should have made it clearer it was only a demonstration that utf8
> can
> work with mysql as someone had been asking that. I had already told Patrick
> that off the list. I fully realised that hack would break 8 bit chrsets.
> 
> I have already started looking at charsetnr but have run into a number of
> issues due to the way charsetnr has changed over different versions of mysql.
> 
> Martin
> --
> Martin J. Evans
> Easysoft Ltd, UK
> http://www.easysoft.com
> 
> 
>>> >>>The keys mysql docs seem to be
>>> >>>http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
>>> >>>
>>> >>>The mysql api and client->server protocol doesn't support passing
>>> >>>characterset info to the server on a per-statement / per-bind value 
>>> >>>basis.
>>> >>>(http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datatype
>>> >>>s
>>> >>>.html)
>>> >>>So the sane way to send utf8 to the server is by setting the 'connection
>>> >>>character set' to utf8 and then only sending utf8 (or its ASCII subset)
>>> >>>to the server on that connection.
>>> >>>
>>> >>>*** Fetching data:
>>> >>>
>>> >>>MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD structure.
>>> >>>It's the "character set number for the field".
>>> >>>
>>> >>>So set the UTF8 flag based on that value. Something like:
>>> >>>   (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>> >>>I couldn't see any docs for the values of the charsetnr field.
>>> >>>
>>> >>>Also, would be good to enable perl code to access the charsetnr values:
>>> >>>   $sth->{mysql_charsetnr}->[$i]
>>> >>>
>>> >>>*** Fetching Metadata:
>>> >>>
>>> >>>The above is a minimum. It doesn't address metadata like field names
>>> >>>($sth->{NAME}) that might also be in utf8. For that the driver needs to
>>> >>>know if the 'connection character set' is currently utf8.
>>> >>>
>>> >>>(The docs mention mysql->charset but it's not clear if that's part of
>>> >>>the public API.)
>>> >>>
>>> >>>However it's detected, the code needs to end up doing:
>>> >>>   (...connection charset is utf8...) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>> >>>on the metadata.
>>> >>>
>>> >>>
>>> >>>*** SET NAMES '...'
>>> >>>
>>> >>>Intercept SET NAMES and call the mysql_set_character_set() API instead.
>>> >>>See http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
>>> >>>
>>> >>>
>>> >>>*** Detecting Inconsistencies
>>> >>>
>>> >>>If the connection character set is _not_ utf8 but the application calls
>>> >>>the driver with data (or SQL statement) that has the UTF8 flag set, then
>>> >>>it could issue a warning. In practice that may be to be too noisy for
>>> >>>people that done their own workarounds for utf8 support. If so then
>>> >>>they could be changes to level 1 trace messages.
>>> >>>
>>> >>>If the connection character set _is_ utf8, and the application calls
>>> >>>the driver with data (or SQL statement) that does _not_ have the UTF8
>>> >>>flag set but _does_ have bytes with the high bit set, then the driver
>>> >>>should issue a warning. The checking for high bit set is an extra cost
>>> >>>so this should only be enabled if tracing and/or an attribute is set
>>> >>>(perhaps called $dbh->{mysql_charset_checks} = 1)
>>> >>>
>>> >>>Tim.
>>> >>>     
>>> >>>
>>> 

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About