UTF-8 bugs in string length & single line regex matches

From: Daniel P. Berrange
Date: August 4, 2001 00:06
Subject: UTF-8 bugs in string length & single line regex matches
Message ID: 20010803113932.A19318@berrange.com

I'm in the process of converting my employer's Perl applications
to use UTF-8 throughout and have come across a couple of
interesting bugs when working with UTF-8 strings and perl 5.7.2.

The first is in the Perl_mg_length function, which causes the
string length to be reported in bytes rather than characters,
even though the UTF-8 flag is set. I've attached a patch
(against 5.7.2) containing a fix and a new test case for
t/op/length.t.
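
For illustration only (this is not the attached patch or test case), here is
a minimal sketch of the difference between character length and byte length.
It assumes that capture variables such as $1 are among the magical scalars
whose length goes through Perl_mg_length; the variable names are made up for
the example.

```perl
use strict;
use warnings;

# U+263A takes three bytes in UTF-8 but is a single character.
my $str = "a\x{263A}b";

# Character length: what length() should report for a UTF-8 flagged string.
my $chars = length($str);

# Byte length of the same string, via the lexically scoped bytes pragma.
my $bytes = do { use bytes; length($str) };

# Assumption: $1 is a magical scalar, so this is the kind of value whose
# length would be computed via Perl_mg_length.
$str =~ /^(.*)$/s or die "pattern unexpectedly failed to match";
my $captured_chars = length($1);

print "characters: $chars\n";
print "bytes:      $bytes\n";
print "length(\$1): $captured_chars\n";
```

On a perl where the behaviour is correct, both character counts should be 3
and the byte count 5. A perl affected by the bug described above would be
expected to report the byte count for the magical scalar instead.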

The second, in the regex engine, causes '.' to match against
bytes rather than characters when using the /s modifier for
the regex match. I thought I had a suitable patch, but unfortunately
it merely succeeded in breaking \C instead :-( I've attached
it anyway, as it may help someone else develop a proper patch
for this problem. I've also attached a script to demo the problem.
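
As a rough sketch of the behaviour in question (again, not the attached demo
script), the following counts how many times '.' matches under /s in a string
of three characters that occupies seven bytes in UTF-8. The variable names
are hypothetical.

```perl
use strict;
use warnings;

# Two three-byte characters around a newline: 3 characters, 7 bytes in UTF-8.
my $str = "\x{263A}\n\x{263A}";

# Under /s, '.' also matches the newline. Each '.' should consume exactly
# one character, so a global match should yield three pieces.
my @pieces = $str =~ /(.)/sg;
my $count  = scalar @pieces;

# Anchored form: exactly three '.' should span the whole string.
my $whole = ($str =~ /^...$/s) ? 1 : 0;

print "dot matches:       $count\n";
print "three dots anchor: $whole\n";
```

With character semantics the count is 3 and the anchored match succeeds. If
'.' were matching individual bytes, the count would be higher and the
anchored three-dot pattern would not cover the whole string.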

Dan.
