perl.perl5.porters | Postings from November 2003

Re: [perl #24328] Case insensitivity and utf-8 in regexes

From: Slaven Rezic
Date: November 1, 2003 17:30
Subject: Re: [perl #24328] Case insensitivity and utf-8 in regexes
Message ID: 878ymzsikx.fsf@vran.herceg.de

Matthew Lawrence (via RT) <perlbug-followup@perl.org> writes:

> Content-Type: text/plain
> Content-Disposition: inline
> 
> # New Ticket Created by  Matthew Lawrence 
> # Please include the string:  [perl #24328]
> # in the subject line of all future correspondence about this issue. 
> # <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=24328 >
> 
> 
> utf-8 support in case-insensitive regular expressions seems to be broken
> when used in conjunction with the utf8 pragma.
> 
> Please see attached test program and local results.
> 
> $ /usr/bin/perl -v
> 
> This is perl, v5.8.1 built for i686-linux
> 
> Matt
> 
> -- 
> 
> Matthew Lawrence
> Senior Programmer
> Euro RSCG Wnek Gosper Interaction
> 
> +44 (0) 20 7022 4535
> 
> 
> 
> 
> 
> -- attachment  1 ------------------------------------------------------
> url: http://rt.perl.org/rt2/attach/66594/49746/10f25b/utf8_test.pl
> 
> -- attachment  2 ------------------------------------------------------
> url: http://rt.perl.org/rt2/attach/66594/49747/1ca6bd/utf8_test_results
> 

It seems to me that this is caused by a faulty implementation of
Perl_ibcmp_utf8 in utf8.c. This function is supposed to handle both
utf8 and native encoded strings, but I do not see any
encoding-to-utf8 conversion in it.
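
To illustrate why the missing conversion matters, here is a minimal
standalone sketch, assuming the native encoding is ISO-8859-1. The
helper latin1_byte_to_utf8 is hypothetical and not part of the Perl
API; it only shows that a native byte above 0x7F is not valid utf8 on
its own and needs two bytes once converted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Perl API function): encode one ISO-8859-1
 * byte as UTF-8 into buf, which must hold at least 2 bytes.
 * Returns the number of bytes written (1 or 2). */
size_t latin1_byte_to_utf8(unsigned char c, unsigned char *buf)
{
    assert(buf != NULL);
    if (c < 0x80) {
        /* ASCII is identical in both encodings. */
        buf[0] = c;
        return 1;
    }
    /* U+0080..U+00FF: lead byte 0xC2 or 0xC3, then one continuation byte. */
    buf[0] = (unsigned char)(0xC0 | (c >> 6));
    buf[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

For example, the native byte 0xE4 becomes the two bytes 0xC3 0xA4, so a
one-byte buffer that merely copies the native byte cannot be handed to
a routine that expects utf8 input.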

Here's a test case with Inline::C (call it with two strings, utf8 or
non-utf8, and two booleans that flag whether each string is utf8):



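use Inline C => DATA;
# Arguments: string a, string b, utf8 flag for a, utf8 flag for b
test(@ARGV[0..3]);

__DATA__
__C__
/* Prints 0 if a and b are equal ignoring case, nonzero otherwise. */
void test(char *a, char *b,
	  int u1, int u2) {
  printf("%d\n", (int)ibcmp_utf8(a, 0, strlen(a), u1,
				 b, 0, strlen(b), u2));
}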



A naive fix would be:

--- bleedperl/utf8.c	Sun Sep 21 12:08:11 2003
+++ bleedperl2/utf8.c	Sun Nov  2 01:40:25 2003
@@ -1936,7 +1936,7 @@
      STRLEN n1 = 0, n2 = 0;
      U8 foldbuf1[UTF8_MAXLEN_FOLD+1];
      U8 foldbuf2[UTF8_MAXLEN_FOLD+1];
-     U8 natbuf[1+1];
+     U8 natbuf[UTF8_MAXLEN+1];
      STRLEN foldlen1, foldlen2;
      bool match;
      
@@ -1963,7 +1963,8 @@
 	       if (u1)
 		    to_utf8_fold(p1, foldbuf1, &foldlen1);
 	       else {
-		    natbuf[0] = *p1;
+		    U8 *d = uvchr_to_utf8(natbuf, *p1);
+		    *d = 0;
 		    to_utf8_fold(natbuf, foldbuf1, &foldlen1);
 	       }
 	       q1 = foldbuf1;
@@ -1973,7 +1974,8 @@
 	       if (u2)
 		    to_utf8_fold(p2, foldbuf2, &foldlen2);
 	       else {
-		    natbuf[0] = *p2;
+		    U8 *d = uvchr_to_utf8(natbuf, *p2);
+		    *d = 0;
 		    to_utf8_fold(natbuf, foldbuf2, &foldlen2);
 	       }
 	       q2 = foldbuf2;

but I fear that the current encoding, and maybe other things, have to
be taken into account. Unicode experts?
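
As a rough illustration of what a mixed-encoding comparison has to do,
here is a standalone sketch, again assuming ISO-8859-1 as the native
encoding and restricted to code points below 0x100. The names next_cp,
fold_latin1 and ibcmp_sketch are hypothetical, and the folding is a
simple lowercase mapping, not the full Unicode case folding that
to_utf8_fold performs (for example, it does not fold sharp s to "ss").

```c
#include <assert.h>
#include <stddef.h>

/* Read one code point below 0x100 from s. Returns the number of bytes
 * consumed, or 0 if the input is empty or outside this sketch's range. */
size_t next_cp(const unsigned char *s, size_t len, int is_utf8, unsigned *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (!is_utf8 || s[0] < 0x80) {      /* native byte or ASCII */
        *cp = s[0];
        return 1;
    }
    /* Two-byte UTF-8 sequence for U+0080..U+00FF. */
    if (len >= 2 && (s[0] == 0xC2 || s[0] == 0xC3) && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
        return 2;
    }
    return 0;
}

/* Simplified lowercase mapping for ASCII and Latin-1 letters only. */
unsigned fold_latin1(unsigned cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)  /* 0xD7 is the multiplication sign */
        return cp + 0x20;
    return cp;
}

/* Returns 0 if a and b are equal ignoring case, 1 otherwise
 * (the same convention as ibcmp_utf8). */
int ibcmp_sketch(const unsigned char *a, size_t la, int u1,
                 const unsigned char *b, size_t lb, int u2)
{
    while (la > 0 && lb > 0) {
        unsigned c1, c2;
        size_t n1 = next_cp(a, la, u1, &c1);
        size_t n2 = next_cp(b, lb, u2, &c2);
        if (n1 == 0 || n2 == 0 || fold_latin1(c1) != fold_latin1(c2))
            return 1;
        a += n1; la -= n1;
        b += n2; lb -= n2;
    }
    return (la == 0 && lb == 0) ? 0 : 1;
}
```

The point of the sketch is that each side is decoded according to its
own flag before folding, so the native byte 0xE4 and the utf8 bytes
0xC3 0x84 compare as equal, while the same bytes compared with both
flags off do not.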

Regards,
	Slaven

-- 
Slaven Rezic - slaven@rezic.de

    tkruler - Perl/Tk program for measuring screen distances
    http://ptktools.sourceforge.net/#tkruler
