Re: [PATCH] undef IS_UTF8_CHAR() on EBCDIC

From: SADAHIRO Tomoyuki
Date: October 2, 2005 01:04
Subject: Re: [PATCH] undef IS_UTF8_CHAR() on EBCDIC
Message ID: 20051002165240.E649.BQW10602@nifty.com

Hello.

To emulate an EBCDIC platform, I ran a code snippet like the one below
with #define EBCDIC.

    for (uv = 0; uv <= PERL_UNICODE_MAX; uv++) {
        memzero(buff, UTF8_MAXBYTES);
        d = buff;
        d = uvuni_to_utf8(d, uv);
        if (!is_utf8_string(buff, d - buff))
            ++fail;
    }

Even with IS_UTF8_CHAR removed, the test still fails in 64 cases
within uv = 0x0000..0x10ffff. The failing values of uv are 00A0..00BF
(start = 0x80) and 0260..027F (start = 0xa0).

Results
(1) perl-current:
    0x3ff60 failures in uv = 0x0000..0x10ffff.
    0x443ff60 failures in uv = 0x0000..0x7fffffff.

(2) change utf8.h but not utf8.c:
    0x40 failures in uv = 0x0000..0x10ffff.
    0x4400040 failures in uv = 0x0000..0x7fffffff.

(3) change both utf8.h and is_utf8_char_slow (see the patch below):
    no failures in uv = 0x0000..0x10ffff.
    no failures in uv = 0x0000..0x7fffffff.
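
As a quick sanity check on the figures above, the arithmetic can be
sketched as follows. This only re-derives numbers already quoted in this
message; the helper names range_count and failures_above_max are made up
for illustration and are not part of perl. It shows that the two failing
ranges add up to 0x40, and that cases (1) and (2) leave the same number of
failures above 0x10ffff.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of code points in the inclusive range lo..hi. */
unsigned long range_count(unsigned long lo, unsigned long hi)
{
    return hi - lo + 1;
}

/* Hypothetical helper: failures remaining once those within
 * 0x0000..0x10ffff are subtracted from the total for 0x0000..0x7fffffff. */
unsigned long failures_above_max(unsigned long total, unsigned long within)
{
    return total - within;
}

#ifdef DEMO_MAIN  /* compile with -DDEMO_MAIN to run standalone */
int main(void)
{
    /* The two failing ranges reported for case (2). */
    unsigned long in_ranges = range_count(0x00A0UL, 0x00BFUL)
                            + range_count(0x0260UL, 0x027FUL);
    assert(in_ranges == 0x40UL);

    /* Failures above 0x10ffff in case (1) and in case (2). */
    printf("case (1): 0x%lx\n", failures_above_max(0x443ff60UL, 0x3ff60UL));
    printf("case (2): 0x%lx\n", failures_above_max(0x4400040UL, 0x40UL));
    return 0;
}
#endif
```

Both differences come out to 0x4400000, so the utf8.h change alone does
not affect the failures above 0x10ffff; those only disappear in case (3).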

I found that is_utf8_char_slow() in utf8.c tries to compute the value of
a UTF-EBCDIC sequence without first converting the start octet from
UTF-EBCDIC to its I8 form. The above failures are caused by not applying
NATIVE_TO_UTF() to the start octet.
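
To illustrate why the order matters, here is a toy sketch, not perl's
code. It assumes a made-up byte mapping (toy_native_to_i8, a simple XOR
standing in for the real conversion table) and a made-up two-octet format
with five payload bits per octet. The names toy_native_to_i8, TOY_MASK,
TOY_BITS, decode_fixed and decode_buggy are all hypothetical. The only
point is that masking the start octet before converting it yields a
different value than converting first and then masking.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a native-to-I8 octet mapping.
 * This is NOT perl's table; it is a self-inverse toy permutation. */
unsigned char toy_native_to_i8(unsigned char b)
{
    return (unsigned char)(b ^ 0x0F);
}

#define TOY_MASK 0x1F   /* assumed: five payload bits in each octet */
#define TOY_BITS 5

/* Convert the start octet first, then mask it. */
unsigned long decode_fixed(const unsigned char *s)
{
    unsigned long u = toy_native_to_i8(s[0]);
    u &= TOY_MASK;
    return (u << TOY_BITS) | (unsigned long)(toy_native_to_i8(s[1]) & TOY_MASK);
}

/* Mask the start octet while it is still in native form: the
 * continuation octet is converted, but the start octet is not. */
unsigned long decode_buggy(const unsigned char *s)
{
    unsigned long u = s[0];
    u &= TOY_MASK;
    return (u << TOY_BITS) | (unsigned long)(toy_native_to_i8(s[1]) & TOY_MASK);
}

/* Toy "native" encoding of 0xA0: the octets C5 A0 passed through the toy mapping. */
const unsigned char toy_native_seq[2] = { 0xCA, 0xAF };

#ifdef DEMO_MAIN  /* compile with -DDEMO_MAIN to run standalone */
int main(void)
{
    assert(decode_fixed(toy_native_seq) != decode_buggy(toy_native_seq));
    printf("fixed: 0x%lx\n", decode_fixed(toy_native_seq));
    printf("buggy: 0x%lx\n", decode_buggy(toy_native_seq));
    return 0;
}
#endif
```

In this toy setup decode_fixed() recovers 0xA0 while decode_buggy()
returns 0x140, because the payload bits of the start octet are taken
from the wrong byte value.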

Regards,
sadahiro tomoyuki

! utf8.c utf8.h

diff -ur perl~/utf8.c perl/utf8.c
--- perl~/utf8.c	Tue Jul 19 00:53:16 2005
+++ perl/utf8.c	Sun Oct 02 16:18:56 2005
@@ -209,6 +209,9 @@
 
     slen = len - 1;
     s++;
+#ifdef EBCDIC
+    u = NATIVE_TO_UTF(u);
+#endif
     u &= UTF_START_MASK(len);
     uv  = u;
     ouv = uv;
diff -ur perl~/utf8.h perl/utf8.h
--- perl~/utf8.h	Wed Jun 08 00:04:28 2005
+++ perl/utf8.h	Sun Oct 02 15:47:26 2005
@@ -258,6 +258,9 @@
 #endif
 #define SHARP_S_SKIP 2
 
+#ifdef EBCDIC
+/* IS_UTF8_CHAR() is not ported to EBCDIC */
+#else
 #define IS_UTF8_CHAR_1(p)	\
 	((p)[0] <= 0x7F)
 #define IS_UTF8_CHAR_2(p)	\
@@ -329,3 +332,4 @@
 
 #define IS_UTF8_CHAR_FAST(n) ((n) <= 4)
 
+#endif /* IS_UTF8_CHAR() for UTF-8 */