
[PATCH] undef IS_UTF8_CHAR() on EBCDIC

From: SADAHIRO Tomoyuki
Date: September 30, 2005 22:12
Subject: [PATCH] undef IS_UTF8_CHAR() on EBCDIC
Message ID: 20051001141153.A382.BQW10602@nifty.com

Hello,

The following is part of the preprocessor output for utf8.c
with the identifier EBCDIC defined.
(Some line breaks have been inserted for clarity.)

++++++++++++++++++++++++++
bool
Perl_is_utf8_string(pTHX_ const U8 *s, STRLEN len)
{
    const U8* x = s;
    const U8* send;

    if (!len && s)
	len = strlen((const char *)s);
    send = s + len;

    while (x < send) {
	STRLEN c;
	 /* Inline the easy bits of is_utf8_char() here for speed... */
	 if (((PL_e2utf[(U8)(*x)]) < 0xA0))
	      c = 1;
	 else if (!(PL_e2utf[(U8)(*x)] >= 0xA0 && (PL_e2utf[(U8)(*x)] & 0xE0) != 0xA0))
	     goto out;
	 else {
	      /* ... and call is_utf8_char() only if really needed. */

	     c = PL_utf8skip[*(const U8*)x];
	     if (((c) <= 4)) {
	         if (!((c) == 1 ? ((x)[0] <= 0x7F) : (c) == 2 ?
((x)[0] >= 0xC2 && (x)[0] <= 0xDF && (x)[1] >= 0x80 && (x)[1] <= 0xBF) :
(c) == 3 ? (((x)[0] == 0xE0 && (x)[1] >= 0xA0 && (x)[1] <= 0xBF &&
(x)[2] >= 0x80 && (x)[2] <= 0xBF) || ((x)[0] >= 0xE1 && (x)[0] <= 0xEC &&
(x)[1] >= 0x80 && (x)[1] <= 0xBF && (x)[2] >= 0x80 && (x)[2] <= 0xBF) ||
((x)[0] == 0xED && (x)[1] >= 0x80 && (x)[1] <= 0xBF && (x)[2] >= 0x80 &&
(x)[2] <= 0xBF) || ((x)[0] >= 0xEE && (x)[0] <= 0xEF && (x)[1] >= 0x80 &&
(x)[1] <= 0xBF && (x)[2] >= 0x80 && (x)[2] <= 0xBF)) : (c) == 4 ?
(((x)[0] == 0xF0 && (x)[1] >= 0x90 && (x)[1] <= 0xBF && (x)[2] >= 0x80 &&
(x)[2] <= 0xBF && (x)[3] >= 0x80 && (x)[3] <= 0xBF) || ((x)[0] >= 0xF1 &&
(x)[0] <= 0xF3 && (x)[1] >= 0x80 && (x)[1] <= 0xBF && (x)[2] >= 0x80 &&
(x)[2] <= 0xBF && (x)[3] >= 0x80 && (x)[3] <= 0xBF) || ((x)[0] == 0xF4 &&
(x)[0] <= 0xF7 && (x)[1] >= 0x80 && (x)[1] <= 0xBF && (x)[2] >= 0x80 &&
(x)[2] <= 0xBF && (x)[3] >= 0x80 && (x)[3] <= 0xBF)) : 0))
		     goto out;
	     } else if (!is_utf8_char_slow(x, c))
	         goto out;



	      if (!c)
		  goto out;
	 }
        x += c;
    }

 out:
    if (x != send)
	return FALSE;

    return TRUE;
}
++++++++++++++++++++++++++

On an EBCDIC platform, is_utf8_string() must check whether the string
is valid * UTF-EBCDIC *.

But is_utf8_string() wrongly uses the IS_UTF8_CHAR() macro, so for
everything except INVARIANT characters it checks whether the string
is valid * UTF-8 * instead.
Moreover, is_utf8_char() and is_utf8_string_loclen() have the
same problem. That makes no sense.

Thus I presume perl has often handled multiple-octet UTF-EBCDIC
characters incorrectly since the introduction of IS_UTF8_CHAR().
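
As a rough illustration of the mismatch (a sketch only, not code from
the perl source), compare the two-byte UTF-8 range test from the
expanded macro with a two-byte shape test for the intermediate I8 form
of UTF-EBCDIC. The I8 test is my assumption: it reuses the 0xE0/0xA0
mask visible in the inlined code and checks shape only, ignoring
overlong forms. The function names are hypothetical.

```c
#include <stdbool.h>

/* Two-byte UTF-8 range test, copied from the expanded macro in the
 * preprocessor output quoted in this message. */
bool demo_utf8_two_byte(const unsigned char *p)
{
    return p[0] >= 0xC2 && p[0] <= 0xDF && p[1] >= 0x80 && p[1] <= 0xBF;
}

/* Hypothetical two-byte shape test for the I8 form: lead byte 110xxxxx,
 * continuation byte 101xxxxx (same 0xE0/0xA0 mask as the inlined code).
 * Shape only; overlong lead bytes are not rejected here. */
bool demo_i8_two_byte(const unsigned char *p)
{
    return (p[0] & 0xE0) == 0xC0 && (p[1] & 0xE0) == 0xA0;
}
```

For the bytes 0xC2 0x80 the UTF-8 test succeeds while the I8 shape test
fails, because 0x80 is not a continuation byte in I8. Note also that the
inlined code applies PL_e2utf before its 0xA0 tests, whereas the
expanded macro compares the raw bytes directly.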

The patch below leaves IS_UTF8_CHAR() undefined when EBCDIC
is defined.
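
The idea can be sketched as follows. This is a minimal, self-contained
sketch and not the actual utf8.c code: DEMO_EBCDIC, the DEMO_ macro and
the demo_ functions are hypothetical names, and I am assuming that
callers guard their use of the fast macro with #ifdef so that they fall
back to a slower function when it is missing.

```c
#include <stdbool.h>

/* DEMO_EBCDIC is a hypothetical stand-in for perl's EBCDIC symbol.
 * Uncomment the next line to simulate an EBCDIC build. */
/* #define DEMO_EBCDIC */

#ifdef DEMO_EBCDIC
/* The fast UTF-8 macro is deliberately left undefined here. */
#else
#define DEMO_IS_UTF8_CHAR_2(p) \
    ((p)[0] >= 0xC2 && (p)[0] <= 0xDF && (p)[1] >= 0x80 && (p)[1] <= 0xBF)
#endif

/* Placeholder slow path. A real one would be aware of the platform
 * encoding; this one only checks a generic lead/continuation shape so
 * that the sketch compiles and runs in either configuration. */
bool demo_two_byte_slow(const unsigned char *p)
{
    return (p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80;
}

/* Caller: use the fast macro only if it was defined for this build. */
bool demo_two_byte(const unsigned char *p)
{
#ifdef DEMO_IS_UTF8_CHAR_2
    return DEMO_IS_UTF8_CHAR_2(p);
#else
    return demo_two_byte_slow(p);
#endif
}

/* Reports which path was compiled in. */
bool demo_fast_path_compiled(void)
{
#ifdef DEMO_IS_UTF8_CHAR_2
    return true;
#else
    return false;
#endif
}
```

With DEMO_EBCDIC undefined the fast macro is used. Defining it removes
the macro, and the caller then goes through the slow function without
any change to its own code.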

Regards,
sadahiro tomoyuki

diff -ur perl~/utf8.h perl/utf8.h
--- perl~/utf8.h	Wed Jun 08 00:04:28 2005
+++ perl/utf8.h	Sat Oct 01 13:35:02 2005
@@ -258,6 +258,9 @@
 #endif
 #define SHARP_S_SKIP 2
 
+#ifdef EBCDIC
+/* IS_UTF8_CHAR() is not ported to EBCDIC */
+#else
 #define IS_UTF8_CHAR_1(p)	\
 	((p)[0] <= 0x7F)
 #define IS_UTF8_CHAR_2(p)	\
@@ -329,3 +332,4 @@
 
 #define IS_UTF8_CHAR_FAST(n) ((n) <= 4)
 
+#endif /* IS_UTF8_CHAR() for UTF-8 */






