ebcdic <-> ascii tables interjected in uv <-> utf8 considered harmful

From: Peter Prymmer
Date: December 5, 2000 17:56
Subject: ebcdic <-> ascii tables interjected in uv <-> utf8 considered harmful
Message ID: Pine.OSF.4.10.10012051757390.152418-100000@aspara.forte.com

Executive summary: it does not work (fortunately ?)

In light of the discussion from a week and a half ago I've gone ahead and
tested a perl@7979.tgz kit on OS/390 V2R5 with a partial implementation
of what might have been a way to put ebcdic<->iso-8859-1 into the uv<->utf8
converter in utf8.c (which is AFAIK the only place where the uv<->utf8 
algorithms lie within perl's guts - corrections welcome on this).
(The implementation was partial in that I did not yet bother to include the 
tables necessary for the POSIX-BC or CP 0037 EBCDIC coded character sets'
translation to and from ISO 8859-1.)  But it does not seem to help; indeed
it harms, as gauged by runs of `make test`.

One of the big reasons we could suspect that such tables might be necessary is 
that the characters in the range 128 .. 255 map to two bytes in the UTF-8 
representation.  But on EBCDIC platforms the ordinary printing chars such as 
both cases of the alphabet, the digits, and some punctuation have the 8th bit 
set.  Hence to the extent that Perl's internal UTF-8 string handling cares 
about things such as the width of 'A' it could prove important to do such 
translations.  A generalization of this suspicion is that Perl actually cares 
what characters lie in the 128..255 range - which might imply that to get Perl 
to behave properly in a non ISO 8859-1 locale (say, e.g. an Eastern European 
one running under ISO 8859-2) what we might need is a locale sensitive way to 
carry out the uv <-> utf8 mappings.  However the failure of the IBM CP 
1047<->iso-8859-1 mappings that I tested implies that we probably do not have 
to worry about locale sensitivity training.  The "failure" in this context 
meant a failure to improve the `make test` results (indeed the patched perl 
failed more tests).  So apparently Perl does not really care what the 
character interpretation of the 128..255 characters is.
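
To make the width concern concrete, here is a minimal sketch of plain UTF-8
encoding for code points below 0x800.  It is not perl's uv_to_utf8; the
function name utf8_encode_small is made up for illustration.  With it, 65
('A' in ISO 8859-1) takes one byte while 193 ('A' in IBM-1047) takes two:

```c
#include <stddef.h>

/* Encode cp as plain UTF-8 into out[] and return the number of bytes
 * written.  Only code points below 0x800 are handled; anything larger
 * returns 0 because this sketch stops at two-byte sequences. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                   /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* 10xxxxxx */
        return 2;
    }
    return 0;
}
```

Calling utf8_encode_small(193, buf) fills buf with 195 and 129, i.e. the
two-byte form that every EBCDIC letter and digit would get without any
translation step.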

Here were the `make test` results:

-----------------------------------
perl@7979:
Failed 16 test scripts out of 258, 93.80% okay.

perl@7979 + enclosed patch:
Failed 20 test scripts out of 258, 92.25% okay.

perl@7979 + enclosed patch + reversion of lib/utf8.pm to op:
Failed 20 test scripts out of 258, 92.25% okay.
-----------------------------------

So the patch actually made more tests fail.  Here are excerpts from
the diff of the output logs (generated with `make test > make.out 2>&1`)

-----------------------
$ diff -u perl@7979/make.out perl@7979.patched/make.out
[snip]
-comp/require.........String found where operator expected at bleah.pm line 2, 
"
-  (Might be a runaway multi-line "" string starting on line 1)
-       (Missing semicolon on previous line?)
-String found where operator expected at bleah.pm line 1, near "BpBrBiBnBt 
"BoB"
+comp/require.........String found where operator expected at bleah.pm line 1, 
"
        (Do you need to predeclare BpBrBiBnBt?)
 String found where operator expected at bleah.pm line 1, near "BpBrBiBnBt 
"BoB"
        (Do you need to predeclare BpBrBiBnBt?)
 FAILED at test 13
[snip]
-comp/use.............ok
+comp/use.............FAILED at test 16
[snip]
-op/append............ok
+op/append............FAILED at test 8
[snip]
-op/bop...............FAILED at test 22
+op/bop...............FAILED at test 37
[snip]
-op/tr................ok
+op/tr................CEE5213S The signal SIGPIPE was received.
+FAILED at test 16
[snip]
-op/vec...............ok
+op/vec...............FAILED at test 27
[snip]
-lib/charnames........FAILED at test 12
+lib/charnames........FAILED at test 11
-----------------------

The extra failures in op/tr and op/vec for the patched perl are clearly
coded character set failures, but I think that they indicate a failure
of this patch's approach since the special ebcdic tests are still passing.
Note this result with the patched binary:

$ ./perl op/tr.t
1..29
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
not (100.125.60) ok 13
not (100.125.60) ok 14
ok 15
not ok 16
[snip]

FWIW here was the patch that I had tested.  I know I can identify
OS/390 with the __MVS__ macro and BS/2000 with the _OSD_POSIX macro
but I do not know what the system identifier preprocessor macros for 
VM/ESA, VSE/ESA, or OS/400 are.  I asked what they were on an as400
newsgroup (asking for personal email response since my news access is spotty)
about a year ago and never received any response.  Hence the test of
the value of the '^' character on EBCDIC platforms:

diff -ru perl.7979.orig/perl.h perl/perl.h
--- perl.7979.orig/perl.h	Sun Dec  3 17:57:42 2000
+++ perl/perl.h	Mon Dec  4 17:06:56 2000
@@ -2405,6 +2405,80 @@
 
 #ifdef DOINIT
 #ifdef EBCDIC
+#if '^' == 106  /* if defined(_OSD_POSIX) POSIX-BC */
+#endif          /* POSIX-BC */
+#if '^' == 176  /* if defined(??) (OS/400?) 037 */
+#endif          /* 037 */
+#if '^' == 95   /* if defined(__MVS__) || defined(??) (VM/ESA?) 1047 */
+EXT unsigned char PL_e2a[] = { /* ASCII (ISO8859-1) to EBCDIC (IBM-1047) */
+    0,      1,      2,      3,      55,     45,     46,     47,
+    22,     5,      21,     11,     12,     13,     14,     15,
+    16,     17,     18,     19,     60,     61,     50,     38,
+    24,     25,     63,     39,     28,     29,     30,     31,
+    64,     90,     127,    123,    91,     108,    80,     125,
+    77,     93,     92,     78,     107,    96,     75,     97,
+    240,    241,    242,    243,    244,    245,    246,    247,
+    248,    249,    122,    94,     76,     126,    110,    111,
+    124,    193,    194,    195,    196,    197,    198,    199,
+    200,    201,    209,    210,    211,    212,    213,    214,
+    215,    216,    217,    226,    227,    228,    229,    230,
+    231,    232,    233,    173,    224,    189,    95,     109,
+    121,    129,    130,    131,    132,    133,    134,    135,
+    136,    137,    145,    146,    147,    148,    149,    150,
+    151,    152,    153,    162,    163,    164,    165,    166,
+    167,    168,    169,    192,    79,     208,    161,    7,
+    32,     33,     34,     35,     36,     37,     6,      23,
+    40,     41,     42,     43,     44,     9,      10,     27,
+    48,     49,     26,     51,     52,     53,     54,     8,
+    56,     57,     58,     59,     4,      20,     62,     255,
+    65,     170,    74,     177,    159,    178,    106,    181,
+    187,    180,    154,    138,    176,    202,    175,    188,
+    144,    143,    234,    250,    190,    160,    182,    179,
+    157,    218,    155,    139,    183,    184,    185,    171,
+    100,    101,    98,     102,    99,     103,    158,    104,
+    116,    113,    114,    115,    120,    117,    118,    119,
+    172,    105,    237,    238,    235,    239,    236,    191,
+    128,    253,    254,    251,    252,    186,    174,    89,
+    68,     69,     66,     70,     67,     71,     156,    72,
+    84,     81,     82,     83,     88,     85,     86,     87,
+    140,    73,     205,    206,    203,    207,    204,    225,
+    112,    221,    222,    219,    220,    141,    142,    223
+};
+EXT unsigned char PL_a2e[] = { /* EBCDIC (IBM-1047) to ASCII (ISO8859-1) */
+    0,      1,      2,      3,      156,    9,      134,    127,
+    151,    141,    142,    11,     12,     13,     14,     15,
+    16,     17,     18,     19,     157,    10,     8,      135,
+    24,     25,     146,    143,    28,     29,     30,     31,
+    128,    129,    130,    131,    132,    133,    23,     27,
+    136,    137,    138,    139,    140,    5,      6,      7,
+    144,    145,    22,     147,    148,    149,    150,    4,
+    152,    153,    154,    155,    20,     21,     158,    26,
+    32,     160,    226,    228,    224,    225,    227,    229,
+    231,    241,    162,    46,     60,     40,     43,     124,
+    38,     233,    234,    235,    232,    237,    238,    239,
+    236,    223,    33,     36,     42,     41,     59,     94,
+    45,     47,     194,    196,    192,    193,    195,    197,
+    199,    209,    166,    44,     37,     95,     62,     63,
+    248,    201,    202,    203,    200,    205,    206,    207,
+    204,    96,     58,     35,     64,     39,     61,     34,
+    216,    97,     98,     99,     100,    101,    102,    103,
+    104,    105,    171,    187,    240,    253,    254,    177,
+    176,    106,    107,    108,    109,    110,    111,    112,
+    113,    114,    170,    186,    230,    184,    198,    164,
+    181,    126,    115,    116,    117,    118,    119,    120,
+    121,    122,    161,    191,    208,    91,     222,    174,
+    172,    163,    165,    183,    169,    167,    182,    188,
+    189,    190,    221,    168,    175,    93,     180,    215,
+    123,    65,     66,     67,     68,     69,     70,     71,
+    72,     73,     173,    244,    246,    242,    243,    245,
+    125,    74,     75,     76,     77,     78,     79,     80,
+    81,     82,     185,    251,    252,    249,    250,    255,
+    92,     247,    83,     84,     85,     86,     87,     88,
+    89,     90,     178,    212,    214,    210,    211,    213,
+    48,     49,     50,     51,     52,     53,     54,     55,
+    56,     57,     179,    219,    220,    217,    218,    159
+};
+#endif          /* 1047 */
 EXT unsigned char PL_fold[] = { /* fast EBCDIC case folding table */
     0,      1,      2,      3,      4,      5,      6,      7,
     8,      9,      10,     11,     12,     13,     14,     15,
@@ -2477,6 +2551,10 @@
 #endif  /* !EBCDIC */
 #else
 EXTCONST unsigned char PL_fold[];
+#ifdef EBCDIC
+EXTCONST unsigned char PL_e2a[];
+EXTCONST unsigned char PL_a2e[];
+#endif /* EBCDIC */
 #endif
 
 #ifdef DOINIT
diff -ru perl.7979.orig/utf8.c perl/utf8.c
--- perl.7979.orig/utf8.c	Sun Dec  3 11:47:54 2000
+++ perl/utf8.c	Mon Dec  4 17:06:57 2000
@@ -29,6 +29,10 @@
 U8 *
 Perl_uv_to_utf8(pTHX_ U8 *d, UV uv) /* the d must be UTF8_MAXLEN+1 deep */
 {
+#ifdef EBCDIC
+    if (uv <= 0xff)
+        uv = (UV)PL_e2a[uv];
+#endif
     if (uv < 0x80) {
 	*d++ = uv;
 	*d   = 0;
@@ -218,7 +222,11 @@
     if (uv <= 0x7f) { /* Pure ASCII. */
 	if (retlen)
 	    *retlen = 1;
+#ifdef EBCDIC
+	return (UV)PL_a2e[*s];
+#else
 	return *s;
+#endif
     }
 
     if ((uv >= 0x80 && uv <= 0xbf) &&
@@ -326,7 +334,14 @@
 	goto malformed;
     }
 
+#ifdef EBCDIC
+    if (uv <= 0xff) 
+        return (UV)PL_a2e[uv];
+    else 
+        return uv;
+#else
     return uv;
+#endif
 
 malformed:
 
End of "diff" - not a patch.
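
As a small standalone illustration of the two ideas in that diff (choosing
the coded character set from the value of '^', and translating through a
pair of tables), here is a sketch that does not depend on perl.h.  The names
charset_from_caret, sample_pairs, latin1_to_cp1047 and cp1047_to_latin1 are
hypothetical, and only six pairs are copied from the tables above rather
than all 256 entries.  Note that in the diff the table indexed by ISO 8859-1
values (entry 65 holds 193) is the one named PL_e2a, so the sketch names its
functions by direction instead of reusing those identifiers:

```c
#include <stddef.h>

/* Label the coded character set implied by the numeric value of '^',
 * following the #if '^' == ... tests in the diff.  The 94 case (ASCII)
 * is added for completeness; the strings are only labels. */
const char *charset_from_caret(int caret)
{
    switch (caret) {
    case 94:  return "ISO 8859-1";
    case 95:  return "IBM-1047";
    case 106: return "POSIX-BC";
    case 176: return "IBM-037";
    default:  return "unknown";
    }
}

/* A few (ISO 8859-1, IBM-1047) pairs read off the two tables:
 * newline, space, '0', 'A', '^', 'a'. */
struct cs_pair { int latin1; int cp1047; };
static const struct cs_pair sample_pairs[] = {
    { 10, 21 }, { 32, 64 }, { 48, 240 }, { 65, 193 }, { 94, 95 }, { 97, 129 }
};
static const size_t n_pairs = sizeof sample_pairs / sizeof sample_pairs[0];

/* Translate ISO 8859-1 -> IBM-1047; -1 if c is not in the excerpt. */
int latin1_to_cp1047(int c)
{
    size_t i;
    for (i = 0; i < n_pairs; i++)
        if (sample_pairs[i].latin1 == c)
            return sample_pairs[i].cp1047;
    return -1;
}

/* Translate IBM-1047 -> ISO 8859-1; -1 if c is not in the excerpt. */
int cp1047_to_latin1(int c)
{
    size_t i;
    for (i = 0; i < n_pairs; i++)
        if (sample_pairs[i].cp1047 == c)
            return sample_pairs[i].latin1;
    return -1;
}
```

On whatever platform this is compiled for, charset_from_caret('^') picks the
label, and cp1047_to_latin1(latin1_to_cp1047(65)) returns 65 again, which is
the round trip the full tables are meant to provide for all 256 values.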

And as a bonus here is a little program that can be used to generate utf8 
tables on ascii or ebcdic machines by explicitly testing perl's utf8 
transformation algorithms.  It can be compiled and linked just like the 
perl binary is:

/*

Date: Mon, 27 Nov 2000 18:50:50 -0800 (PST)
From: Peter Prymmer <pvhp>
Subject: utf8_tbl.c 

Particularly interesting entries include:

   ./utf8_tbl 127 128
   ./utf8_tbl 255 256
   ./utf8_tbl 2047 2048
   ./utf8_tbl 65535 65536
   ./utf8_tbl 2097151 2097152
   ./utf8_tbl 67108863 67108864
   ./utf8_tbl 2147483647 2147483648  # warning: atoi cannot fit latter number
                                     # into UVTYPE (unsigned long) on linux-ppc
                         # also: UVTYPE is unsigned long on VMS
                         # On either linux-ppc or VMS 
                         # (UVTYPE)2147483648 -> (maps to) -> -2147483648
                         # (UVTYPE)2147483649 -> (maps to) -> -2147483647
                         # On OS/390 there is no unsigned long long and no
                         # strtoull but there is strtoul and it 
                         # automatically converts 2147483648 and higher to
                         # 2147483647.
*/

#include <stdlib.h>    /* strtoull, strtoul, or atoi   */
#include <stdio.h>     /* printf */
#include "embed.h"
#include "EXTERN.h"
#include "perl.h"

int main(int argc, char *argv[])
{
/* unsigned long long is typical see `grep U64TYPE config.h` */
#if defined(U64TYPE) && ! defined(__MVS__)
    U64TYPE start = 0;
    U64TYPE stop = 255;
    U64TYPE i;
# ifdef UVuf
# undef UVuf        /* "lu" is typical */
# define UVuf "Lu"  /* ok for gcc on linux-ppc, DECC on VMS */
# endif
    U64TYPE uv;
#else
    UVTYPE start = 0;
    UVTYPE stop = 255;
    UVTYPE i;
    UV uv;
#endif
    int j;
    U8 * d;
    UV uvr = 0;
    STRLEN u8len, *uvlen;

 /*   U32 flags = 0;                   */
 /*   U32 flags = UTF8_CHECK_ONLY;     */
    U32 flags = UTF8_ALLOW_ANY ;   /*  */

    /* uv_to_utf8 wants a buffer UTF8_MAXLEN+1 deep; {0} zero-fills all of it */
    U8 tmpbuf[UTF8_MAXLEN+1] = {0};
    U8 *cpybuf;

    if (argc == 2) {
#ifdef HAS_STRTOULL
        stop = strtoull(argv[1],(char **)NULL,10);
#else
# ifdef HAS_STRTOUL
        stop = strtoul(argv[1],(char **)NULL,10);
# else
        stop = atoi(argv[1]);
# endif
#endif
    }
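    if (argc == 3) {
#ifdef HAS_STRTOULL
        start = strtoull(argv[1],(char **)NULL,10);
        stop = strtoull(argv[2],(char **)NULL,10);
#else
# ifdef HAS_STRTOUL
        start = strtoul(argv[1],(char **)NULL,10);
        stop = strtoul(argv[2],(char **)NULL,10);
# else
        start = atoi(argv[1]);
        stop = atoi(argv[2]);
# endif
#endif
    }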
    if (stop < start) {
        printf("Usage:\n\t%s (0..255)\n\t%s stop (0..stop)\n\t%s start stop\n",
                 argv[0],argv[0],argv[0]);
        return(1);
    }

    printf("Start: %"UVuf"\tStop: %"UVuf"\n",start,stop);
    printf("i\t->\tutf8 (u8len)\t->\tuv_ret (uvlen)\t->\tuv_cast\n");

    for (i = start; i <= stop; i++) {

#ifdef U64TYPE
        uv = (U64TYPE) i;
        d = uv_to_utf8( tmpbuf, (UV)uv); 
#else
        uv = (UV) i;
        d = uv_to_utf8( tmpbuf, uv); 
#endif
        printf("%"UVuf"\t->\t%d",i,tmpbuf[0]);
        u8len = 0;
        for (j = 1; j < UTF8_MAXLEN; j++) {
            if (tmpbuf[j] != 0) {
                printf(".%d",tmpbuf[j]);
                u8len++;
            }
            else { 
                j = UTF8_MAXLEN; 
            }
        }
        u8len++; 
        cpybuf = tmpbuf;
        uvlen = &u8len;

        uvr = utf8_to_uv( cpybuf, u8len , uvlen, flags );  
/*
   Unfortunately, the sizeof(unsigned int) == sizeof(unsigned long) == 4 
   on linux-ppc, gccversion='egcs-2.91.66 19990314 (egcs-1.1.2 release)'
   but printf with a %lu format generates warnings under -Wall for
   the STRLEN entries, whereas the %u format does not.
 */
#ifdef UVuf_OK
        printf(" (%"UVuf")\t->\t%ld (%"UVuf")\t->\t%ld\n",u8len,uvr,*uvlen,uv); 
#else
# ifdef U64TYPE
        printf(" (%u)\t->\t%ld (%u)\t->\t%"UVuf"\n",u8len,uvr,*uvlen,(U64TYPE)uv); 
# else
        printf(" (%u)\t->\t%ld (%u)\t->\t%ld\n",u8len,uvr,*uvlen,uv);
# endif
#endif

        /* ensure that all are reset to zero so as to test each code point
           we aren't in any hurry btw 
         */
        for (j = 0; j < UTF8_MAXLEN; j++) {
                tmpbuf[j] = 0;
        }

    }

    return(0);

}
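
The start/stop pairs listed in the header comment of utf8_tbl.c mostly
straddle the points where a plain UTF-8 encoding grows by one byte.  The
following sketch, which assumes the original 1 to 6 byte scheme covering 31
bits and uses a made-up function name utf8_length, shows those boundaries
without needing perl at all.  The 255/256 pair does not change the length;
it marks the top of the single-byte range that the translation tables cover.

```c
/* Number of bytes a code point needs in the original 1..6 byte UTF-8
 * scheme.  Returns 0 for values above 0x7FFFFFFF, which this sketch
 * does not model. */
int utf8_length(unsigned long cp)
{
    if (cp < 0x80UL)        return 1;  /* up to 127        */
    if (cp < 0x800UL)       return 2;  /* up to 2047       */
    if (cp < 0x10000UL)     return 3;  /* up to 65535      */
    if (cp < 0x200000UL)    return 4;  /* up to 2097151    */
    if (cp < 0x4000000UL)   return 5;  /* up to 67108863   */
    if (cp <= 0x7FFFFFFFUL) return 6;  /* up to 2147483647 */
    return 0;
}
```

For example utf8_length(2047) is 2 and utf8_length(2048) is 3, matching the
`./utf8_tbl 2047 2048` entry in the list.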

Here for example was the modification made to perl's extracted Makefile 
that allowed compiling and linking utf8_tbl:

--- Makefile.orig	Tue Dec  5 12:52:57 2000
+++ Makefile	Tue Dec  5 12:53:11 2000
@@ -186,6 +186,9 @@
 perlmain$(OBJ_EXT): perlmain.c
 	$(CCCMD) $(PLDLFLAGS) $*.c
 
+utf8_tbl$(OBJ_EXT): utf8_tbl.c
+	$(CCCMD) $(PLDLFLAGS) $*.c
+
 # The file ext.libs is a list of libraries that must be linked in
 # for static extensions, e.g. -lm -lgdbm, etc.  The individual
 # static extension Makefile's add to it.
@@ -214,6 +217,9 @@
 
 perl: $& perlmain$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
 	$(SHRPENV) $(LDLIBPTH) $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o perl perlmain$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
+
+utf8_tbl: $& utf8_tbl$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
+	$(SHRPENV) $(LDLIBPTH) $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o utf8_tbl utf8_tbl$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
 
 pureperl: $& perlmain$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
 	$(SHRPENV) $(LDLIBPTH) purify $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o pureperl perlmain$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
End of diff to Makefile hack

Here are the head and tail of a default run on OS/390 (default meaning Start at 
0 and proceed to 255):

$ ./utf8_tbl | head
Start: 0        Stop: 255
i       ->      utf8 (u8len)    ->      uv_ret (uvlen)  ->      uv_cast
0       ->      0 (1)   ->      0 (1)   ->      0
1       ->      1 (1)   ->      1 (1)   ->      1
2       ->      2 (1)   ->      2 (1)   ->      2
3       ->      3 (1)   ->      3 (1)   ->      3
4       ->      55 (1)  ->      4 (1)   ->      4
5       ->      45 (1)  ->      5 (1)   ->      5
6       ->      46 (1)  ->      6 (1)   ->      6
7       ->      47 (1)  ->      7 (1)   ->      7
CEE5213S The signal SIGPIPE was received.
$ ./utf8_tbl | tail
246     ->      195.140 (2)     ->      246 (2) ->      246
247     ->      195.161 (2)     ->      247 (2) ->      247
248     ->      112 (1) ->      248 (1) ->      248
249     ->      195.157 (2)     ->      249 (2) ->      249
250     ->      195.158 (2)     ->      250 (2) ->      250
251     ->      195.155 (2)     ->      251 (2) ->      251
252     ->      195.156 (2)     ->      252 (2) ->      252
253     ->      194.141 (2)     ->      253 (2) ->      253
254     ->      194.142 (2)     ->      254 (2) ->      254
255     ->      195.159 (2)     ->      255 (2) ->      255
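
The dotted pairs in the utf8 column are decimal byte values.  To read them,
a two-byte sequence can be decoded by hand or with a sketch like the one
below (utf8_decode2 is a made-up name, and this is plain UTF-8 decoding, not
perl's utf8_to_uv).  Decoding 195.140 from the 246 row gives 204, which is
the entry at index 246 of the first table in the perl.h diff; the uv_ret
column shows 246 again because the patched decoder maps 204 back through the
second table.

```c
/* Decode a two-byte UTF-8 sequence given as two byte values.
 * Returns the code point, or -1 if the bytes are not a well-formed,
 * shortest-form two-byte sequence. */
long utf8_decode2(unsigned char b0, unsigned char b1)
{
    long cp;

    if ((b0 & 0xE0) != 0xC0)   /* lead byte must look like 110xxxxx */
        return -1;
    if ((b1 & 0xC0) != 0x80)   /* continuation must look like 10xxxxxx */
        return -1;
    cp = ((long)(b0 & 0x1F) << 6) | (long)(b1 & 0x3F);
    if (cp < 0x80)             /* overlong form of a one-byte value */
        return -1;
    return cp;
}
```

Likewise utf8_decode2(194, 141) yields 141 and utf8_decode2(195, 159) yields
223, the values behind the 253 and 255 rows.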

Peter Prymmer




