ebcdic <-> ascii tables interjected in uv <-> utf8 considered harmful
From: Peter Prymmer
Date: December 5, 2000 17:56
Subject: ebcdic <-> ascii tables interjected in uv <-> utf8 considered harmful
Message ID: Pine.OSF.4.10.10012051757390.152418-100000@aspara.forte.com
Executive summary: it does not work (fortunately ?)
In light of the discussion from a week and a half ago I've gone ahead and
tested a perl@7979.tgz kit on OS/390 V2R5 with a partial implementation
of what might have been a way to put ebcdic<->iso-8859-1 into the uv<->utf8
converter in utf8.c (which is AFAIK the only place where the uv<->utf8
algorithms lie within perl's guts - corrections welcome on this).
(The implementation was partial in that I did not yet bother to include the
tables necessary for the POSIX-BC or CP 0037 EBCDIC coded character sets'
translation to and from ISO 8859-1.) But it does not seem to help; indeed it
does harm, as gauged by runs of `make test`.
One of the big reasons we could suspect that such tables might be necessary is
that the characters in the range 128..255 map to two bytes in the UTF-8
representation, yet many ordinary printing chars (both cases of the alphabet,
the digits, and some punctuation) have the 8th bit set on EBCDIC platforms.
Hence to the extent that Perl's internal UTF-8 string handling cares about
things such as the width of 'A' it could prove important to do such
translations. A generalization of this suspicion is that Perl actually cares
what characters lie in the 128..255 range, which might imply that to get Perl
to behave properly in a non-ISO 8859-1 locale (say, an Eastern European one
running under ISO 8859-2) we might need a locale-sensitive way to carry out the
uv <-> utf8 mappings. However, the failure of the IBM CP 1047 <-> ISO 8859-1
mappings that I tested implies that we probably do not have to worry about
locale sensitivity training. "Failure" in this context means a failure to
improve the `make test` results (indeed, more tests failed). So apparently
Perl does not really care what the character interpretation of the 128..255
characters is.
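To make the width concern concrete, here is a minimal sketch of the standard
one- and two-byte UTF-8 rules. It is not perl's utf8.c, and the name
utf8_encode_small is invented for illustration. ASCII 'A' is 65 and stays one
byte, whereas IBM-1047 'A' is 193 (per the table in the diff further down)
and, taken as a raw value, becomes the two bytes 195.129.

```c
#include <stddef.h>

/* Encode a value below 0x800 using the standard UTF-8 rules.
 * Returns the number of bytes written to out (1 or 2), or 0 if the
 * value is out of range for this sketch.
 * Hypothetical helper for illustration, not a perl API. */
size_t utf8_encode_small(unsigned long cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* 0..127: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: lead byte + continuation */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;                        /* longer forms omitted here */
}
```

Calling utf8_encode_small(65, buf) returns 1, while
utf8_encode_small(193, buf) returns 2 with buf holding 195 and 129.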
Here were the `make test` results:
-----------------------------------
perl@7979:
Failed 16 test scripts out of 258, 93.80% okay.
perl@7979 + enclosed patch:
Failed 20 test scripts out of 258, 92.25% okay.
perl@7979 + enclosed patch + reversion of lib/utf8.pm to op:
Failed 20 test scripts out of 258, 92.25% okay.
-----------------------------------
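For anyone wanting to check the percentages, they are simply the share of
scripts that did not fail. A small sketch (the function name percent_okay is
made up here; this is not how the test harness itself is written):

```c
#include <stdio.h>

/* Write the percentage of passing scripts into buf with two decimals,
 * e.g. 16 failures out of 258 gives "93.80".
 * Returns whatever snprintf returns. Illustrative helper only. */
int percent_okay(int failed, int total, char *buf, size_t buflen)
{
    double pct = 100.0 * (double)(total - failed) / (double)total;
    return snprintf(buf, buflen, "%.2f", pct);
}
```

With 258 scripts, 16 failures gives 93.80 and 20 failures gives 92.25, so the
four additional failing scripts account for the 1.55 point drop.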
So the patch actually made more tests fail. Here are excerpts from
the diff of the output logs (generated with `make test > make.out 2>&1`)
-----------------------
$ diff -u perl@7979/make.out perl@7979.patched/make.out
[snip]
-comp/require.........String found where operator expected at bleah.pm line 2,
"
- (Might be a runaway multi-line "" string starting on line 1)
- (Missing semicolon on previous line?)
-String found where operator expected at bleah.pm line 1, near "BpBrBiBnBt
"BoB"
+comp/require.........String found where operator expected at bleah.pm line 1,
"
(Do you need to predeclare BpBrBiBnBt?)
String found where operator expected at bleah.pm line 1, near "BpBrBiBnBt
"BoB"
(Do you need to predeclare BpBrBiBnBt?)
FAILED at test 13
[snip]
-comp/use.............ok
+comp/use.............FAILED at test 16
[snip]
-op/append............ok
+op/append............FAILED at test 8
[snip]
-op/bop...............FAILED at test 22
+op/bop...............FAILED at test 37
[snip]
-op/tr................ok
+op/tr................CEE5213S The signal SIGPIPE was received.
+FAILED at test 16
[snip]
-op/vec...............ok
+op/vec...............FAILED at test 27
[snip]
-lib/charnames........FAILED at test 12
+lib/charnames........FAILED at test 11
-----------------------
The extra failures in op/tr and op/vec for the patched perl are clearly
coded character set failures, but I think that they indicate a failure
of this patch's approach since the special ebcdic tests are still passing.
Note this result with the patched binary:
$ ./perl op/tr.t
1..29
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
not (100.125.60) ok 13
not (100.125.60) ok 14
ok 15
not ok 16
[snip]
FWIW here was the patch that I had tested. I know I can identify
OS/390 with the __MVS__ macro and BS/2000 with the _OSD_POSIX macro
but I do not know what the system identifier preprocessor macros for
VM/ESA, VSE/ESA, or OS/400 are. I asked what they were on an as400
newsgroup (asking for personal email response since my news access is spotty)
about a year ago and never received any response. Hence the test of
the value of the '^' character on EBCDIC platforms:
diff -ru perl.7979.orig/perl.h perl/perl.h
--- perl.7979.orig/perl.h Sun Dec 3 17:57:42 2000
+++ perl/perl.h Mon Dec 4 17:06:56 2000
@@ -2405,6 +2405,80 @@
#ifdef DOINIT
#ifdef EBCDIC
+#if '^' == 106 /* if defined(_OSD_POSIX) POSIX-BC */
+#endif /* POSIX-BC */
+#if '^' == 176 /* if defined(??) (OS/400?) 037 */
+#endif /* 037 */
+#if '^' == 95 /* if defined(__MVS__) || defined(??) (VM/ESA?) 1047 */
+EXT unsigned char PL_e2a[] = { /* ASCII (ISO8859-1) to EBCDIC (IBM-1047) */
+ 0, 1, 2, 3, 55, 45, 46, 47,
+ 22, 5, 21, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 60, 61, 50, 38,
+ 24, 25, 63, 39, 28, 29, 30, 31,
+ 64, 90, 127, 123, 91, 108, 80, 125,
+ 77, 93, 92, 78, 107, 96, 75, 97,
+ 240, 241, 242, 243, 244, 245, 246, 247,
+ 248, 249, 122, 94, 76, 126, 110, 111,
+ 124, 193, 194, 195, 196, 197, 198, 199,
+ 200, 201, 209, 210, 211, 212, 213, 214,
+ 215, 216, 217, 226, 227, 228, 229, 230,
+ 231, 232, 233, 173, 224, 189, 95, 109,
+ 121, 129, 130, 131, 132, 133, 134, 135,
+ 136, 137, 145, 146, 147, 148, 149, 150,
+ 151, 152, 153, 162, 163, 164, 165, 166,
+ 167, 168, 169, 192, 79, 208, 161, 7,
+ 32, 33, 34, 35, 36, 37, 6, 23,
+ 40, 41, 42, 43, 44, 9, 10, 27,
+ 48, 49, 26, 51, 52, 53, 54, 8,
+ 56, 57, 58, 59, 4, 20, 62, 255,
+ 65, 170, 74, 177, 159, 178, 106, 181,
+ 187, 180, 154, 138, 176, 202, 175, 188,
+ 144, 143, 234, 250, 190, 160, 182, 179,
+ 157, 218, 155, 139, 183, 184, 185, 171,
+ 100, 101, 98, 102, 99, 103, 158, 104,
+ 116, 113, 114, 115, 120, 117, 118, 119,
+ 172, 105, 237, 238, 235, 239, 236, 191,
+ 128, 253, 254, 251, 252, 186, 174, 89,
+ 68, 69, 66, 70, 67, 71, 156, 72,
+ 84, 81, 82, 83, 88, 85, 86, 87,
+ 140, 73, 205, 206, 203, 207, 204, 225,
+ 112, 221, 222, 219, 220, 141, 142, 223
+};
+EXT unsigned char PL_a2e[] = { /* EBCDIC (IBM-1047) to ASCII (ISO8859-1) */
+ 0, 1, 2, 3, 156, 9, 134, 127,
+ 151, 141, 142, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 157, 10, 8, 135,
+ 24, 25, 146, 143, 28, 29, 30, 31,
+ 128, 129, 130, 131, 132, 133, 23, 27,
+ 136, 137, 138, 139, 140, 5, 6, 7,
+ 144, 145, 22, 147, 148, 149, 150, 4,
+ 152, 153, 154, 155, 20, 21, 158, 26,
+ 32, 160, 226, 228, 224, 225, 227, 229,
+ 231, 241, 162, 46, 60, 40, 43, 124,
+ 38, 233, 234, 235, 232, 237, 238, 239,
+ 236, 223, 33, 36, 42, 41, 59, 94,
+ 45, 47, 194, 196, 192, 193, 195, 197,
+ 199, 209, 166, 44, 37, 95, 62, 63,
+ 248, 201, 202, 203, 200, 205, 206, 207,
+ 204, 96, 58, 35, 64, 39, 61, 34,
+ 216, 97, 98, 99, 100, 101, 102, 103,
+ 104, 105, 171, 187, 240, 253, 254, 177,
+ 176, 106, 107, 108, 109, 110, 111, 112,
+ 113, 114, 170, 186, 230, 184, 198, 164,
+ 181, 126, 115, 116, 117, 118, 119, 120,
+ 121, 122, 161, 191, 208, 91, 222, 174,
+ 172, 163, 165, 183, 169, 167, 182, 188,
+ 189, 190, 221, 168, 175, 93, 180, 215,
+ 123, 65, 66, 67, 68, 69, 70, 71,
+ 72, 73, 173, 244, 246, 242, 243, 245,
+ 125, 74, 75, 76, 77, 78, 79, 80,
+ 81, 82, 185, 251, 252, 249, 250, 255,
+ 92, 247, 83, 84, 85, 86, 87, 88,
+ 89, 90, 178, 212, 214, 210, 211, 213,
+ 48, 49, 50, 51, 52, 53, 54, 55,
+ 56, 57, 179, 219, 220, 217, 218, 159
+};
+#endif /* 1047 */
EXT unsigned char PL_fold[] = { /* fast EBCDIC case folding table */
0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15,
@@ -2477,6 +2551,10 @@
#endif /* !EBCDIC */
#else
EXTCONST unsigned char PL_fold[];
+#ifdef EBCDIC
+EXTCONST unsigned char PL_e2a[];
+EXTCONST unsigned char PL_a2e[];
+#endif /* EBCDIC */
#endif
#ifdef DOINIT
diff -ru perl.7979.orig/utf8.c perl/utf8.c
--- perl.7979.orig/utf8.c Sun Dec 3 11:47:54 2000
+++ perl/utf8.c Mon Dec 4 17:06:57 2000
@@ -29,6 +29,10 @@
U8 *
Perl_uv_to_utf8(pTHX_ U8 *d, UV uv) /* the d must be UTF8_MAXLEN+1 deep */
{
+#ifdef EBCDIC
+ if (uv <= 0xff)
+ uv = (UV)PL_e2a[uv];
+#endif
if (uv < 0x80) {
*d++ = uv;
*d = 0;
@@ -218,7 +222,11 @@
if (uv <= 0x7f) { /* Pure ASCII. */
if (retlen)
*retlen = 1;
+#ifdef EBCDIC
+ return (UV)PL_a2e[*s];
+#else
return *s;
+#endif
}
if ((uv >= 0x80 && uv <= 0xbf) &&
@@ -326,7 +334,14 @@
goto malformed;
}
+#ifdef EBCDIC
+ if (uv <= 0xff)
+ return (UV)PL_a2e[uv];
+ else
+ return uv;
+#else
return uv;
+#endif
malformed:
End of "diff" - not a patch.
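The '^' test used in the perl.h hunk above can also be written as an ordinary
function, which is handy for checking at run time what a given compiler
thinks the execution character set is. This is only a sketch: the function
name is invented, the three EBCDIC values are the ones from the #if lines
above, and 94 is the ASCII value of '^'.

```c
/* Map the numeric value of '^' to the coded character set it implies.
 * The EBCDIC cases mirror the #if tests in the perl.h hunk; anything
 * else is reported as "unknown". Hypothetical helper, not part of perl. */
const char *charset_from_caret(int caret)
{
    switch (caret) {
    case 94:  return "ASCII";     /* ASCII / ISO 8859-1 */
    case 95:  return "1047";      /* OS/390 (and VM/ESA?) */
    case 106: return "POSIX-BC";  /* BS/2000 */
    case 176: return "037";       /* OS/400? */
    default:  return "unknown";
    }
}
```

Passing the character constant itself, as in charset_from_caret('^'), reports
the character set the program was compiled for.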
And as a bonus here is a little program that can be used to generate utf8
tables on ascii or ebcdic machines by explicitly testing perl's utf8
transformation algorithms. It can be compiled and linked just like the
perl binary is:
/*
Date: Mon, 27 Nov 2000 18:50:50 -0800 (PST)
From: Peter Prymmer <pvhp>
Subject: utf8_tbl.c
Particularly interesting entries include:
./utf8_tbl 127 128
./utf8_tbl 255 256
./utf8_tbl 2047 2048
./utf8_tbl 65535 65536
./utf8_tbl 2097151 2097152
./utf8_tbl 67108863 67108864
./utf8_tbl 2147483647 2147483648 # warning: atoi cannot fit latter number
# into UVTYPE (unsigned long) on linux-ppc
# also: UVTYPE is unsigned long on VMS
# On either linux-ppc or VMS
# (UVTYPE)2147483648 -> (maps to) -> -2147483648
# (UVTYPE)2147483649 -> (maps to) -> -2147483647
# On OS/390 there is no unsigned long long and no
# strtoull but there is strtoul and it
# automatically converts 2147483648 and higher to
# 2147483647.
*/
#include <stdlib.h> /* strtoull, strtoul, or atoi */
#include <stdio.h> /* printf */
#include "embed.h"
#include "EXTERN.h"
#include "perl.h"
int main(int argc, char *argv[])
{
    /* unsigned long long is typical see `grep U64TYPE config.h` */
#if defined(U64TYPE) && ! defined(__MVS__)
    U64TYPE start = 0;
    U64TYPE stop = 255;
    U64TYPE i;
# ifdef UVuf
# undef UVuf /* "lu" is typical */
# define UVuf "Lu" /* ok for gcc on linux-ppc, DECC on VMS */
# endif
    U64TYPE uv;
#else
    UVTYPE start = 0;
    UVTYPE stop = 255;
    UVTYPE i;
    UV uv;
#endif
    int j;
    U8 * d;
    UV uvr = 0;
    STRLEN u8len, *uvlen;
    /* U32 flags = 0; */
    /* U32 flags = UTF8_CHECK_ONLY; */
    U32 flags = UTF8_ALLOW_ANY ; /* */
    /* uv_to_utf8 wants its buffer to be UTF8_MAXLEN+1 deep */
    U8 tmpbuf[UTF8_MAXLEN+1] = {0,0,0,0,0,0,0,0,0,0,0,0,0};
    /* "0 1 2 3 4 5 6 7 8 9 A B C"; */
    U8 *cpybuf;

    if (argc == 2) {
#ifdef HAS_STRTOULL
        stop = strtoull(argv[1],(char **)NULL,10);
#else
# ifdef HAS_STRTOUL
        stop = strtoul(argv[1],(char **)NULL,10);
# else
        stop = atoi(argv[1]);
# endif
#endif
    }
    if (argc == 3) {
        /* same fallback chain as the one-argument case */
#ifdef HAS_STRTOULL
        start = strtoull(argv[1],(char **)NULL,10);
        stop = strtoull(argv[2],(char **)NULL,10);
#else
# ifdef HAS_STRTOUL
        start = strtoul(argv[1],(char **)NULL,10);
        stop = strtoul(argv[2],(char **)NULL,10);
# else
        start = atoi(argv[1]);
        stop = atoi(argv[2]);
# endif
#endif
    }
    if (stop < start) {
        printf("Usage:\n\t%s (0..255)\n\t%s stop (0..stop)\n\t%s start stop\n",
               argv[0],argv[0],argv[0]);
        return(1);
    }
    printf("Start: %"UVuf"\tStop: %"UVuf"\n",start,stop);
    printf("i\t->\tutf8 (u8len)\t->\tuv_ret (uvlen)\t->\tuv_cast\n");
    for (i = start; i <= stop; i++) {
#ifdef U64TYPE
        uv = (U64TYPE) i;
        d = uv_to_utf8( tmpbuf, (UV)uv);
#else
        uv = (UV) i;
        d = uv_to_utf8( tmpbuf, uv);
#endif
        printf("%"UVuf"\t->\t%d",i,tmpbuf[0]);
        /* print and count the remaining bytes up to the first zero byte */
        u8len = 0;
        for (j = 1; j < UTF8_MAXLEN; j++) {
            if (tmpbuf[j] != 0) {
                printf(".%d",tmpbuf[j]);
                u8len++;
            }
            else {
                j = UTF8_MAXLEN;
            }
        }
        u8len++;
        cpybuf = tmpbuf;
        uvlen = &u8len;
        uvr = utf8_to_uv( cpybuf, u8len , uvlen, flags );
        /*
           Unfortunately, the sizeof(unsigned int) == sizeof(unsigned long) == 4
           on linux-ppc, gccversion='egcs-2.91.66 19990314 (egcs-1.1.2 release)'
           but printf with a %lu format generates warnings under -Wall for
           the STRLEN entries, whereas the %u format does not.
        */
#ifdef UVuf_OK
        printf(" (%"UVuf")\t->\t%ld (%"UVuf")\t->\t%ld\n",u8len,uvr,*uvlen,uv);
#else
# ifdef U64TYPE
        printf(" (%u)\t->\t%ld (%u)\t->\t%"UVuf"\n",u8len,uvr,*uvlen,(U64TYPE)uv);
# else
        printf(" (%u)\t->\t%ld (%u)\t->\t%ld\n",u8len,uvr,*uvlen,uv);
# endif
#endif
        /* ensure that all are reset to zero so as to test each code point
           we aren't in any hurry btw
        */
        for (j = 0; j < UTF8_MAXLEN; j++) {
            tmpbuf[j] = 0;
        }
    }
    return(0);
}
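The start/stop pairs listed as "particularly interesting" in the header
comment of utf8_tbl.c straddle the points where the encoded length grows by
one byte (the 255/256 pair is the exception; it marks the edge of the
256-entry translation tables instead). A rough sketch of those thresholds,
assuming the original UTF-8 scheme of up to six bytes for values below 2^31;
the function name is invented, and anything perl does for larger values is
deliberately left out:

```c
/* Number of bytes the original (up to six byte) UTF-8 scheme needs for uv.
 * Returns 0 for values of 2^31 and above, which this sketch does not cover.
 * Illustrative only; not taken from perl's utf8.c. */
int utf8_len_sketch(unsigned long uv)
{
    if (uv < 0x80UL)       return 1;  /* up to 127 */
    if (uv < 0x800UL)      return 2;  /* up to 2047 */
    if (uv < 0x10000UL)    return 3;  /* up to 65535 */
    if (uv < 0x200000UL)   return 4;  /* up to 2097151 */
    if (uv < 0x4000000UL)  return 5;  /* up to 67108863 */
    if (uv < 0x80000000UL) return 6;  /* up to 2147483647 */
    return 0;
}
```

Each pair from the comment, such as 2047 and 2048, lands on either side of one
of these comparisons.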
Here for example was the modification made to perl's extracted Makefile
that allowed compiling and linking utf8_tbl:
--- Makefile.orig Tue Dec 5 12:52:57 2000
+++ Makefile Tue Dec 5 12:53:11 2000
@@ -186,6 +186,9 @@
perlmain$(OBJ_EXT): perlmain.c
$(CCCMD) $(PLDLFLAGS) $*.c
+utf8_tbl$(OBJ_EXT): utf8_tbl.c
+ $(CCCMD) $(PLDLFLAGS) $*.c
+
# The file ext.libs is a list of libraries that must be linked in
# for static extensions, e.g. -lm -lgdbm, etc. The individual
# static extension Makefile's add to it.
@@ -214,6 +217,9 @@
perl: $& perlmain$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
$(SHRPENV) $(LDLIBPTH) $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o perl perlmain$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
+
+utf8_tbl: $& utf8_tbl$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
+ $(SHRPENV) $(LDLIBPTH) $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o utf8_tbl utf8_tbl$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
pureperl: $& perlmain$(OBJ_EXT) $(LIBPERL) $(DYNALOADER) $(static_ext) ext.libs $(PERLEXPORT)
$(SHRPENV) $(LDLIBPTH) purify $(CC) $(CLDFLAGS) $(CCDLFLAGS) -o pureperl perlmain$(OBJ_EXT) $(DYNALOADER) $(static_ext) $(LLIBPERL) `cat ext.libs` $(libs)
End of diff to Makefile hack
Here are the head and tail of a default run on OS/390 (default meaning Start at
0 and proceed to 255):
$ ./utf8_tbl | head
Start: 0 Stop: 255
i -> utf8 (u8len) -> uv_ret (uvlen) -> uv_cast
0 -> 0 (1) -> 0 (1) -> 0
1 -> 1 (1) -> 1 (1) -> 1
2 -> 2 (1) -> 2 (1) -> 2
3 -> 3 (1) -> 3 (1) -> 3
4 -> 55 (1) -> 4 (1) -> 4
5 -> 45 (1) -> 5 (1) -> 5
6 -> 46 (1) -> 6 (1) -> 6
7 -> 47 (1) -> 7 (1) -> 7
CEE5213S The signal SIGPIPE was received.
$ ./utf8_tbl | tail
246 -> 195.140 (2) -> 246 (2) -> 246
247 -> 195.161 (2) -> 247 (2) -> 247
248 -> 112 (1) -> 248 (1) -> 248
249 -> 195.157 (2) -> 249 (2) -> 249
250 -> 195.158 (2) -> 250 (2) -> 250
251 -> 195.155 (2) -> 251 (2) -> 251
252 -> 195.156 (2) -> 252 (2) -> 252
253 -> 194.141 (2) -> 253 (2) -> 253
254 -> 194.142 (2) -> 254 (2) -> 254
255 -> 195.159 (2) -> 255 (2) -> 255
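The rows above can be reproduced by hand: the patched Perl_uv_to_utf8 first
looks the value up in the table the diff calls PL_e2a and then encodes the
result as UTF-8. The sketch below does the same for 240..255 only, using the
last two rows of that table as they appear in the diff (by its own comment
that table runs from ISO 8859-1 to IBM-1047). The function and array names
are invented for illustration.

```c
#include <stddef.h>

/* Entries 240..255 of the first table in the perl.h diff,
 * copied from its last two rows. */
static const unsigned char table_240_255[16] = {
    140,  73, 205, 206, 203, 207, 204, 225,
    112, 221, 222, 219, 220, 141, 142, 223
};

/* Translate i (240..255) through the table slice, then UTF-8 encode
 * the translated value. Returns the number of bytes written to out,
 * or 0 if i is outside the slice. Hypothetical helper, not perl code. */
size_t patched_encode(unsigned int i, unsigned char out[2])
{
    unsigned int t;

    if (i < 240 || i > 255)
        return 0;
    t = table_240_255[i - 240];
    if (t < 0x80) {                       /* translated value fits in one byte */
        out[0] = (unsigned char)t;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (t >> 6));
    out[1] = (unsigned char)(0x80 | (t & 0x3F));
    return 2;
}
```

For 246 the table gives 204, which encodes as 195.140; for 248 it gives 112,
which stays a single byte. Both match the tail of the run shown above.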
Peter Prymmer