
Status of z/OS/EBCDIC

From:       Karl Williamson
Date:       April 20, 2013 04:22
Subject:    Status of z/OS/EBCDIC
Message ID: 517217D2.6050208@khwilliamson.com
tl;dr

86% of non-CPAN tests currently pass completely.  A single bug (or a
group of related bugs) appears to be responsible for most of the
remaining failures: the code to emit warnings does not work properly,
causing many of the tests that deal with warnings to fail.

=================

John Goodyear and I have been working on getting EBCDIC support in blead 
to work again.  He is going to be AFK for a while, so I think this is a 
good time to give a status report.

IBM has enhanced z/OS so that libraries can be written that work on both 
EBCDIC and ASCII input/output. As a separate effort, John has tried to 
get that working, but it isn't as easy as the documentation makes it 
appear, and so far, he hasn't been able to get Perl to compile.  If he 
were able, it would mean that Perl could run on modern z/OS without 
requiring EBCDIC support, which would mean that all of CPAN would work.
As it is, many CPAN modules won't work on an EBCDIC system.  For
instance, he hasn't been able to get Test::Smoke to work.

In our EBCDIC porting work, I changed things to skip the upstream-cpan
module tests.  This is because they produced lots of failures that
generated lots of error messages, and we really have little say in
whether these modules should work under EBCDIC or not.  Indeed, I
looked at the JSON code, and found it was heavily reliant on 
Unicode/ASCII, and then I looked at its spec, and found that it was 
defined in terms of Unicode.  Without thinking about it very much, I 
don't know how it should even work on EBCDIC platforms, and it appears 
to me that the amount of effort involved in getting it to work is quite 
large, more than I am willing to undertake, even if its maintainers were 
amenable to that.  Similarly, I don't know what to do about http/xml 
code that should interface with ASCII platforms, and I don't know about 
digest/compression code either.  Turning off testing of the whole group
seemed to me the best course until we get all the core tests passing.
Our pass rate went up 3-4% as a result of skipping CPAN.

That pass rate now stands at 86%.  It would be quite a bit higher, I 
believe, except for what appears to be one bug or one group of related 
bugs in the warnings subsystem.  Just yesterday, before John left, we 
did a debugging session to try to find it, and I discovered it was very 
different from what I expected, and in an area that I don't know much 
about.  Things like this macro don't work:

#define isLEXWARN_on 	(PL_curcop->cop_warnings != pWARN_STD)
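/* pWARN_STD is the default state (fall back to $^W), so this macro
 * should be true whenever the current scope has its own lexical
 * warnings bitmask */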

Lexical warnings appear to be ineffective.  I'll have to familiarize
myself with that code; if something jumps out at you, ideas are welcome.
I had expected an endian problem or some such, but the bit masks of
warnings appear to be manipulated properly; it is the basic tests of
whether lexical warnings are in effect at all that don't work.
Many tests in our suite check that warnings come out as expected; most
of these tests now fail because of this bug (or bugs), but would
otherwise pass.
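To make the failure mode concrete, here is a minimal, hypothetical
example of the kind of check involved (not an actual test from the
suite); on z/OS at the moment, the second branch is what we see:

use strict;
use warnings 'numeric';   # lexical warnings: enables "isn't numeric"

my @got;
$SIG{__WARN__} = sub { push @got, $_[0] };

my $n = "abc" + 1;   # should warn: Argument "abc" isn't numeric in addition

print @got
    ? "warning seen, as expected\n"
    : "no warning; the lexical-warnings checks are misfiring\n";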

One of the issues we had to address is that the Perl source is now
delivered in multiple encodings, not just 8859-1.  Many of the files are
in UTF-8;
some are in other encodings like Latin2, and some in binary.  leont++ 
just changed some files in blead away from being East Asian encodings to 
UTF-8.  A few files contain multiple encodings.  We found no easy way
to convert a UTF-8 encoded file into something that worked on EBCDIC,
so I wrote a little miniperl utility that does that, while leaving
binary files and files in other encodings unchanged.  (The non-ASCII
portions of .c and .h files are in comments, so we can compile miniperl
even if those don't get translated properly.)
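For illustration, here is a stripped-down sketch of the conversion such
a utility has to perform.  The helper name is hypothetical, it handles
only one- and two-byte UTF-8 sequences, and it assumes the
utf8::unicode_to_native() builtin described in perldoc utf8 is
available; the real utility is more involved:

sub utf8a_octets_to_native {   # hypothetical helper
    my @bytes = unpack "C*", shift;   # ASCII-platform UTF-8 octets
    my $native = "";
    while (@bytes) {
        my $b  = shift @bytes;
        # Decode one UTF-8 sequence to a Unicode code point
        # (1- and 2-byte sequences only, which covers Latin-1 text).
        my $cp = $b < 0x80
               ? $b
               : ((($b & 0x1F) << 6) | (shift(@bytes) & 0x3F));
        # Map the Unicode code point to the platform's native code
        # point; this is a no-op on ASCII platforms.
        $native .= chr utf8::unicode_to_native($cp);
    }
    utf8::encode($native);   # characters -> native UTF-8/UTF-EBCDIC octets
    return $native;
}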

Most of the bugs found so far have been in the tests.  The core worked 
mostly correctly, but the tests were checking for ASCII-centric results.
A typical thing would be to expect that the two-byte sequence "\xc4\x80"
is the internal representation of chr(0x100), or that \xDF is LATIN SMALL
LETTER SHARP S, or that \x0A or \12 or 10 is LINE FEED.  It's slow going
finding and fixing all of these, and you never know whether a failure is
just another issue specific to that .t, or something in the core.  I've added
infrastructure code to t/test.pl to make it reasonably simple for the .t 
to work on both ASCII and EBCDIC.  For example t/test.pl now has the 
function byte_utf8a_to_utf8n() which takes a sequence of bytes in ASCII 
UTF-8 and returns the native equivalent.  So 
byte_utf8a_to_utf8n("\xc4\x80") yields the byte sequence for chr(0x100) on
the current platform, whatever it is.  (This happens to be "\x8C\x41" on 
John's 1047 EBCDIC code page.)
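A .t file can then check the internal form portably; something along
these lines (illustrative, not a verbatim test from the suite; is() is
t/test.pl's equivalent of Test::More's is()):

my $s = chr 0x100;
utf8::encode($s);   # expose the platform's internal byte sequence

is($s, byte_utf8a_to_utf8n("\xc4\x80"),
   "chr(0x100) encodes to the expected native bytes");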

One change that I have made concerns toCTRL('?').  On ASCII machines this
yields the DEL character.  Until now, on EBCDIC machines, this yielded 
the non-control '"', a double quote.  There is code that assumes it will 
always be a control, so it really should map to a real control.  We 
can't have it map to DEL, as that is already toCTRL('G') on EBCDIC 
machines, and toCTRL(<DEL>) yields 'G'.  Thus '?' needs to map to a 
different control.  I chose the control character on EBCDIC that isn't 
in the block of the other controls, just as DEL isn't in the block of C0 
controls on ASCII systems.  That control is APC, which on the 1047 and 
0037 code pages is 0xFF, and is often called EIGHT ONES on those systems.
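Concretely, the "\c?" string escape goes through this mapping; on an
ASCII platform it already gives DEL, and under this change an EBCDIC
1047/0037 platform would be expected to give APC (illustrative):

printf "0x%02X\n", ord "\c?";
# ASCII platform:                        0x7F  (DEL)
# EBCDIC 1047/0037, under this change:   0xFF  (APC, "EIGHT ONES")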

One change that I haven't made, but think advisable, is to the 'u'
format in pack/unpack, i.e., uuencode and uudecode.  It seems to me that this
format should be the same no matter what the source platform is, so that 
an EBCDIC machine could exchange information easily with an ASCII one. 
Thus I think the 'u' format should transform into the precise bit 
patterns that would happen on an ASCII platform.  I haven't looked
closely, but it appears that we go to a lot of trouble generating
uudmap.h, all of which could be avoided by just converting to ASCII as
part of the 'u' format; that would also allow transparent communication
between machines with different character sets.
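For reference, the format in question; the proposal is that $encoded
below consist of identical bit patterns on every platform:

my $encoded = pack   "u", "Hello, world!\n";   # uuencode
my $decoded = unpack "u", $encoded;            # decode back

# Today this round-trips on any single platform; making the encoded
# form ASCII-based everywhere would let EBCDIC and ASCII machines
# exchange it directly, and would make generating uudmap.h unnecessary.
print $decoded;   # Hello, world!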


