
[PATCH: perl@9622] documentation tweaks for UTF-EBCDIC support

From: Prymmer/Kahn
Date: April 9, 2001 00:09
Subject: [PATCH: perl@9622] documentation tweaks for UTF-EBCDIC support
Message ID: Pine.BSF.4.21.0104090006190.21071-100000@shell8.ba.best.com

While it looks like someone has already made a pass at both of these
files, they still appear to need a bit more adjustment.  Most notably,
utf8.pm is no longer a no-op on EBCDIC.

Files modified:

    pod/perlunicode.pod
    lib/utf8.pm

diff -ru perl.9622/lib/utf8.pm perl/lib/utf8.pm
--- perl.9622/lib/utf8.pm	Tue Mar 20 19:34:55 2001
+++ perl/lib/utf8.pm	Sun Apr  8 23:52:24 2001
@@ -25,7 +25,7 @@
 
 =head1 NAME
 
-utf8 - Perl pragma to enable/disable UTF-8 in source code
+utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
 
 =head1 SYNOPSIS
 
@@ -38,9 +38,9 @@
 See L<perlunicode> for the exact details.
 
 The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
-program text in the current lexical scope.  The C<no utf8> pragma
-tells Perl to switch back to treating the source text as literal
-bytes in the current lexical scope.
+program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
+platforms).  The C<no utf8> pragma tells Perl to switch back to treating 
+the source text as literal bytes in the current lexical scope.
 
 This pragma is primarily a compatibility device.  Perl versions
 earlier than 5.6 allowed arbitrary bytes in source code, whereas
@@ -48,9 +48,9 @@
 source text.  Until UTF-8 becomes the default format for source
 text, this pragma should be used to recognize UTF-8 in the source.
 When UTF-8 becomes the standard source format, this pragma will
-effectively become a no-op.  This pragma already is a no-op on
-EBCDIC platforms (where it is alright to code perl in EBCDIC
-rather than UTF-8).
+effectively become a no-op.  For convenience in what follows the
+term UTF-X is used to refer to UTF-8 on ASCII and ISO Latin based
+platforms and UTF-EBCDIC on EBCDIC based platforms.
 
 Enabling the C<utf8> pragma has the following effects:
 
@@ -61,16 +61,18 @@
 Bytes in the source text that have their high-bit set will be treated
 as being part of a literal UTF-8 character.  This includes most literals
 such as identifiers, string constants, constant regular expression patterns
-and package names.
+and package names.  On EBCDIC platforms, characters in the C1 control group 
+and the Latin 1 character set are treated as being part of a literal
+UTF-EBCDIC character.
 
 =item *
 
-In the absence of inputs marked as UTF-8, regular expressions within the
+In the absence of inputs marked as UTF-X, regular expressions within the 
 scope of this pragma will default to using character semantics instead
 of byte semantics.
 
     @bytes_or_chars = split //, $data;	# may split to bytes if data
-					# $data isn't UTF-8
+					# $data isn't UTF-X
     {
 	use utf8;			# force char semantics
 	@chars = split //, $data;	# splits characters
@@ -100,7 +102,7 @@
 
 =item * $flag = utf8::decode($string)
 
-Attempts to converts I<$string> in-place from perl's UTF-X encoding into logical characters.
+Attempts to convert I<$string> in-place from perl's UTF-X encoding into logical characters.
 
 =back
 
diff -ru perl.9622/pod/perlunicode.pod perl/pod/perlunicode.pod
--- perl.9622/pod/perlunicode.pod	Thu Apr  5 19:10:32 2001
+++ perl/pod/perlunicode.pod	Mon Apr  9 00:02:29 2001
@@ -47,7 +47,8 @@
 
 However, as a compatibility measure, this pragma must be explicitly used
 to enable recognition of UTF-8 encoded literals and identifiers in the
-source text.
+source text on ASCII based machines or recognize UTF-EBCDIC encoded literals
+and identifiers on EBCDIC based machines.
 
 =back
 
@@ -55,7 +56,7 @@
 
 Beginning with version 5.6, Perl uses logically wide characters to
 represent strings internally.  This internal representation of strings
-uses the UTF-8 encoding.
+uses either the UTF-8 or the UTF-EBCDIC encoding.
 
 In future, Perl-level operations can be expected to work with characters
 rather than bytes, in general.
@@ -84,7 +85,7 @@
 byte semantics in a particular lexical scope.  See L<bytes>.
 
 The C<utf8> pragma is primarily a compatibility device that enables
-recognition of UTF-8 in literals encountered by the parser.  It may also
+recognition of UTF-(8|EBCDIC) in literals encountered by the parser.  It may also
 be used for enabling some of the more experimental Unicode support features.
 Note that this pragma is only required until a future version of Perl
 in which character semantics will become the default.  This pragma may
@@ -104,6 +105,8 @@
 no difference, because UTF-8 stores ASCII in single bytes, but for
 any character greater than C<chr(127)>, the character may be stored in
 a sequence of two or more bytes, all of which have the high bit set.
+For C1 controls or Latin 1 characters on an EBCDIC platform the character
+may be stored in a UTF-EBCDIC multi byte sequence.
 But by and large, the user need not worry about this, because Perl
 hides it from the user.  A character in Perl is logically just a number
 ranging from 0 to 2**32 or so.  Larger characters encode to longer
@@ -122,9 +125,9 @@
 larger than 255.
 
 Presuming you use a Unicode editor to edit your program, such characters
-will typically occur directly within the literal strings as UTF-8
+will typically occur directly within the literal strings as UTF-(8|EBCDIC)
 characters, but you can also specify a particular character with an
-extension of the C<\x> notation.  UTF-8 characters are specified by
+extension of the C<\x> notation.  UTF-X characters are specified by
 putting the hexadecimal code within curlies after the C<\x>.  For instance,
 a Unicode smiley face is C<\x{263A}>.
 
@@ -233,8 +236,8 @@
 =head1 CAVEATS
 
 As of yet, there is no method for automatically coercing input and
-output to some encoding other than UTF-8.  This is planned in the near
-future, however.
+output to some encoding other than UTF-8 or UTF-EBCDIC.  This is planned 
+in the near future, however.
 
 Whether an arbitrary piece of data will be treated as "characters" or
 "bytes" by internal operations cannot be divined at the current time.
End of Patch.


Peter Prymmer
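
For readers who want to see the character-versus-byte distinction that the
patched documentation describes, here is a minimal sketch.  It assumes a
Perl new enough to provide utf8::encode (5.8 or later) and is not part of
the patch itself; the variable names are illustrative only.  It builds a
string with the \x{...} notation, splits it by character, then splits an
encoded copy by byte.

```perl
use strict;
use warnings;

# Build a string with the \x{...} notation: 'a', U+263A (smiley), 'b'.
my $data = "a\x{263A}b";

# The string holds a character above 255, so split // yields one
# element per logical character.
my @chars = split //, $data;

# utf8::encode converts a copy in place into the bytes of perl's
# internal encoding (UTF-8 on ASCII platforms; assumed to be
# UTF-EBCDIC on EBCDIC platforms, where the byte count may differ).
my $octets = $data;
utf8::encode($octets);
my @bytes = split //, $octets;

printf "%d characters, %d bytes\n", scalar @chars, scalar @bytes;
```

On an ASCII based platform this prints "3 characters, 5 bytes", because
U+263A takes three bytes in UTF-8.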






