[perl.git] branch blead, updated. GitLive-blead-95-gf08e058
From: Rafael Garcia-Suarez
Date: December 26, 2008 14:28
Subject: [perl.git] branch blead, updated. GitLive-blead-95-gf08e058
In perl.git, the branch blead has been updated
<http://perl5.git.perl.org/perl.git/commitdiff/f08e0584288c021de71ecd212ba86a45c8f96a5b?hp=eccdc4d715215b93b6b598d8cf3ac12e323f67e0>
- Log -----------------------------------------------------------------
commit f08e0584288c021de71ecd212ba86a45c8f96a5b
Author: Rafael Garcia-Suarez <rgarciasuarez@gmail.com>
Date: Fri Dec 26 23:27:46 2008 +0100
Regen docs and headers
Necessary after change fe749c9aa803ce74d997ff797103481a55741837
M global.sym
M pod/perlapi.pod
M proto.h
commit 42bde815c4743d7e164d2e70c98a6b86a79906b9
Author: Rafael Garcia-Suarez <rgarciasuarez@gmail.com>
Date: Fri Dec 26 23:27:03 2008 +0100
Fix two pod links
M pod/perlebcdic.pod
M pod/perlunicode.pod
commit fe749c9aa803ce74d997ff797103481a55741837
Author: Karl <khw@karl.(none)>
Date: Fri Dec 26 10:18:34 2008 -0700
Update comments and documentation dealing with utf
M embed.fnc
M lib/charnames.pm
M pod/perlebcdic.pod
M pod/perlhack.pod
M pod/perlunicode.pod
M sv.c
M sv.h
M utfebcdic.h
-----------------------------------------------------------------------
Summary of changes:
embed.fnc | 1 +
global.sym | 1 +
lib/charnames.pm | 5 ++++
pod/perlapi.pod | 18 +++++++++++---
pod/perlebcdic.pod | 23 +++++++++--------
pod/perlhack.pod | 15 ++++++-----
pod/perlunicode.pod | 14 +++++++++-
proto.h | 5 ++++
sv.c | 4 +++
sv.h | 8 +++---
utfebcdic.h | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++-
11 files changed, 129 insertions(+), 29 deletions(-)
diff --git a/embed.fnc b/embed.fnc
index 59a99ea..a926a53 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1257,6 +1257,7 @@ AmdbR |char* |sv_pv |NN SV *sv
AmdbR |char* |sv_pvutf8 |NN SV *sv
AmdbR |char* |sv_pvbyte |NN SV *sv
Amdb |STRLEN |sv_utf8_upgrade|NN SV *sv
+Amdb |STRLEN |sv_utf8_upgrade_nomg|NN SV *sv
ApdM |bool |sv_utf8_downgrade|NN SV *const sv|const bool fail_ok
Apd |void |sv_utf8_encode |NN SV *const sv
ApdM |bool |sv_utf8_decode |NN SV *const sv
diff --git a/global.sym b/global.sym
index fe26578..9598d52 100644
--- a/global.sym
+++ b/global.sym
@@ -663,6 +663,7 @@ Perl_sv_pv
Perl_sv_pvutf8
Perl_sv_pvbyte
Perl_sv_utf8_upgrade
+Perl_sv_utf8_upgrade_nomg
Perl_sv_utf8_downgrade
Perl_sv_utf8_encode
Perl_sv_utf8_decode
diff --git a/lib/charnames.pm b/lib/charnames.pm
index 9f9526b..b8eb2b4 100644
--- a/lib/charnames.pm
+++ b/lib/charnames.pm
@@ -541,6 +541,11 @@ past U+10FFFF you do get a warning.)
=head1 BUGS
+Unicode standard named sequences are not recognized, such as
+C<LATIN CAPITAL LETTER A WITH MACRON AND GRAVE>
+(which should mean C<LATIN CAPITAL LETTER A WITH MACRON> with an additional
+C<COMBINING GRAVE ACCENT>).
+
Since evaluation of the translation function happens in a middle of
compilation (of a string literal), the translation function should not
do any C<eval>s or C<require>s. This restriction should be lifted in
diff --git a/pod/perlapi.pod b/pod/perlapi.pod
index 3fb7754..8c3e6d6 100644
--- a/pod/perlapi.pod
+++ b/pod/perlapi.pod
@@ -4091,7 +4091,7 @@ Found in file sv.h
X<SvIOKp>
Returns a U32 value indicating whether the SV contains an integer. Checks
-the B<private> setting. Use C<SvIOK>.
+the B<private> setting. Use C<SvIOK> instead.
U32 SvIOKp(SV* sv)
@@ -4284,7 +4284,7 @@ Found in file sv.h
X<SvNIOKp>
Returns a U32 value indicating whether the SV contains a number, integer or
-double. Checks the B<private> setting. Use C<SvNIOK>.
+double. Checks the B<private> setting. Use C<SvNIOK> instead.
U32 SvNIOKp(SV* sv)
@@ -4315,7 +4315,7 @@ Found in file sv.h
X<SvNOKp>
Returns a U32 value indicating whether the SV contains a double. Checks the
-B<private> setting. Use C<SvNOK>.
+B<private> setting. Use C<SvNOK> instead.
U32 SvNOKp(SV* sv)
@@ -4451,7 +4451,7 @@ Found in file sv.h
X<SvPOKp>
Returns a U32 value indicating whether the SV contains a character string.
-Checks the B<private> setting. Use C<SvPOK>.
+Checks the B<private> setting. Use C<SvPOK> instead.
U32 SvPOKp(SV* sv)
@@ -6544,6 +6544,16 @@ use the Encode extension for that.
=for hackers
Found in file sv.c
+=item sv_utf8_upgrade_nomg
+X<sv_utf8_upgrade_nomg>
+
+Like sv_utf8_upgrade, but doesn't do magic on C<sv>
+
+ STRLEN sv_utf8_upgrade_nomg(SV *sv)
+
+=for hackers
+Found in file sv.c
+
=item sv_vcatpvf
X<sv_vcatpvf>
diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
index ca4ef84..f222e3d 100644
--- a/pod/perlebcdic.pod
+++ b/pod/perlebcdic.pod
@@ -153,20 +153,21 @@ depends on the ordinal number of that code point,
with larger numbers requiring more bytes.
UTF-EBCDIC is like UTF-8, but based on EBCDIC.
-In UTF-8, the code points corresponding to the lowest 128
-ordinal numbers (0 - 127) are the same (or C<invariant>)
-in UTF-8 or not. They occupy one byte each. All other Unicode code points
-require more than one byte to be represented in UTF-8.
-With UTF-EBCDIC, the term C<invariant> has a somewhat different meaning.
-(First, note that this is very different from the L</13 variant characters>
+You may see the term C<invariant> character or code point.
+This simply means that the character has the same numeric
+value when encoded as when not.
+(Note that this is a very different concept from L</The 13 variant characters>
mentioned above.)
-In UTF-EBCDIC, an C<invariant> character or code point
-is one which takes up exactly one byte encoded, regardless
-of whether or not the encoding changes its value
-(which it most likely will).
+For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
+and also is 193 when encoded in UTF-EBCDIC.
+All other code points occupy at least two bytes when encoded.
+In UTF-8, the code points corresponding to the lowest 128
+ordinal numbers (0 - 127: the ASCII characters) are invariant.
+In UTF-EBCDIC, there are 160 invariant characters.
(If you care, the EBCDIC invariants are those characters
-which correspond to the the ASCII characters, plus those that correspond to
+which have ASCII equivalents, plus those that correspond to
the C1 controls (80..9f on ASCII platforms).)
+
A string encoded in UTF-EBCDIC may be longer (but never shorter) than
one encoded in UTF-8.
diff --git a/pod/perlhack.pod b/pod/perlhack.pod
index dde67a3..c5f249e 100644
--- a/pod/perlhack.pod
+++ b/pod/perlhack.pod
@@ -214,7 +214,7 @@ changes.
How to clone and use the git perl repository is described in L<perlrepository>.
You can also choose to use rsync to get a copy of the current source tree
-for the bleadperl branch and all maintainance branches :
+for the bleadperl branch and all maintenance branches :
$ rsync -avz rsync://perl5.git.perl.org/APC/perl-current .
$ rsync -avz rsync://perl5.git.perl.org/APC/perl-5.10.x .
@@ -263,7 +263,7 @@ you're fixing a bug in the 5.8 track, patch against the C<blead> branch in
the git repository.)
If changes are accepted, they are applied to the development branch. Then
-the maintainance pumpking decides which of those patches is to be
+the maintenance pumpking decides which of those patches is to be
backported to the maint branch. Only patches that survive the heat of the
development branch get applied to maintenance versions.
@@ -2332,16 +2332,17 @@ about other ranges.
Many of the comments in the existing code ignore the possibility of EBCDIC,
and may be wrong therefore, even if the code works.
This is actually a tribute to the successful transparent insertion of being
-able to handle EBCDIC. without having to change pre-existing code.
+able to handle EBCDIC without having to change pre-existing code.
UTF-8 and UTF-EBCDIC are two different encodings used to represent Unicode
code points as sequences of bytes. Macros
with the same names (but different definitions)
in C<utf8.h> and C<utfebcdic.h>
-are used to allow the calling code think that there is only one such encoding.
-This is almost always referred to as C<utf8>, but it means the EBCDIC
-version as well. Comments in the code may well be wrong even if the code
-itself is right.
+are used to allow the calling code to think that there is only one such
+encoding.
+This is almost always referred to as C<utf8>, but it means the EBCDIC version
+as well. Again, comments in the code may well be wrong even if the code itself
+is right.
For example, the concept of C<invariant characters> differs between ASCII and
EBCDIC.
On ASCII platforms, only characters that do not have the high-order
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 068b2f3..3a52933 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -116,7 +116,7 @@ be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
-character semantics.
+character semantics. This can cause surprises: See L</BUGS>, below.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
@@ -1451,7 +1451,8 @@ This can lead to unexpected results in which a string's semantics suddenly
change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa.
This behavior is scheduled to change in version 5.12, but in the meantime,
-a workaround is to always call utf8::upgrade($string).
+a workaround is to always call utf8::upgrade($string), or to use the
+standard modules L<Encode> or L<charnames>.
=head2 Interaction with Extensions
@@ -1533,6 +1534,15 @@ be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).
+=head2 Possible problems on EBCDIC platforms
+
+In earlier versions, when byte and character data were concatenated,
+the new string was sometimes created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.
+
+If you find any of these, please report them as bugs.
+
=head2 Porting code from perl-5.6.X
Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
diff --git a/proto.h b/proto.h
index 3ec32c5..06e305b 100644
--- a/proto.h
+++ b/proto.h
@@ -3969,6 +3969,11 @@ PERL_CALLCONV void Perl_reginitcolors(pTHX);
#define PERL_ARGS_ASSERT_SV_UTF8_UPGRADE \
assert(sv)
+/* PERL_CALLCONV STRLEN Perl_sv_utf8_upgrade_nomg(pTHX_ SV *sv)
+ __attribute__nonnull__(pTHX_1); */
+#define PERL_ARGS_ASSERT_SV_UTF8_UPGRADE_NOMG \
+ assert(sv)
+
PERL_CALLCONV bool Perl_sv_utf8_downgrade(pTHX_ SV *const sv, const bool fail_ok)
__attribute__nonnull__(pTHX_1);
#define PERL_ARGS_ASSERT_SV_UTF8_DOWNGRADE \
diff --git a/sv.c b/sv.c
index cfae3b7..917c897 100644
--- a/sv.c
+++ b/sv.c
@@ -3154,6 +3154,10 @@ Returns the number of bytes in the converted string
This is not as a general purpose byte encoding to Unicode interface:
use the Encode extension for that.
+=for apidoc sv_utf8_upgrade_nomg
+
+Like sv_utf8_upgrade, but doesn't do magic on C<sv>
+
=for apidoc sv_utf8_upgrade_flags
Converts the PV of an SV to its UTF-8-encoded form.
diff --git a/sv.h b/sv.h
index a09a134..43bc541 100644
--- a/sv.h
+++ b/sv.h
@@ -593,7 +593,7 @@ double.
=for apidoc Am|U32|SvNIOKp|SV* sv
Returns a U32 value indicating whether the SV contains a number, integer or
-double. Checks the B<private> setting. Use C<SvNIOK>.
+double. Checks the B<private> setting. Use C<SvNIOK> instead.
=for apidoc Am|void|SvNIOK_off|SV* sv
Unsets the NV/IV status of an SV.
@@ -604,15 +604,15 @@ whether the value is defined or not.
=for apidoc Am|U32|SvIOKp|SV* sv
Returns a U32 value indicating whether the SV contains an integer. Checks
-the B<private> setting. Use C<SvIOK>.
+the B<private> setting. Use C<SvIOK> instead.
=for apidoc Am|U32|SvNOKp|SV* sv
Returns a U32 value indicating whether the SV contains a double. Checks the
-B<private> setting. Use C<SvNOK>.
+B<private> setting. Use C<SvNOK> instead.
=for apidoc Am|U32|SvPOKp|SV* sv
Returns a U32 value indicating whether the SV contains a character string.
-Checks the B<private> setting. Use C<SvPOK>.
+Checks the B<private> setting. Use C<SvPOK> instead.
=for apidoc Am|U32|SvIOK|SV* sv
Returns a U32 value indicating whether the SV contains an integer.
diff --git a/utfebcdic.h b/utfebcdic.h
index 8659b19..bb88571 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -9,6 +9,66 @@
* Macros to implement UTF-EBCDIC as perl's internal encoding
* Taken from version 7.1 of Unicode Technical Report #16:
* http://www.unicode.org/unicode/reports/tr16
+ *
+ * To summarize, the way it works is:
+ * To convert an EBCDIC character to UTF-EBCDIC:
+ * 1) convert to Unicode. The table in this file that does this for
+ * EBCDIC bytes is PL_e2a (with inverse PL_a2e). The 'a' stands for
+ * ASCIIish, meaning latin1.
+ * 2) convert that to a utf8-like string called I8 with variant characters
+ * occupying multiple bytes. This step is similar to the utf8-creating
+ * step from Unicode, but the details are different. There is a chart
+ * about the bit patterns in a comment later in this file. But
+ * essentially here are the differences:
+ * UTF8 I8
+ * invariant byte starts with 0 starts with 0 or 100
+ * continuation byte starts with 10 starts with 101
+ * start byte same in both: if the code point requires N bytes,
+ * then the leading N bits are 1, followed by a 0. (No
+ * trailing 0 for the very largest possible allocation
+ * in I8, far beyond the current Unicode standard's
+ * max, as shown in the comment later in this file.)
+ * 3) Use the table published in tr16 to convert each byte from step 2 into
+ * final UTF-EBCDIC. The table in this file is PL_utf2e, and its inverse
+ * is PL_e2utf. They are constructed so that all EBCDIC invariants remain
+ * invariant, but no others do. For example, the ordinal value of 'A' is
+ * 193 in EBCDIC, and also is 193 in UTF-EBCDIC. Step 1) converts it to
+ * 65, Step 2 leaves it at 65, and Step 3 converts it back to 193. As an
+ * example of how a variant character works, take LATIN SMALL LETTER Y
+ * WITH DIAERESIS, which is typically 0xDF in EBCDIC. Step 1 converts it
+ * to the Unicode value, 0xFF. Step 2 converts that to two bytes =
+ * 11000111 10111111 = C7 BF, and Step 3 converts those to 0x47 0xE7
+ *
+ * If you're starting from Unicode, skip step 1. For UTF-EBCDIC to straight
+ * EBCDIC, reverse the steps.
+ *
+ * The EBCDIC invariants have been chosen to be those characters whose Unicode
+ * equivalents have ordinal numbers less than 160, that is the same characters
+ * that are expressible in ASCII, plus the C1 controls. So there are 160
+ * invariants instead of the 128 in UTF-8. (My guess is that this is because
+ * the C1 control NEL (and maybe others) is important in IBM.)
+ *
+ * The purpose of Step 3 is to make the encoding be invariant for the chosen
+ * characters. This messes up the convenient patterns found in step 2, so
+ * generally, one has to undo step 3 into a temporary to use them. However,
+ * a "shadow", or parallel table, PL_utf8skip, has been constructed so that for
+ * each byte, it says how long the sequence is if that byte were to begin it
+ *
+ * There are actually 3 slightly different UTF-EBCDIC encodings in this file,
+ * one for each of the code pages recognized by Perl. That means that there
+ * are actually three different sets of tables, one for each code page. (If
+ * Perl is compiled on platforms using other EBCDIC code pages, it may not
+ * compile, or silently mistake it for one of the three.)
+ *
+ * EBCDIC characters above 0xFF are the same as Unicode in Perl's
+ * implementation of all 3 encodings, so for those Step 1 is trivial.
+ *
+ * (Note that the entries for invariant characters are necessarily the same in
+ * PL_e2a and PLe2f, and the same for their inverses.)
+ *
+ * UTF-EBCDIC strings are the same length or longer than UTF-8 representations
+ * of the same string. The maximum code point representable as 2 bytes in
+ * UTF-EBCDIC is 0x3FF, instead of 0x7FF in UTF-8.
*/
START_EXTERN_C
@@ -82,7 +142,9 @@ unsigned char PL_utf8skip[] = {
};
#endif
-/* Transform tables from tr16 applied after encoding to render encoding EBCDIC like */
+/* Transform tables from tr16 applied after encoding to render encoding EBCDIC
+ * like, meaning that all the invariants are actually invariant, eg, that 'A'
+ * remains 'A' */
#if '^' == 95 /* if defined(__MVS__) || defined(??) (VM/ESA?) 1047 */
EXTCONST unsigned char PL_utf2e[] = { /* UTF-8-mod to EBCDIC (IBM-1047) */
--
Perl5 Master Repository
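
Reader's note on Step 2 of the utfebcdic.h comment in the diff above: the following is a minimal C sketch of the I8 intermediate encoding, assuming the bit layouts given in Unicode Technical Report #16. The name i8_encode is hypothetical and does not appear in the perl source. The sketch handles code points up to 0x3FFFF only, and it stops before Step 3 (the byte-by-byte PL_utf2e table lookup), so its output is I8, not final UTF-EBCDIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of perl): encode a Unicode code point as I8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if the code point is outside the range this sketch covers. */
static size_t i8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0xA0) {
        /* The 160 invariants: one byte, 0xxxxxxx or 100xxxxx. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x400) {
        /* Two bytes: 110yyyyy 101zzzzz (10 payload bits). */
        out[0] = (unsigned char)(0xC0 | (cp >> 5));
        out[1] = (unsigned char)(0xA0 | (cp & 0x1F));
        return 2;
    }
    if (cp < 0x4000) {
        /* Three bytes: 1110xxxx 101yyyyy 101zzzzz (14 payload bits). */
        out[0] = (unsigned char)(0xE0 | (cp >> 10));
        out[1] = (unsigned char)(0xA0 | ((cp >> 5) & 0x1F));
        out[2] = (unsigned char)(0xA0 | (cp & 0x1F));
        return 3;
    }
    if (cp < 0x40000) {
        /* Four bytes: 11110www 101xxxxx 101yyyyy 101zzzzz (18 payload bits). */
        out[0] = (unsigned char)(0xF0 | (cp >> 15));
        out[1] = (unsigned char)(0xA0 | ((cp >> 10) & 0x1F));
        out[2] = (unsigned char)(0xA0 | ((cp >> 5) & 0x1F));
        out[3] = (unsigned char)(0xA0 | (cp & 0x1F));
        return 4;
    }
    return 0; /* Longer sequences are left out of this sketch. */
}
```

Calling i8_encode(0xFF, buf) returns 2 and fills buf with C7 BF, which matches the LATIN SMALL LETTER Y WITH DIAERESIS walk-through in the comment. Each continuation byte carries only 5 payload bits (against 6 in UTF-8), which is why an I8 sequence is never shorter than its UTF-8 counterpart.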