[perl.git] branch blead updated. v5.29.8-108-g912b808cb4

From: Karl Williamson
Date: March 14, 2019 18:18
Subject: [perl.git] branch blead updated. v5.29.8-108-g912b808cb4
Message ID: E1h4Uve-00075b-8h@git.dc.perl.space

In perl.git, the branch blead has been updated

<https://perl5.git.perl.org/perl.git/commitdiff/912b808cb4fcd596e07f77898c626f5567fbe994?hp=bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133>

- Log -----------------------------------------------------------------
commit 912b808cb4fcd596e07f77898c626f5567fbe994
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Mar 14 11:50:10 2019 -0600

    regnodes.h, perldebguts: Shorten some descriptions

commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Mar 14 11:48:11 2019 -0600

    Any Common digit set can match in any script
    
    This fixes a design flaw in script runs that in 5.30 effectively
    prevented digits from the Common script except the ASCII [0-9] from
    being in any meaningful script run.

-----------------------------------------------------------------------
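
The digit rule behind the second commit (all digits in a run must come
from one set of ten, whichever script that set belongs to) can be
sketched in C.  This is a minimal illustration under stated assumptions,
not the code in regexec.c: the names digit_zeros, zero_of and
digits_share_one_set are hypothetical, the table lists only a handful of
digit sets, and the script of each character is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, hand-picked table: the code point of DIGIT ZERO for a few
 * sets of ten consecutive decimal digits.  The real code works from full
 * Unicode tables; this short list is an assumption for illustration. */
static const uint32_t digit_zeros[] = {
    0x0030,   /* ASCII DIGIT ZERO */
    0x0660,   /* ARABIC-INDIC DIGIT ZERO */
    0x06F0,   /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0xFF10,   /* FULLWIDTH DIGIT ZERO */
    0x1D7CE,  /* MATHEMATICAL BOLD DIGIT ZERO */
};

/* Return the zero of the set containing cp, or 0 if cp is not a digit in
 * the table above. */
static uint32_t zero_of(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof digit_zeros / sizeof digit_zeros[0]; i++) {
        if (cp >= digit_zeros[i] && cp <= digit_zeros[i] + 9) {
            return digit_zeros[i];
        }
    }
    return 0;
}

/* Return true if every digit among the n code points comes from the same
 * set of ten.  Non-digits are skipped; no script check is done here. */
static bool digits_share_one_set(const uint32_t *cps, size_t n)
{
    uint32_t zero_of_run = 0;   /* 0 means "no digit seen yet" */
    size_t i;

    assert(cps != NULL || n == 0);

    for (i = 0; i < n; i++) {
        uint32_t zero_of_char = zero_of(cps[i]);

        if (zero_of_char == 0) {
            continue;                     /* not a digit */
        }
        if (zero_of_run == 0) {
            zero_of_run = zero_of_char;   /* first digit fixes the set */
        }
        else if (zero_of_char != zero_of_run) {
            return false;                 /* digits from two sets */
        }
    }
    return true;
}
```

Under these assumptions, the sequence A, U+FF10, U+FF19, B passes the
check, while ASCII "1" followed by U+FF12 fails it because the two digits
come from different sets.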

Summary of changes:
 pod/perldebguts.pod | 37 +++++++++++++++++--------------------
 pod/perldelta.pod   | 19 +++++++++++++++++++
 pod/perlre.pod      | 19 ++++++++-----------
 regcomp.sym         | 20 ++++++++++----------
 regexec.c           | 39 ++++++++++++---------------------------
 regnodes.h          | 20 ++++++++++----------
 t/re/script_run.t   | 19 +++++++++++++++++--
 7 files changed, 93 insertions(+), 80 deletions(-)

diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod
index 2aa906e903..ff2eaed89b 100644
--- a/pod/perldebguts.pod
+++ b/pod/perldebguts.pod
@@ -587,7 +587,7 @@ will be lost.
  BOUNDL           no         Like BOUND/BOUNDU, but \w and \W are
                              defined by current locale
  BOUNDU           no         Match "" at any boundary of a given type
-                             using Unicode rules
+                             using /u rules.
  BOUNDA           no         Match "" at any boundary between \w\W or
                              \W\w, where \w is [_a-zA-Z0-9]
  NBOUND           no         Like NBOUNDA for non-utf8, otherwise match
@@ -595,7 +595,7 @@ will be lost.
  NBOUNDL          no         Like NBOUND/NBOUNDU, but \w and \W are
                              defined by current locale
  NBOUNDU          no         Match "" at any non-boundary of a given
-                             type using using Unicode rules
+                             type using /u rules.
  NBOUNDA          no         Match "" betweeen any \w\w or \W\W, where
                              \w is [_a-zA-Z0-9]
 
@@ -720,28 +720,25 @@ will be lost.
  SRCLOSE          none       Close preceding SROPEN
 
  REF              num 1      Match some already matched string
- REFF             num 1      Match already matched string, folded using
-                             native charset rules for non-utf8
- REFFL            num 1      Match already matched string, folded in
-                             loc.
- REFFU            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- REFFA            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ REFF             num 1      Match already matched string, using /di
+                             rules.
+ REFFL            num 1      Match already matched string, using /li
+                             rules.
+ REFFU            num 1      Match already matched string, using /ui.
+ REFFA            num 1      Match already matched string, using /aai
+                             rules.
 
  # Named references.  Code in regcomp.c assumes that these all are after
  # the numbered references
  NREF             no-sv 1    Match some already matched string
- NREFF            no-sv 1    Match already matched string, folded using
-                             native charset rules for non-utf8
- NREFFL           no-sv 1    Match already matched string, folded in
-                             loc.
- NREFFU           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- NREFFA           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ NREFF            no-sv 1    Match already matched string, using /di
+                             rules.
+ NREFFL           no-sv 1    Match already matched string, using /li
+                             rules.
+ NREFFU           num 1      Match already matched string, using /ui
+                             rules.
+ NREFFA           num 1      Match already matched string, using /aai
+                             rules.
 
  # Support for long RE
  LONGJMP          off 1 1    Jump far away.
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 06ae872679..68f4ba9fac 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -91,6 +91,20 @@ It croaks if it would otherwise return a UTF-8 string that contains
 malformed UTF-8.  This protects agains potential security threats.  This
 is considered a bug fix as well ([perl #131642]).
 
+=head2 Any set of digits in the Common script are legal in a script run
+of another script
+
+There are several sets of digits in the Common script.  C<[0-9]> is the
+most familiar.  But there are also C<[\x{FF10}-\x{FF19}]> (FULLWIDTH
+DIGIT ZERO - FULLWIDTH DIGIT NINE), and several sets for use in
+mathematical notation, such as the MATHEMATICAL DOUBLE-STRUCK DIGITs.
+Any of these sets should be able to appear in script runs of, say,
+Greek.  But the design of 5.30 overlooked all but the ASCII digits
+C<[0-9]>, so the design was flawed.  This has been fixed, so is both a
+bug fix and an incompatibility. [perl #133547]
+
+All digits in a run still have to come from the same set of ten digits.
+
 =head1 Deprecations
 
 XXX Any deprecated features, syntax, modules etc. should be listed here.
@@ -430,6 +444,11 @@ C<pack()> no longer can return malformed UTF-8.  It croaks if it would
 otherwise return a UTF-8 string that contains malformed UTF-8.  This
 protects agains potential security threats.  [perl #131642]
 
+=item *
+
+See L</Any set of digits in the Common script are legal in a script run
+of another script>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
 with some Chinese characters, and hence are treated as being in a single
 script run by Perl.
 
-The rules used for matching decimal digits are somewhat different.  Many
+The rules used for matching decimal digits are slightly stricter.  Many
 scripts have their own sets of digits equivalent to the Western C<0>
 through C<9> ones.  A few, such as Arabic, have more than one set.  For
 a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered. The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set.  This is because these are often used in commerce even in such
-scripts.  But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match.  As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,
 
  qr/(*script_run: \d+ \b )/x
 
@@ -2579,11 +2576,11 @@ accent of some type.  These are considered to be in the script of the
 master character, and so never cause a script run to not match.
 
 The other one is "Common".  This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>.  These characters can appear intermixed in text in many of
-the world's scripts.  These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits.  These characters
+can appear intermixed in text in many of the world's scripts.  These
+also don't cause a script run to not match.  But like other scripts, all
+digits in a run must come from the same set of 10.
 
 This construct is non-capturing.  You can add parentheses to I<pattern>
 to capture, if desired.  You will have to do this if you plan to use
diff --git a/regcomp.sym b/regcomp.sym
index 09a21e9cc0..4b9a42c338 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -47,12 +47,12 @@ GPOS        GPOS,       no        ; Matches where last m//g left off.
 # BOUND, POSIX and their complements are affected, as well as EXACTF.
 BOUND       BOUND,      no        ; Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w
 BOUNDL      BOUND,      no        ; Like BOUND/BOUNDU, but \w and \W are defined by current locale
-BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using Unicode rules
+BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using /u rules.
 BOUNDA      BOUND,      no        ; Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9]
 # All NBOUND nodes are required by code in regexec.c to be greater than all BOUND ones
 NBOUND      NBOUND,     no        ; Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W
 NBOUNDL     NBOUND,     no        ; Like NBOUND/NBOUNDU, but \w and \W are defined by current locale
-NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using using Unicode rules
+NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using /u rules.
 NBOUNDA     NBOUND,     no        ; Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9]
 
 #* [Special] alternatives:
@@ -156,21 +156,21 @@ SROPEN      SROPEN,     none      ; Same as OPEN, but for script run
 SRCLOSE     SRCLOSE,    none      ; Close preceding SROPEN
 
 REF         REF,        num 1 V   ; Match some already matched string
-REFF        REF,        num 1 V   ; Match already matched string, folded using native charset rules for non-utf8
-REFFL       REF,        num 1 V   ; Match already matched string, folded in loc.
+REFF        REF,        num 1 V   ; Match already matched string, using /di rules.
+REFFL       REF,        num 1 V   ; Match already matched string, using /li rules.
 # N?REFF[AU] could have been implemented using the FLAGS field of the
 # regnode, but by having a separate node type, we can use the existing switch
 # statement to avoid some tests
-REFFU       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8
-REFFA       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+REFFU       REF,        num 1 V   ; Match already matched string, using /ui.
+REFFA       REF,        num 1 V   ; Match already matched string, using /aai rules.
 
 #*Named references.  Code in regcomp.c assumes that these all are after
 #*the numbered references
 NREF        REF,        no-sv 1 V ; Match some already matched string
-NREFF       REF,        no-sv 1 V ; Match already matched string, folded using native charset rules for non-utf8
-NREFFL      REF,        no-sv 1 V ; Match already matched string, folded in loc.
-NREFFU      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8
-NREFFA      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+NREFF       REF,        no-sv 1 V ; Match already matched string, using /di rules.
+NREFFL      REF,        no-sv 1 V ; Match already matched string, using /li rules.
+NREFFU      REF,        num   1 V ; Match already matched string, using /ui rules.
+NREFFA      REF,        num   1 V ; Match already matched string, using /aai rules.
 
 #*Support for long RE
 LONGJMP     LONGJMP,    off 1 . 1 ; Jump far away.
diff --git a/regexec.c b/regexec.c
index 64a65462b5..dff221a99c 100644
--- a/regexec.c
+++ b/regexec.c
@@ -10252,11 +10252,13 @@ Additionally all decimal digits must come from the same consecutive sequence of
 
 For example, if all the characters in the sequence are Greek, or Common, or
 Inherited, this function will return TRUE, provided any decimal digits in it
-are the ASCII digits "0".."9".  For scripts (unlike Greek) that have their own
-digits defined this will accept either digits from that set or from 0..9, but
-not a combination of the two.  Some scripts, such as Arabic, have more than one
-set of digits.  All digits must come from the same set for this function to
-return TRUE.
+are from the same block of digits in Common.  (These are the ASCII digits
+"0".."9" and additionally a block for full width forms of these, and several
+others used in mathematical notation.)   For scripts (unlike Greek) that have
+their own digits defined this will accept either digits from that set or from
+one of the Common digit sets, but not a combination of the two.  Some scripts,
+such as Arabic, have more than one set of digits.  All digits must come from
+the same set for this function to return TRUE.
 
 C<*ret_script>, if C<ret_script> is not NULL, will on return of TRUE
 contain the script found, using the C<SCX_enum> typedef.  Its value will be
@@ -10359,10 +10361,9 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         UV cp;
 
         /* The code allows all scripts to use the ASCII digits.  This is
-         * because they are used in commerce even in scripts that have their
-         * own set.  Hence any ASCII ones found are ok, unless and until a
-         * digit from another set has already been encountered.  (The other
-         * digit ranges in Common are not similarly blessed) */
+         * because they are in the Common script.  Hence any ASCII ones found
+         * are ok, unless and until a digit from another set has already been
+         * encountered. */
         if (UNLIKELY(isDIGIT(*s))) {
             if (UNLIKELY(script_of_run == SCX_Unknown)) {
                 retval = FALSE;
@@ -10456,19 +10457,11 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         /* If the run so far is Common, and the new character isn't, change the
          * run's script to that of this character */
         if (script_of_run == SCX_Common && script_of_char != SCX_Common) {
-
-            /* But Common contains several sets of digits.  Only the '0' set
-             * can be part of another script. */
-            if (zero_of_run && zero_of_run != '0') {
-                retval = FALSE;
-                break;
-            }
-
             script_of_run = script_of_char;
         }
 
-        /* Now we can see if the script of the character is the same as that of
-         * the run */
+        /* Now we can see if the script of the new character is the same as
+         * that of the run */
         if (LIKELY(script_of_char == script_of_run)) {
             /* By far the most common case */
             goto scripts_match;
@@ -10668,14 +10661,6 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
                 break;
             }
         }
-        else if (script_of_char == SCX_Common && script_of_run != SCX_Common) {
-
-            /* Here, the script run isn't Common, but the current digit is in
-             * Common, and isn't '0'-'9' (those were handled earlier).   Only
-             * '0'-'9' are acceptable in non-Common scripts. */
-            retval = FALSE;
-            break;
-        }
         else {  /* Otherwise we now have a zero for this run */
             zero_of_run = zero_of_char;
         }
diff --git a/regnodes.h b/regnodes.h
index 412a630561..3b53c1715f 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -21,11 +21,11 @@
 #define	GPOS                  	7	/* 0x07 Matches where last m//g left off. */
 #define	BOUND                 	8	/* 0x08 Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w */
 #define	BOUNDL                	9	/* 0x09 Like BOUND/BOUNDU, but \w and \W are defined by current locale */
-#define	BOUNDU                	10	/* 0x0a Match "" at any boundary of a given type using Unicode rules */
+#define	BOUNDU                	10	/* 0x0a Match "" at any boundary of a given type using /u rules. */
 #define	BOUNDA                	11	/* 0x0b Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9] */
 #define	NBOUND                	12	/* 0x0c Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W */
 #define	NBOUNDL               	13	/* 0x0d Like NBOUND/NBOUNDU, but \w and \W are defined by current locale */
-#define	NBOUNDU               	14	/* 0x0e Match "" at any non-boundary of a given type using using Unicode rules */
+#define	NBOUNDU               	14	/* 0x0e Match "" at any non-boundary of a given type using /u rules. */
 #define	NBOUNDA               	15	/* 0x0f Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9] */
 #define	REG_ANY               	16	/* 0x10 Match any one character (except newline). */
 #define	SANY                  	17	/* 0x11 Match any one character. */
@@ -72,15 +72,15 @@
 #define	SROPEN                	58	/* 0x3a Same as OPEN, but for script run */
 #define	SRCLOSE               	59	/* 0x3b Close preceding SROPEN */
 #define	REF                   	60	/* 0x3c Match some already matched string */
-#define	REFF                  	61	/* 0x3d Match already matched string, folded using native charset rules for non-utf8 */
-#define	REFFL                 	62	/* 0x3e Match already matched string, folded in loc. */
-#define	REFFU                 	63	/* 0x3f Match already matched string, folded using unicode rules for non-utf8 */
-#define	REFFA                 	64	/* 0x40 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define	REFF                  	61	/* 0x3d Match already matched string, using /di rules. */
+#define	REFFL                 	62	/* 0x3e Match already matched string, using /li rules. */
+#define	REFFU                 	63	/* 0x3f Match already matched string, using /ui. */
+#define	REFFA                 	64	/* 0x40 Match already matched string, using /aai rules. */
 #define	NREF                  	65	/* 0x41 Match some already matched string */
-#define	NREFF                 	66	/* 0x42 Match already matched string, folded using native charset rules for non-utf8 */
-#define	NREFFL                	67	/* 0x43 Match already matched string, folded in loc. */
-#define	NREFFU                	68	/* 0x44 Match already matched string, folded using unicode rules for non-utf8 */
-#define	NREFFA                	69	/* 0x45 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define	NREFF                 	66	/* 0x42 Match already matched string, using /di rules. */
+#define	NREFFL                	67	/* 0x43 Match already matched string, using /li rules. */
+#define	NREFFU                	68	/* 0x44 Match already matched string, using /ui rules. */
+#define	NREFFA                	69	/* 0x45 Match already matched string, using /aai rules. */
 #define	LONGJMP               	70	/* 0x46 Jump far away. */
 #define	BRANCHJ               	71	/* 0x47 BRANCH with long offset. */
 #define	IFMATCH               	72	/* 0x48 Succeeds if the following matches; non-zero flags "f" means lookbehind assertion starting "f" characters before current */
diff --git a/t/re/script_run.t b/t/re/script_run.t
index 035a9104aa..19d4e10e53 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -51,8 +51,8 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     unlike("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER SMALL Y}", $script_run, "Hebrew then Latin isn't a script run");
     like("9876543210\N{DESERET SMALL LETTER WU}", $script_run, "0-9 are the digits for Deseret");
     like("\N{DESERET SMALL LETTER WU}9876543210", $script_run, "Also when they aren't in the initial position");
-    unlike("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits aren't the digits for Deseret");
-    unlike("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
+    like("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits may be digits for Deseret");
+    like("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
 
     like("1234567890\N{ARABIC LETTER ALEF}", $script_run, "[0-9] work for Arabic");
     unlike("1234567890\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT FIVE}", $script_run, "... but not in combination with real ARABIC digits");
@@ -104,4 +104,19 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     like("\x{3041}12\x{3041}", qr/^(*sr:.{4})/,
          "Script without own zero works with ASCII digits");
 
+    like("A\x{ff10}\x{ff19}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{ff10}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}\x{1d7cf}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("\x{1d7ce}\x{1d7cf}AB", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("α\x{1d7ce}βγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+    like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+
 done_testing();
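
The effect of the two checks removed from regexec.c can be sketched
with a toy model in C.  Everything here is an assumption made for
illustration: the toy_script enum, the crude code point ranges, the
four-entry digit table and the old_rule flag are hypothetical and are
not Perl's SCX_enum, its Unicode tables, or the actual
Perl_isSCRIPT_RUN logic.  With old_rule set, the sketch rejects a
non-ASCII Common digit set in a non-Common run, as before this commit;
with it clear, any single Common digit set is accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy classification for illustration only; NOT Perl's SCX_enum. */
typedef enum { TOY_COMMON, TOY_LATIN, TOY_GREEK, TOY_ARABIC } toy_script;

static toy_script toy_script_of(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return TOY_LATIN;
    if (cp >= 0x0391 && cp <= 0x03C9)   /* roughly Alpha..omega */
        return TOY_GREEK;
    if (cp >= 0x0620 && cp <= 0x06FF)   /* crude slice of the Arabic block */
        return TOY_ARABIC;
    return TOY_COMMON;                  /* everything else */
}

/* Zero of the digit set containing cp, or 0 if cp is not a known digit.
 * Sets: ASCII, FULLWIDTH, MATHEMATICAL BOLD, EXTENDED ARABIC-INDIC. */
static uint32_t toy_zero_of(uint32_t cp)
{
    static const uint32_t zeros[] = { 0x0030, 0xFF10, 0x1D7CE, 0x06F0 };
    size_t i;

    for (i = 0; i < sizeof zeros / sizeof zeros[0]; i++) {
        if (cp >= zeros[i] && cp <= zeros[i] + 9)
            return zeros[i];
    }
    return 0;
}

/* old_rule = true mimics the two checks this commit removes. */
static bool toy_is_script_run(const uint32_t *cps, size_t n, bool old_rule)
{
    toy_script script_of_run = TOY_COMMON;
    uint32_t   zero_of_run   = 0;       /* 0 means "no digit seen yet" */
    size_t i;

    assert(cps != NULL || n == 0);

    for (i = 0; i < n; i++) {
        toy_script script_of_char = toy_script_of(cps[i]);
        uint32_t   zero_of_char   = toy_zero_of(cps[i]);

        /* A Common run adopts the script of the first non-Common char */
        if (script_of_run == TOY_COMMON && script_of_char != TOY_COMMON) {
            /* First removed check: only the '0' set could be carried
             * over into another script */
            if (old_rule && zero_of_run != 0 && zero_of_run != '0')
                return false;
            script_of_run = script_of_char;
        }

        /* Common characters fit any run; others must match its script */
        if (script_of_char != TOY_COMMON && script_of_char != script_of_run)
            return false;

        if (zero_of_char == 0)
            continue;                   /* not a digit */

        if (zero_of_run != 0) {
            if (zero_of_char != zero_of_run)
                return false;           /* digits from two sets */
        }
        else if (old_rule && script_of_char == TOY_COMMON
                 && script_of_run != TOY_COMMON && zero_of_char != '0') {
            /* Second removed check: a non-ASCII Common digit in a
             * non-Common run */
            return false;
        }
        else {
            zero_of_run = zero_of_char; /* first digit fixes the set */
        }
    }
    return true;
}
```

In this model the sequence A, U+FF10, U+FF19, B is a script run only
when old_rule is false, which mirrors the tests added above.  Mixing
ASCII digits with U+06F4 after an Arabic letter still fails under
either setting, because the digits come from two different sets.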

-- 
Perl5 Master Repository