develooper Front page | perl.perl5.changes | Postings from February 2019

[perl.git] branch blead updated. v5.29.7-63-g2a614cdcff

From:
Karl Williamson
Date:
February 5, 2019 04:01
Subject:
[perl.git] branch blead updated. v5.29.7-63-g2a614cdcff
Message ID:
E1gqruo-0001Ca-TX@git.dc.perl.space
In perl.git, the branch blead has been updated

<https://perl5.git.perl.org/perl.git/commitdiff/2a614cdcffdf336bc49e524a4ac3af94df7d4d00?hp=a1f354d3762aa87a796bc3d383629bfc853472f5>

- Log -----------------------------------------------------------------
commit 2a614cdcffdf336bc49e524a4ac3af94df7d4d00
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Feb 1 08:48:20 2019 -0700

    regen/unicode_constants.pl: generate UTF-8 for U+307
    
    This will be needed in a future commit

commit e0bfe19f1cff16db3441822a6812a07ca124c861
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Feb 1 08:29:51 2019 -0700

    t/loc_tools.pl: Add fcn to return all UTF-8 locales
    
    This will be needed in future commits

commit 79ba27676437312e9dd6ce7ea8a47676cb57e5fc
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Feb 1 11:45:34 2019 -0700

    pp.c: White-space only
    
    Indent block newly formed in the previous commit

commit dbb3849a8c02c652b48b25b770fc36b743b162db
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Feb 1 11:43:10 2019 -0700

    pp.c: Avoid use of unsafe function
    
    The function is unsafe because it doesn't check for running off the end
    of the buffer if presented with illegal UTF-8.  The only remaining use
    now is from mathoms.c.

commit 02601e33951e916a19e46272146a0b59862aaff5
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Feb 1 11:41:14 2019 -0700

    pp.c: Add branch prediction hint
    
    This conditional is very rarely true

commit 2f8f985a27faf25c5a535cbe67d098690668c0f9
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 11:24:12 2019 -0700

    pp.c: Don't assume worst case memory needs
    
    Since 5.28, there has been a function that will calculate the expansion
    of a string when converted into UTF-8, using per-word operations.  This
    means it runs 8 times faster than doing this count previously would have
    taken.
    
    I've come to believe it is better to calculate how much memory we need
    than to overallocate based on worst-case scenarios.  This is because in
    very large strings, over allocating can lead to unnecessary inefficient
    processing.
    
    This commit changes several instances in pp.c where a string needs to be
    converted to UTF-8 to not assume the worst case, but instead calculate
    what's needed using the faster function.

commit 78ed8e3629d58d11345e4367dbe14b9603e8c84b
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 11:09:01 2019 -0700

    pp.c: Don't use function call for easy copy
    
    Like the previous commit, this code is adding the UTF-8 for a Greek
    character to a string.  It previously used Copy, but this character is
    representable as two bytes in both ASCII and EBCDIC UTF-8, the only
    character sets that Perl will ever supports, so we can use the
    specialized code that is used most everywhere else for two byte UTF-8
    characters, avoiding the function overhead, and having to treat this
    character as particularly special.

commit 93327b758a54c8e1ff7ee137a513caff4d077a7d
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 10:52:41 2019 -0700

    pp.c: Don't use function call for easy copy
    
    This code is adding the UTF-8 for a Greek character to a string.  It
    previously used Copy, but this character is representable as two bytes
    in both ASCII and EBCDIC UTF-8, the only character sets that Perl will
    ever supports, so we can use the specialized code that is used most
    everywhere else for two byte UTF-8 characters, avoiding the function
    overhead, and having to treat this character as particularly special.

commit 526f8cbff8ce0a6402d8eb64ac3970e48c8716c3
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 10:35:21 2019 -0700

    pp.c: pp_fc(): Simplify
    
    The function being called does everything that the code being eliminated
    here did.  We just pass the function the final destination instead of a
    temporary.

commit a8e41ef404b996cb8f50be6cce716145ac4a3f67
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 10:27:17 2019 -0700

    pp.c: White-space, comments only

commit ca62a7c2ce92965c24def9ea277e9ad42ea797d1
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 10:02:35 2019 -0700

    pp.c: Reorder clause order in an 'if'
    
    This makes the test most likely to fail be first, and adding an
    UNLIKELY() to it, thus saving a conditional in most instances.

commit df7d4938c6907db4b8030fd133ca9d55e1e44a0d
Author: Karl Williamson <khw@cpan.org>
Date:   Tue Jan 29 22:02:59 2019 -0700

    pp.c: Use faster method to convert to UTF-8
    
    There is a special inline function that's used when converting a single
    byte to UTF-8, that is faster than the more general one used prior to
    this commit.

commit f4cd1cd9e8d271b135a75b4b6fd817fa758c112a
Author: Karl Williamson <khw@cpan.org>
Date:   Tue Jan 29 22:01:18 2019 -0700

    pp.c: Add missing assert
    
    The comments say there is an assert, but it wasn't there.

commit 1c4079115ad9f58e29e98bd09de8772737e77be5
Author: Karl Williamson <khw@cpan.org>
Date:   Mon Feb 4 16:02:35 2019 -0700

    t/op/lc.t: Add 'use strict'

commit 5583386ecf7417b7a05ab2f75b7284e6c90079fa
Author: Karl Williamson <khw@cpan.org>
Date:   Tue Jan 29 22:25:03 2019 -0700

    t/re/fold_grind.pl: White-space only
    
    Just align some logical or clauses for readability.

commit 247985d477048e4fea858e98efd13e728744b370
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 09:08:13 2019 -0700

    handy.h: Add comment

commit 5a10328cd52e3a7a3fa9244dbc367ee439850cab
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Jan 25 09:55:58 2019 -0700

    handy.h: White-space only
    
    Vertically align the ternary colon with the question mark above it.

commit 9d3980bc229750e6c07726fe529f02bf4dc6a5a5
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 23 15:42:35 2019 -0700

    handy.h: Add void * casts to memEQ, memNE
    
    This change is to allow these macros to be called without having to do
    casting in the call.

commit 813cfad2cc5a494533659beaa4833ff222b4e131
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jan 30 15:00:30 2019 -0700

    regcomp.c: Fix recent optimization of [...] bug
    
    This bug was introduced in b2296192536090829ba6d2cb367456f4e346dcc6
    n 5.29.7.  Using /il should not result in looking for a [:posix:] class
    that matches the code points given.

-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                  |   2 +-
 embed.h                    |   2 +-
 handy.h                    |  10 +-
 invlist_inline.h           |   2 +-
 pp.c                       | 239 +++++++++++++++++++++++++--------------------
 proto.h                    |   2 +-
 regcomp.c                  |   2 +-
 regen/unicode_constants.pl |   4 +-
 t/loc_tools.pl             |  24 ++++-
 t/op/lc.t                  |   7 +-
 t/re/anyof.t               |   1 +
 t/re/fold_grind.pl         |   6 +-
 unicode_constants.h        |  12 +--
 13 files changed, 178 insertions(+), 135 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index d311ca7f51..c7816d531c 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1748,7 +1748,7 @@ EXp	|SV*	|_core_swash_init|NN const char* pkg|NN const char* name \
 		|NN SV* listsv|I32 minbits|I32 none \
 		|NULLOK SV* invlist|NULLOK U8* const flags_p
 #endif
-#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C)
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 EiMRn	|UV*	|invlist_array	|NN SV* const invlist
 EiMRn	|bool	|is_invlist	|NN SV* const invlist
 EiMRn	|bool*	|get_invlist_offset_addr|NN SV* invlist
diff --git a/embed.h b/embed.h
index f3b95eadbd..149f1bee25 100644
--- a/embed.h
+++ b/embed.h
@@ -1249,7 +1249,7 @@
 #endif
 #define regprop(a,b,c,d,e)	Perl_regprop(aTHX_ a,b,c,d,e)
 #  endif
-#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C)
+#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 #define _get_swash_invlist(a)	Perl__get_swash_invlist(aTHX_ a)
 #define _invlist_contains_cp	S__invlist_contains_cp
 #define _invlist_len		S__invlist_len
diff --git a/handy.h b/handy.h
index d2a7801a25..954b9caa30 100644
--- a/handy.h
+++ b/handy.h
@@ -507,8 +507,8 @@ based on the underlying C library functions):
 #define strnNE(s1,s2,l) (strncmp(s1,s2,l) != 0)
 #define strnEQ(s1,s2,l) (strncmp(s1,s2,l) == 0)
 
-#define memNE(s1,s2,l) (memcmp(s1,s2,l) != 0)
-#define memEQ(s1,s2,l) (memcmp(s1,s2,l) == 0)
+#define memEQ(s1,s2,l) (memcmp(((const void *) (s1)), ((const void *) (s2)), l) == 0)
+#define memNE(s1,s2,l) (! memEQ(s1,s2,l))
 
 /* memEQ and memNE where second comparand is a string constant */
 #define memEQs(s1, l, s2) \
@@ -1540,12 +1540,14 @@ END_EXTERN_C
                                           || (char)(c) == '_'))
 
 /* These next three are also for internal core Perl use only: case-change
- * helper macros */
+ * helper macros.  The reason for using the PL_latin arrays is in case the
+ * system function is defective; it ensures uniform results that conform to the
+ * Unicod standard. */
 #define _generic_toLOWER_LC(c, function, cast)  (! FITS_IN_8_BITS(c)           \
                                                 ? (c)                          \
                                                 : (IN_UTF8_CTYPE_LOCALE)       \
                                                   ? PL_latin1_lc[ (U8) (c) ]   \
-                                                : (cast)function((cast)(c)))
+                                                  : (cast)function((cast)(c)))
 
 /* Note that the result can be larger than a byte in a UTF-8 locale.  It
  * returns a single value, so can't adequately return the upper case of LATIN
diff --git a/invlist_inline.h b/invlist_inline.h
index cd002cef19..1304b4543a 100644
--- a/invlist_inline.h
+++ b/invlist_inline.h
@@ -9,7 +9,7 @@
 #ifndef PERL_INVLIST_INLINE_H_
 #define PERL_INVLIST_INLINE_H_
 
-#if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C)
+#if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_PP_C)
 
 /* An element is in an inversion list iff its index is even numbered: 0, 2, 4,
  * etc */
diff --git a/pp.c b/pp.c
index 880f266081..522e985931 100644
--- a/pp.c
+++ b/pp.c
@@ -28,12 +28,10 @@
 #include "perl.h"
 #include "keywords.h"
 
+#include "invlist_inline.h"
 #include "reentr.h"
 #include "regcharclass.h"
 
-static const STRLEN small_mu_len = sizeof(GREEK_SMALL_LETTER_MU_UTF8) - 1;
-static const STRLEN capital_iota_len = sizeof(GREEK_CAPITAL_LETTER_IOTA_UTF8) - 1;
-
 /* variations on pp_null */
 
 PP(pp_stub)
@@ -364,7 +362,7 @@ PP(pp_rv2cv)
 	cv = SvTYPE(SvRV(gv)) == SVt_PVCV
 	    ? MUTABLE_CV(SvRV(gv))
 	    : MUTABLE_CV(gv);
-    }    
+    }
     else
 	cv = MUTABLE_CV(&PL_sv_undef);
     SETs(MUTABLE_SV(cv));
@@ -670,7 +668,7 @@ PP(pp_study)
 
 PP(pp_trans)
 {
-    dSP; 
+    dSP;
     SV *sv;
 
     if (PL_op->op_flags & OPf_STACKED)
@@ -1161,18 +1159,18 @@ PP(pp_pow)
 			else if (result <= (UV)IV_MAX)
 			    /* answer negative, fits in IV */
 			    SETi( -(IV)result );
-			else if (result == (UV)IV_MIN) 
+			else if (result == (UV)IV_MIN)
 			    /* 2's complement assumption: special case IV_MIN */
 			    SETi( IV_MIN );
 			else
 			    /* answer negative, doesn't fit */
 			    SETn( -(NV)result );
 			RETURN;
-		    } 
+		    }
 		}
     }
   float_it:
-#endif    
+#endif
     {
 	NV right = SvNV_nomg(svr);
 	NV left  = SvNV_nomg(svl);
@@ -1905,7 +1903,7 @@ PP(pp_subtract)
 	    UV result;
 	    UV buv;
 	    bool buvok = SvUOK(svr);
-	
+
 	    if (buvok)
 		buv = SvUVX(svr);
 	    else {
@@ -2893,7 +2891,7 @@ PP(pp_rand)
     {
 	dSP;
 	NV value;
-    
+
 	if (MAXARG < 1)
 	{
 	    EXTEND(SP, 1);
@@ -3064,7 +3062,7 @@ PP(pp_oct)
 	 /* If Unicode, try to downgrade
 	  * If not possible, croak. */
 	 SV* const tsv = sv_2mortal(newSVsv(sv));
-	
+
 	 SvUTF8_on(tsv);
 	 sv_utf8_downgrade(tsv, FALSE);
 	 tmps = SvPV_const(tsv, len);
@@ -3539,7 +3537,7 @@ PP(pp_index)
             /* $lex = (index() == -1) */
             sv_setsv(TARG, TOPs);
     }
-    else 
+    else
         PUSHi(retval);
     RETURN;
 }
@@ -3681,7 +3679,7 @@ PP(pp_crypt)
 #endif
 }
 
-/* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level.  So 
+/* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level.  So
  * most comments below say UTF-8, when in fact they mean UTF-EBCDIC as well */
 
 
@@ -3747,12 +3745,15 @@ PP(pp_ucfirst)
 #endif
 	}
         else {
+
 #ifdef USE_LOCALE_CTYPE
+
 	    _toLOWER_utf8_flags(s, s + slen, tmpbuf, &tculen, IN_LC_RUNTIME(LC_CTYPE));
 #else
 	    _toLOWER_utf8_flags(s, s + slen, tmpbuf, &tculen, 0);
 #endif
-	}
+
+        }
 
         /* we can't do in-place if the length changes.  */
         if (ulen != tculen) inplace = FALSE;
@@ -3760,7 +3761,7 @@ PP(pp_ucfirst)
     }
     else { /* Non-zero length, non-UTF-8,  Need to consider locale and if
 	    * latin1 is treated as caseless.  Note that a locale takes
-	    * precedence */ 
+	    * precedence */
 	ulen = 1;	/* Original character is 1 byte */
 	tculen = 1;	/* Most characters will require one byte, but this will
 			 * need to be overridden for the tricky ones */
@@ -3824,13 +3825,16 @@ PP(pp_ucfirst)
 		    inplace = FALSE;
 
                     /* If the result won't fit in a byte, the entire result
-                     * will have to be in UTF-8.  Assume worst case sizing in
-                     * conversion. (all latin1 characters occupy at most two
-                     * bytes in utf8) */
+                     * will have to be in UTF-8.  Allocate enough space for the
+                     * expanded first byte, and if UTF-8, the rest of the input
+                     * string, some or all of which may also expand to two
+                     * bytes, plus the terminating NUL. */
 		    if (title_ord > 255) {
 			doing_utf8 = TRUE;
 			convert_source_to_utf8 = TRUE;
-			need = slen * 2 + 1;
+			need = slen
+                            + variant_under_utf8_count(s, s + slen)
+                            + 1;
 
                         /* The (converted) UTF-8 and UTF-EBCDIC lengths of all
                          * (both) characters whose title case is above 255 is
@@ -3890,13 +3894,16 @@ PP(pp_ucfirst)
 
 		/* Assert tculen is 2 here because the only two characters that
 		 * get to this part of the code have 2-byte UTF-8 equivalents */
+                assert(tculen == 2);
 		*d++ = *tmpbuf;
 		*d++ = *(tmpbuf + 1);
 		s++;	/* We have just processed the 1st char */
 
-		for (; s < send; s++) {
-		    d = uvchr_to_utf8(d, *s);
-		}
+                while (s < send) {
+                    append_utf8_from_native_byte(*s, &d);
+                    s++;
+                }
+
 		*d = '\0';
 		SvCUR_set(dest, d - (U8*)SvPVX_const(dest));
 	    }
@@ -3908,7 +3915,7 @@ PP(pp_ucfirst)
 	}
 
     }
-    else {  /* Neither source nor dest are in or need to be UTF-8 */
+    else {  /* Neither source nor dest are, nor need to be UTF-8 */
 	if (slen) {
 	    if (inplace) {  /* in-place, only need to change the 1st char */
 		*d = *tmpbuf;
@@ -3949,9 +3956,6 @@ PP(pp_ucfirst)
     return NORMAL;
 }
 
-/* There's so much setup/teardown code common between uc and lc, I wonder if
-   it would be worth merging the two, and just having a switch outside each
-   of the three tight loops.  There is less and less commonality though */
 PP(pp_uc)
 {
     dSP;
@@ -4018,6 +4022,8 @@ PP(pp_uc)
 	const U8 *const send = s + len;
 	U8 tmpbuf[UTF8_MAXBYTES_CASE+1];
 
+#define GREEK_CAPITAL_LETTER_IOTA 0x0399
+#define COMBINING_GREEK_YPOGEGRAMMENI 0x0345
 	/* All occurrences of these are to be moved to follow any other marks.
 	 * This is context-dependent.  We may not be passed enough context to
 	 * move the iota subscript beyond all of them, but we do the best we can
@@ -4034,12 +4040,16 @@ PP(pp_uc)
 	    STRLEN u;
 	    STRLEN ulen;
 	    UV uv;
-	    if (in_iota_subscript && ! _is_utf8_mark(s)) {
+	    if (UNLIKELY(in_iota_subscript)) {
+                UV cp = utf8_to_uvchr_buf(s, send, NULL);
+
+                if (! _invlist_contains_cp(PL_utf8_mark, cp)) {
 
-		/* A non-mark.  Time to output the iota subscript */
-		Copy(GREEK_CAPITAL_LETTER_IOTA_UTF8, d, capital_iota_len, U8);
-                d += capital_iota_len;
-		in_iota_subscript = FALSE;
+                    /* A non-mark.  Time to output the iota subscript */
+                    *d++ = UTF8_TWO_BYTE_HI(GREEK_CAPITAL_LETTER_IOTA);
+                    *d++ = UTF8_TWO_BYTE_LO(GREEK_CAPITAL_LETTER_IOTA);
+                    in_iota_subscript = FALSE;
+                }
             }
 
             /* Then handle the current character.  Get the changed case value
@@ -4051,8 +4061,6 @@ PP(pp_uc)
 #else
             uv = _toUPPER_utf8_flags(s, send, tmpbuf, &ulen, 0);
 #endif
-#define GREEK_CAPITAL_LETTER_IOTA 0x0399
-#define COMBINING_GREEK_YPOGEGRAMMENI 0x0345
             if (uv == GREEK_CAPITAL_LETTER_IOTA
                 && utf8_to_uvchr_buf(s, send, 0) == COMBINING_GREEK_YPOGEGRAMMENI)
             {
@@ -4066,9 +4074,10 @@ PP(pp_uc)
 
                     /* If someone uppercases one million U+03B0s we SvGROW()
                      * one million times.  Or we could try guessing how much to
-                     * allocate without allocating too much.  Such is life.
-                     * See corresponding comment in lc code for another option
-                     * */
+                     * allocate without allocating too much.  But we can't
+                     * really guess without examining the rest of the string.
+                     * Such is life.  See corresponding comment in lc code for
+                     * another option */
                     d = o + (U8*) SvGROW(dest, min);
                 }
                 Copy(tmpbuf, d, ulen, U8);
@@ -4077,8 +4086,8 @@ PP(pp_uc)
             s += u;
 	}
 	if (in_iota_subscript) {
-            Copy(GREEK_CAPITAL_LETTER_IOTA_UTF8, d, capital_iota_len, U8);
-            d += capital_iota_len;
+            *d++ = UTF8_TWO_BYTE_HI(GREEK_CAPITAL_LETTER_IOTA);
+            *d++ = UTF8_TWO_BYTE_LO(GREEK_CAPITAL_LETTER_IOTA);
 	}
 	SvUTF8_on(dest);
 	*d = '\0';
@@ -4112,6 +4121,8 @@ PP(pp_uc)
           do_uni_rules:
 #endif
 		for (; s < send; d++, s++) {
+                    Size_t extra;
+
 		    *d = toUPPER_LATIN1_MOD(*s);
 		    if (LIKELY(*d != LATIN_SMALL_LETTER_Y_WITH_DIAERESIS)) {
                         continue;
@@ -4130,7 +4141,7 @@ PP(pp_uc)
 
 			/* uc() of this requires 2 characters, but they are
 			 * ASCII.  If not enough room, grow the string */
-			if (SvLEN(dest) < ++min) {	
+			if (SvLEN(dest) < ++min) {
 			    const UV o = d - (U8*)SvPVX_const(dest);
 			    d = o + (U8*) SvGROW(dest, min);
 			}
@@ -4141,48 +4152,54 @@ PP(pp_uc)
 
 		    /* The other two special handling characters have their
 		     * upper cases outside the latin1 range, hence need to be
-		     * in UTF-8, so the whole result needs to be in UTF-8.  So,
-		     * here we are somewhere in the middle of processing a
-		     * non-UTF-8 string, and realize that we will have to convert
-		     * the whole thing to UTF-8.  What to do?  There are
-		     * several possibilities.  The simplest to code is to
-		     * convert what we have so far, set a flag, and continue on
-		     * in the loop.  The flag would be tested each time through
-		     * the loop, and if set, the next character would be
-		     * converted to UTF-8 and stored.  But, I (khw) didn't want
-		     * to slow down the mainstream case at all for this fairly
-		     * rare case, so I didn't want to add a test that didn't
-		     * absolutely have to be there in the loop, besides the
-		     * possibility that it would get too complicated for
-		     * optimizers to deal with.  Another possibility is to just
-		     * give up, convert the source to UTF-8, and restart the
-		     * function that way.  Another possibility is to convert
-		     * both what has already been processed and what is yet to
-		     * come separately to UTF-8, then jump into the loop that
-		     * handles UTF-8.  But the most efficient time-wise of the
-		     * ones I could think of is what follows, and turned out to
-		     * not require much extra code.  */
-
-		    /* Convert what we have so far into UTF-8, telling the
+		     * in UTF-8, so the whole result needs to be in UTF-8.
+                     *
+                     * So, here we are somewhere in the middle of processing a
+                     * non-UTF-8 string, and realize that we will have to
+                     * convert the whole thing to UTF-8.  What to do?  There
+                     * are several possibilities.  The simplest to code is to
+                     * convert what we have so far, set a flag, and continue on
+                     * in the loop.  The flag would be tested each time through
+                     * the loop, and if set, the next character would be
+                     * converted to UTF-8 and stored.  But, I (khw) didn't want
+                     * to slow down the mainstream case at all for this fairly
+                     * rare case, so I didn't want to add a test that didn't
+                     * absolutely have to be there in the loop, besides the
+                     * possibility that it would get too complicated for
+                     * optimizers to deal with.  Another possibility is to just
+                     * give up, convert the source to UTF-8, and restart the
+                     * function that way.  Another possibility is to convert
+                     * both what has already been processed and what is yet to
+                     * come separately to UTF-8, then jump into the loop that
+                     * handles UTF-8.  But the most efficient time-wise of the
+                     * ones I could think of is what follows, and turned out to
+                     * not require much extra code.
+                     *
+                     * First, calculate the extra space needed for the
+                     * remainder of the source needing to be in UTF-8.  The
+                     * uppercase of a character below 256 occupies the same
+                     * number of bytes as the original.  Therefore, the space
+                     * needed is the that number plus the number of characters
+                     * that become two bytes when converted to UTF-8. */
+
+                    extra = send - s + variant_under_utf8_count(s, send);
+
+                    /* Convert what we have so far into UTF-8, telling the
 		     * function that we know it should be converted, and to
 		     * allow extra space for what we haven't processed yet.
-		     * Assume the worst case space requirements for converting
-		     * what we haven't processed so far: that it will require
-		     * two bytes for each remaining source character, plus the
-		     * NUL at the end.  This may cause the string pointer to
-		     * move, so re-find it. */
+                     *
+                     * This may cause the string pointer to move, so need to
+                     * save and re-find it. */
 
 		    len = d - (U8*)SvPVX_const(dest);
 		    SvCUR_set(dest, len);
 		    len = sv_utf8_upgrade_flags_grow(dest,
 						SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
-						(send -s) * 2 + 1);
+						extra);
 		    d = (U8*)SvPVX(dest) + len;
 
-		    /* Now process the remainder of the source, converting to
-		     * upper and UTF-8.  If a resulting byte is invariant in
-		     * UTF-8, output it as-is, otherwise convert to UTF-8 and
-		     * append it to the output. */
+                    /* Now process the remainder of the source, simultaneously
+                     * converting to upper and UTF-8. */
 		    for (; s < send; s++) {
 			(void) _to_upper_title_latin1(*s, d, &len, 'S');
 			d += len;
@@ -4270,13 +4287,15 @@ PP(pp_lc)
 	    STRLEN ulen;
 
 #ifdef USE_LOCALE_CTYPE
+
 	    _toLOWER_utf8_flags(s, send, tmpbuf, &ulen, IN_LC_RUNTIME(LC_CTYPE));
 #else
 	    _toLOWER_utf8_flags(s, send, tmpbuf, &ulen, 0);
 #endif
 
-	    /* Here is where we would do context-sensitive actions.  See the
-	     * commit message for 86510fb15 for why there isn't any */
+            /* Here is where we would do context-sensitive actions for the
+             * Greek final sigma.  See the commit message for 86510fb15 for why
+             * there isn't any */
 
 	    if (ulen > u && (SvLEN(dest) < (min += ulen - u))) {
 
@@ -4372,7 +4391,7 @@ PP(pp_quotemeta)
 #ifdef USE_LOCALE_CTYPE
 		    /* In locale, we quote all non-ASCII Latin1 chars.
 		     * Otherwise use the quoting rules */
-		    
+
 		    IN_LC_RUNTIME(LC_CTYPE)
 			||
 #endif
@@ -4520,52 +4539,57 @@ PP(pp_fc)
 #ifdef USE_LOCALE_CTYPE
       do_uni_folding:
 #endif
-            /* For ASCII and the Latin-1 range, there's only two troublesome
-             * folds, \x{DF} (\N{LATIN SMALL LETTER SHARP S}), which under full
-             * casefolding becomes 'ss'; and \x{B5} (\N{MICRO SIGN}), which
-             * under any fold becomes \x{3BC} (\N{GREEK SMALL LETTER MU}) --
+            /* For ASCII and the Latin-1 range, there's two
+             * troublesome folds:
+             *      \x{DF} (\N{LATIN SMALL LETTER SHARP S}), which under full
+             *             casefolding becomes 'ss';
+             *      \x{B5} (\N{MICRO SIGN}), which under any fold becomes
+             *             \x{3BC} (\N{GREEK SMALL LETTER MU})
              * For the rest, the casefold is their lowercase.  */
             for (; s < send; d++, s++) {
                 if (*s == MICRO_SIGN) {
+                    Size_t extra = send - s
+                                 + variant_under_utf8_count(s, send);
+
                     /* \N{MICRO SIGN}'s casefold is \N{GREEK SMALL LETTER MU},
                      * which is outside of the latin-1 range. There's a couple
                      * of ways to deal with this -- khw discusses them in
                      * pp_lc/uc, so go there :) What we do here is upgrade what
                      * we had already casefolded, then enter an inner loop that
-                     * appends the rest of the characters as UTF-8. */
+                     * appends the rest of the characters as UTF-8.
+                     *
+                     * First we calculate the needed size of the upgraded dest
+                     * beyond what's been processed already (the upgrade
+                     * function figures that out).  In UTF-8 strings, the fold case of a
+                     * character below 256 occupies the same number of bytes as
+                     * the original (even the Sharp S).  Therefore, the space
+                     * needed is the number of bytes remaining plus the number
+                     * of characters that become two bytes when converted to
+                     * UTF-8. */
+
+                    /* Growing may move things, so have to save and recalculate
+                     * 'd' */
                     len = d - (U8*)SvPVX_const(dest);
                     SvCUR_set(dest, len);
                     len = sv_utf8_upgrade_flags_grow(dest,
                                                 SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
-						/* The max expansion for latin1
-						 * chars is 1 byte becomes 2 */
-                                                (send -s) * 2 + 1);
+                                                extra);
                     d = (U8*)SvPVX(dest) + len;
 
-                    Copy(GREEK_SMALL_LETTER_MU_UTF8, d, small_mu_len, U8);
-                    d += small_mu_len;
+                    *d++ = UTF8_TWO_BYTE_HI(GREEK_SMALL_LETTER_MU);
+                    *d++ = UTF8_TWO_BYTE_LO(GREEK_SMALL_LETTER_MU);
                     s++;
+
                     for (; s < send; s++) {
                         STRLEN ulen;
-                        UV fc = _to_uni_fold_flags(*s, tmpbuf, &ulen, flags);
-                        if UVCHR_IS_INVARIANT(fc) {
-                            if (full_folding
-                                && *s == LATIN_SMALL_LETTER_SHARP_S)
-                            {
-                                *d++ = 's';
-                                *d++ = 's';
-                            }
-                            else
-                                *d++ = (U8)fc;
-                        }
-                        else {
-                            Copy(tmpbuf, d, ulen, U8);
-                            d += ulen;
-                        }
+                        _to_uni_fold_flags(*s, d, &ulen, flags);
+                        d += ulen;
                     }
                     break;
                 }
-                else if (full_folding && *s == LATIN_SMALL_LETTER_SHARP_S) {
+                else if (   UNLIKELY(*s == LATIN_SMALL_LETTER_SHARP_S)
+                         && full_folding)
+                {
                     /* Under full casefolding, LATIN SMALL LETTER SHARP S
                      * becomes "ss", which may require growing the SV. */
                     if (SvLEN(dest) < ++min) {
@@ -4575,8 +4599,7 @@ PP(pp_fc)
                     *(d)++ = 's';
                     *d = 's';
                 }
-                else { /* If it's not one of those two, the fold is their lower
-                          case */
+                else { /* Else, the fold is the lower case */
                     *d = toLOWER_LATIN1(*s);
                 }
              }
@@ -5387,7 +5410,7 @@ PP(pp_splice)
 	i = -diff;
 	while (i)
 	    dst[--i] = NULL;
-	
+
 	if (newlen) {
  	    Copy( tmparyval, AvARRAY(ary) + offset, newlen, SV* );
 	    Safefree(tmparyval);
@@ -5838,7 +5861,7 @@ PP(pp_split)
             } else {
                 while (m < strend && !isSPACE(*m))
                     ++m;
-            }  
+            }
 	    if (m >= strend)
 		break;
 
@@ -5876,7 +5899,7 @@ PP(pp_split)
             } else {
                 while (s < strend && isSPACE(*s))
                     ++s;
-            } 	    
+            }
 	}
     }
     else if (RX_EXTFLAGS(rx) & RXf_START_ONLY) {
@@ -6560,7 +6583,7 @@ PP(pp_lvref)
         }
       }
       else if (arg) {
-	S_localise_gv_slot(aTHX_ (GV *)arg, 
+	S_localise_gv_slot(aTHX_ (GV *)arg,
 				 PL_op->op_private & OPpLVREF_TYPE);
       }
       else if (!(PL_op->op_private & OPpPAD_STATE))
@@ -6643,7 +6666,7 @@ PP(pp_anonconst)
  *  for $:   (OPf_STACKED ? *sp : $_[N])
  *  for @/%: @_[N..$#_]
  *
- * It's equivalent to 
+ * It's equivalent to
  *    my $foo = $_[N];
  * or
  *    my $foo = (value-on-stack)
diff --git a/proto.h b/proto.h
index daf338707b..ba5623d4a2 100644
--- a/proto.h
+++ b/proto.h
@@ -5650,7 +5650,7 @@ PERL_CALLCONV void	Perl_regprop(pTHX_ const regexp *prog, SV* sv, const regnode*
 #define PERL_ARGS_ASSERT_REGPROP	\
 	assert(sv); assert(o)
 #endif
-#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C)
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 PERL_CALLCONV SV*	Perl__get_swash_invlist(pTHX_ SV* const swash)
 			__attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT__GET_SWASH_INVLIST	\
diff --git a/regcomp.c b/regcomp.c
index 58cb941b06..493729256a 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -18815,7 +18815,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
             }
         }
 
-        if (! posixl) {
+        if (! (anyof_flags & ANYOF_LOCALE_FLAGS)) {
             PERL_UINT_FAST8_T type;
             SV * intersection = NULL;
             SV* d_invlist = NULL;
diff --git a/regen/unicode_constants.pl b/regen/unicode_constants.pl
index 3bddd90ff8..c3fa70a6e6 100644
--- a/regen/unicode_constants.pl
+++ b/regen/unicode_constants.pl
@@ -240,9 +240,7 @@ __DATA__
 U+017F string
 
 U+0300 string
-
-U+0399 string
-U+03BC string
+U+0307 string
 
 U+1E9E string_skip_if_undef
 
diff --git a/t/loc_tools.pl b/t/loc_tools.pl
index 7afb7bacf6..5a4379f225 100644
--- a/t/loc_tools.pl
+++ b/t/loc_tools.pl
@@ -501,8 +501,8 @@ sub is_locale_utf8 ($) { # Return a boolean as to if core Perl thinks the input
     return $ret;
 }
 
-sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl
-                                  # thinks is a UTF-8 LC_CTYPE locale.
+sub find_utf8_ctype_locales (;$) { # Return the names of the locales that core
+                                  # Perl thinks are UTF-8 LC_CTYPE locales.
                                   # Optional parameter is a reference to a
                                   # list of locales to try; if omitted, this
                                   # tries all locales it can find on the
@@ -510,6 +510,7 @@ sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl
     return unless locales_enabled('LC_CTYPE');
 
     my $locales_ref = shift;
+    my @return;
 
     if (! defined $locales_ref) {
 
@@ -518,9 +519,26 @@ sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl
     }
 
     foreach my $locale (@$locales_ref) {
-        return $locale if is_locale_utf8($locale);
+        push @return, $locale if is_locale_utf8($locale);
     }
 
+    return @return;
+}
+
+
+sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl
+                                  # thinks is a UTF-8 LC_CTYPE
+                                  # locale.
+                                  # Optional parameter is a reference to a
+                                  # list of locales to try; if omitted, this
+                                  # tries all locales it can find on the
+                                  # platform
+    my $try_locales_ref = shift;
+
+    my @utf8_locales = find_utf8_ctype_locales($try_locales_ref);
+
+    return $utf8_locales[0] if @utf8_locales;
+
     return;
 }
 
diff --git a/t/op/lc.t b/t/op/lc.t
index 2ce65ac73c..60b966ff9f 100644
--- a/t/op/lc.t
+++ b/t/op/lc.t
@@ -1,4 +1,5 @@
 #!./perl
+use strict;
 
 # This file is intentionally encoded in latin-1.
 #
@@ -164,9 +165,10 @@ is(uc("\x{1C5}") , "\x{1C4}",      "U+01C5 uc is U+01C4");
 is(uc("\x{1C6}") , "\x{1C4}",      "U+01C6 uc is U+01C4, too");
 
 # #18107: A host of bugs involving [ul]c{,first}. AMS 20021106
-$a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA.
-$b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA.
+my $a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA.
+my $b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA.
 
+my $c;
 ($c = $b) =~ s/(\w+)/lc($1)/ge;
 is($c , $a, "Using s///e to change case.");
 
@@ -310,6 +312,7 @@ constantfolding
 
 # In-place lc/uc should not corrupt string buffers when given a non-utf8-
 # flagged thingy that stringifies to utf8
+my %h;
 $h{k} = bless[], "\x{3b0}\x{3b0}\x{3b0}bcde"; # U+03B0 grows with uc()
    # using delete marks it as TEMP, so uc-in-place is permitted
 like uc delete $h{k}, qr "^(?:\x{3a5}\x{308}\x{301}){3}BCDE=ARRAY\(.*\)",
diff --git a/t/re/anyof.t b/t/re/anyof.t
index ad0a2d9ada..32e0bae9ad 100644
--- a/t/re/anyof.t
+++ b/t/re/anyof.t
@@ -462,6 +462,7 @@ my @tests = (
     '(?l:[\x{212A}])' => 'ANYOFL[212A]',
     '(?l:[\s\x{212A}])' => 'ANYOFPOSIXL[\s][1680 2000-200A 2028-2029 202F 205F 212A 3000]',
     '(?l:[^\S\x{202F}])' => 'ANYOFPOSIXL[^\\S][1680 2000-200A 2028-2029 205F 3000]',
+    '(?li:[a-z])' => 'ANYOFL{i}[a-z{utf8 locale}A-Z\x{017F}\x{212A}]',
 
     '\p{All}' => 'SANY',
     '\P{All}' => 'OPFAIL',
diff --git a/t/re/fold_grind.pl b/t/re/fold_grind.pl
index 4082bf7e32..fa775da910 100644
--- a/t/re/fold_grind.pl
+++ b/t/re/fold_grind.pl
@@ -667,7 +667,11 @@ foreach my $test (sort { numerically } keys %tests) {
           next if $pattern_above_latin1 && ! $utf8_pattern;
 
           # Our testing of 'l' uses the POSIX locale, which is ASCII-only
-          my $uni_semantics = $charset ne 'l' && ($utf8_target || $charset eq 'u' || $charset eq 'L' || ($charset eq 'd' && $utf8_pattern) || $charset =~ /a/);
+          my $uni_semantics = $charset ne 'l' && (    $utf8_target
+                                                  ||  $charset eq 'u'
+                                                  ||  $charset eq 'L'
+                                                  || ($charset eq 'd' && $utf8_pattern)
+                                                  ||  $charset =~ /a/);
           my $upgrade_pattern = "";
           $upgrade_pattern = ' utf8::upgrade($p);' if ! $pattern_above_latin1 && $utf8_pattern;
 
diff --git a/unicode_constants.h b/unicode_constants.h
index d5a410fc48..b44fed5ae9 100644
--- a/unicode_constants.h
+++ b/unicode_constants.h
@@ -54,9 +54,7 @@ bytes.
 #   define LATIN_SMALL_LETTER_LONG_S_UTF8  "\xC5\xBF"    /* U+017F */
 
 #   define COMBINING_GRAVE_ACCENT_UTF8  "\xCC\x80"    /* U+0300 */
-
-#   define GREEK_CAPITAL_LETTER_IOTA_UTF8  "\xCE\x99"    /* U+0399 */
-#   define GREEK_SMALL_LETTER_MU_UTF8  "\xCE\xBC"    /* U+03BC */
+#   define COMBINING_DOT_ABOVE_UTF8  "\xCC\x87"    /* U+0307 */
 
 #   define LATIN_CAPITAL_LETTER_SHARP_S_UTF8  "\xE1\xBA\x9E"    /* U+1E9E */
 
@@ -99,9 +97,7 @@ bytes.
 #   define LATIN_SMALL_LETTER_LONG_S_UTF8  "\x8F\x73"    /* U+017F */
 
 #   define COMBINING_GRAVE_ACCENT_UTF8  "\xAF\x41"    /* U+0300 */
-
-#   define GREEK_CAPITAL_LETTER_IOTA_UTF8  "\xB3\x68"    /* U+0399 */
-#   define GREEK_SMALL_LETTER_MU_UTF8  "\xB4\x70"    /* U+03BC */
+#   define COMBINING_DOT_ABOVE_UTF8  "\xAF\x48"    /* U+0307 */
 
 #   define LATIN_CAPITAL_LETTER_SHARP_S_UTF8  "\xBF\x63\x72"    /* U+1E9E */
 
@@ -144,9 +140,7 @@ bytes.
 #   define LATIN_SMALL_LETTER_LONG_S_UTF8  "\x8E\x72"    /* U+017F */
 
 #   define COMBINING_GRAVE_ACCENT_UTF8  "\xAD\x41"    /* U+0300 */
-
-#   define GREEK_CAPITAL_LETTER_IOTA_UTF8  "\xB2\x67"    /* U+0399 */
-#   define GREEK_SMALL_LETTER_MU_UTF8  "\xB3\x6A"    /* U+03BC */
+#   define COMBINING_DOT_ABOVE_UTF8  "\xAD\x48"    /* U+0307 */
 
 #   define LATIN_CAPITAL_LETTER_SHARP_S_UTF8  "\xBF\x62\x71"    /* U+1E9E */
 

-- 
Perl5 Master Repository



nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About