develooper Front page | perl.perl5.changes | Postings from November 2010

[perl.git] branch blead, updated. v5.13.7-164-gfad448f

From:
Father Chrysostomos
Date:
November 28, 2010 04:50
Subject:
[perl.git] branch blead, updated. v5.13.7-164-gfad448f
Message ID:
E1PMghS-0001Nl-KW@camel.ams6.corp.booking.com
In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/fad448f483699c785bb136aef39370fc62659f63?hp=e3ef43a5d915565241c254f7c7be154729e81162>

- Log -----------------------------------------------------------------
commit fad448f483699c785bb136aef39370fc62659f63
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 22:14:05 2010 -0700

    reg_fold.t: Add tests for simple Latin1 folds
    
    fold_grind.t does a comprehensive series of tests, but it doesn't test
    most characters, just a representative sample.  Add tests to reg_fold.t
    to verify that the basic mapping tables work.

M	t/re/reg_fold.t

commit 8bfc9fab7a860a6f36ba9cf5775404466dde58be
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 22:12:13 2010 -0700

    reg_fold.t: Disable single char fold testing
    
    This is now more comprehensively done by fold_grind.t

M	t/re/reg_fold.t

commit 1ef17b7294e4839ff6f8051fb15eaa566aed4dd1
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 22:10:35 2010 -0700

    fold_grind.t: Enable EXACTish tests

M	t/re/fold_grind.t

commit 883cce4a0d59311837bcaf9dd6a2ba7be9b70cbe
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 22:08:57 2010 -0700

    test.pl: Add native to uni ord fcns

M	t/test.pl

commit 970c8436e45f2ba7c40e23a63f735480772f88fe
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 21:16:51 2010 -0700

    regcomp.c: Handle EXACTFU nodes in optimizer
    
    This patch also changes the optimizer to include the other member of a
    fold pair in the bitmap.  Thus if 'b' is set under /i, so will 'B', and
    vice versa.

M	regcomp.c

commit 9ce2357ee574b9377c898a552cb981a078f1722d
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 20:51:54 2010 -0700

    regcomp.c: Use hex instead of octal for debug ords
    
    The ordinals that are output in the debugging output have been in octal,
    which is ok for the low controls, but for above Latin1, the standard is
    hex, so this changes them all to correspond.  If desired the low
    controls could be changed back to be in octal.

M	regcomp.c

commit f5ecd18d6d228e264c76bed78eb035ba36b01b40
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 20:49:49 2010 -0700

    Fix debug output
    
    The 'outside bitmap' message isn't orthogonal to the others, it is
    independent.

M	regcomp.c

commit 62012aee33c9aaffb325cdf706d389d952eb13ad
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 20:45:33 2010 -0700

    regcomp.c: Typo in comment

M	regcomp.c

commit 2c2b7f86ebc8d80b481174d93d0dc74515f6f2a5
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 20:37:05 2010 -0700

    regcomp.c: Generate EXACTFU nodes

M	regcomp.c

commit e5fbd0ffcd6dd4bd246d7aed5260f5a8d8ee5727
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 19:40:58 2010 -0700

    regcomp.c: remove unnecessary tests
    
    The tests in the else are unnecessary as they comprise everything else
    but what the 'if' says.

M	regcomp.c

commit 9a5a5549bccf236f311169910ca2634a80483a88
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 15:26:31 2010 -0700

    regexec.c: Add handling for EXACTFU nodes
    
    A later commit will cause these nodes to be generated.
    
    This commit changes how to find the handling of the various nodes to
    switch statements, hopefully for efficiency.

M	regexec.c

commit 873813865d3bbe2cabafb831683ea56920e540d0
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 15:19:31 2010 -0700

    regexec.c: pull array lookup out of loop

M	regexec.c

commit 9dcbe121586cbfc2982c3fef4c17842e23343f68
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 15:16:47 2010 -0700

    utf8.h: Add #define for Greek small MU

M	utf8.h

commit 2be3e1907e6f6fea92dc9617cb0d21982ebfbe7f
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 15:08:34 2010 -0700

    regexec.c: clarify comments

M	regexec.c

commit 35f7ba49a4b293d62e73b530a7ea878959ba2ab2
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 10:57:01 2010 -0700

    regcomp.sym: Add EXACTFU regnode
    
    This node will be used for matching case insensitive exactish nodes
    using Unicode semantics

M	regcomp.sym
M	regnodes.h

commit e2e755386e52b4bdb22a5c9618390859ed5f7323
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 10:26:01 2010 -0700

    regexec.c: Latin1 chars can fold match UTF8_ALL
    
    Some ANYOF regnodes have the ANYOF_UNICODE_ALL flag set, which means
    they match any non-Latin1 character.  These should match /i (in a utf8
    target string) any ASCII or Latin1 character that folds outside the
    Latin1 range
    
    As part of this patch, an internal only macro is renamed to account for its
    new use in regexec.c.  The cumbersome name is to ward off others from
    using it until the final semantics have been settled on.

M	handy.h
M	regcomp.c
M	regexec.c

commit 35bae5983226710c4c6eb1565b5e105b0599a820
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 10:10:52 2010 -0700

    fold_grind.t: Make unicode semantics pattern differently
    
    This changes to use 'use re "/u"' to specify that a pattern is supposed
    to match with unicode semantics, instead of upgrading it to utf8.  The
    variable name changes accordingly

M	t/re/fold_grind.t

commit 1b9f127b4dff87fa8c6be8acf741af73f72fea81
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sat Nov 27 09:42:48 2010 -0700

    Add Perl_foldEQ_latin1()
    
    This function compares two non-utf8 strings to see if they are
    equivalent without regards to case differences.  It does not work nor
    check for  three problematic code points that require special handling:
    MICRO_SIGN, LATIN_SMALL_LETTER_SHARP_S, and
    LATIN_SMALL_LETTER_Y_WITH_DIAERESIS.
    
    make regen required

M	embed.fnc
M	embed.h
M	global.sym
M	proto.h
M	util.c
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc         |    1 +
 embed.h           |    1 +
 global.sym        |    1 +
 handy.h           |    2 +-
 proto.h           |    7 +++
 regcomp.c         |   74 +++++++++++++++++++++------
 regcomp.sym       |    1 +
 regexec.c         |  145 +++++++++++++++++++++++++++++++++-------------------
 regnodes.h        |   17 ++++--
 t/re/fold_grind.t |   14 +++---
 t/re/reg_fold.t   |   49 ++++++++++++++++++
 t/test.pl         |   13 +++++
 utf8.h            |    1 +
 util.c            |   21 ++++++++
 14 files changed, 264 insertions(+), 83 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 9effd6b..fe8f43c 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -517,6 +517,7 @@ Am	|I32	|ibcmp_utf8	|NN const char *s1|NULLOK char **pe1|UV l1 \
 Apd	|I32	|foldEQ_utf8	|NN const char *s1|NULLOK char **pe1|UV l1 \
 				|bool u1|NN const char *s2|NULLOK char **pe2 \
 				|UV l2|bool u2
+AnpP	|I32	|foldEQ_latin1	|NN const char* a|NN const char* b|I32 len
 #if defined(PERL_IN_DOIO_C)
 sR	|bool	|ingroup	|Gid_t testgid|bool effective
 #endif
diff --git a/embed.h b/embed.h
index 441b6a5..d484a10 100644
--- a/embed.h
+++ b/embed.h
@@ -132,6 +132,7 @@
 #define find_rundefsv()		Perl_find_rundefsv(aTHX)
 #define find_rundefsvoffset()	Perl_find_rundefsvoffset(aTHX)
 #define foldEQ			Perl_foldEQ
+#define foldEQ_latin1		Perl_foldEQ_latin1
 #define foldEQ_locale		Perl_foldEQ_locale
 #define foldEQ_utf8(a,b,c,d,e,f,g,h)	Perl_foldEQ_utf8(aTHX_ a,b,c,d,e,f,g,h)
 #ifndef PERL_IMPLICIT_CONTEXT
diff --git a/global.sym b/global.sym
index 007ed52..7e8f38b 100644
--- a/global.sym
+++ b/global.sym
@@ -128,6 +128,7 @@ Perl_find_runcv
 Perl_find_rundefsv
 Perl_find_rundefsvoffset
 Perl_foldEQ
+Perl_foldEQ_latin1
 Perl_foldEQ_locale
 Perl_foldEQ_utf8
 Perl_form
diff --git a/handy.h b/handy.h
index 391156a..216d0ea 100644
--- a/handy.h
+++ b/handy.h
@@ -609,7 +609,7 @@ patched there.  The file as of this writing is cpan/Devel-PPPort/parts/inc/misc
 #   define isUPPER_A(c)  cBOOL(FITS_IN_8_BITS(c) && (PL_charclass[(U8) NATIVE_TO_UNI(c)] & _CC_UPPER_A))
 #   define isWORDCHAR_A(c) cBOOL(FITS_IN_8_BITS(c) && (PL_charclass[(U8) NATIVE_TO_UNI(c)] & _CC_WORDCHAR_A))
 #   define isXDIGIT_A(c)  cBOOL(FITS_IN_8_BITS(c) && (PL_charclass[(U8) NATIVE_TO_UNI(c)] & _CC_XDIGIT_A))
-#   define _NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C(c) ((! cBOOL(FITS_IN_8_BITS(c))) || (PL_charclass[(U8) NATIVE_TO_UNI(c)] & _CC_NONLATIN1_FOLD))
+#   define _HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(c) ((! cBOOL(FITS_IN_8_BITS(c))) || (PL_charclass[(U8) NATIVE_TO_UNI(c)] & _CC_NONLATIN1_FOLD))
 #else   /* No perl.h. */
 #   define isOCTAL_A(c)  ((c) >= '0' && (c) <= '9')
 #   ifdef EBCDIC
diff --git a/proto.h b/proto.h
index 096e84f..a05f2b9 100644
--- a/proto.h
+++ b/proto.h
@@ -968,6 +968,13 @@ PERL_CALLCONV I32	Perl_foldEQ(const char* a, const char* b, I32 len)
 #define PERL_ARGS_ASSERT_FOLDEQ	\
 	assert(a); assert(b)
 
+PERL_CALLCONV I32	Perl_foldEQ_latin1(const char* a, const char* b, I32 len)
+			__attribute__pure__
+			__attribute__nonnull__(1)
+			__attribute__nonnull__(2);
+#define PERL_ARGS_ASSERT_FOLDEQ_LATIN1	\
+	assert(a); assert(b)
+
 PERL_CALLCONV I32	Perl_foldEQ_locale(const char* a, const char* b, I32 len)
 			__attribute__pure__
 			__attribute__nonnull__(1)
diff --git a/regcomp.c b/regcomp.c
index 07834a0..4092d79 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -1475,8 +1475,8 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch, regnode *firs
                     TRIE_STORE_REVCHAR;
                 }
                 if ( set_bit ) {
-                    /* store the codepoint in the bitmap, and if its ascii
-                       also store its folded equivelent. */
+		    /* store the codepoint in the bitmap, and its folded
+		     * equivalent. */
                     TRIE_BITMAP_SET(trie,uvc);
 
 		    /* store the folded codepoint */
@@ -2451,8 +2451,11 @@ S_join_exact(pTHX_ RExC_state_t *pRExC_state, regnode *scan, I32 *min, U32 flags
 	}
 #endif
     }
-    
-    if (UTF && ( OP(scan) == EXACTF ) && ( STR_LEN(scan) >= 6 ) ) {
+
+    if (UTF
+	&& ( OP(scan) == EXACTF || OP(scan) == EXACTFU)
+	&& ( STR_LEN(scan) >= 6 ) )
+    {
     /*
     Two problematic code points in Unicode casefolding of EXACT nodes:
     
@@ -3074,7 +3077,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		    (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
 		    && !ANYOF_BITMAP_TEST(data->start_class, uc)
 		    && (!(data->start_class->flags & ANYOF_FOLD)
-			|| !ANYOF_BITMAP_TEST(data->start_class, PL_fold[uc])))
+			|| !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
                     )
 		    compat = 0;
 		ANYOF_CLASS_ZERO(data->start_class);
@@ -3120,7 +3123,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		if (uc >= 0x100 ||
 		    (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
 		    && !ANYOF_BITMAP_TEST(data->start_class, uc)
-		     && !ANYOF_BITMAP_TEST(data->start_class, PL_fold[uc])))
+		     && !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
 		    compat = 0;
 		ANYOF_CLASS_ZERO(data->start_class);
 		ANYOF_BITMAP_ZERO(data->start_class);
@@ -3128,16 +3131,38 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		    ANYOF_BITMAP_SET(data->start_class, uc);
 		    data->start_class->flags &= ~ANYOF_EOS;
 		    data->start_class->flags |= ANYOF_FOLD;
-		    if (OP(scan) == EXACTFL)
+		    if (OP(scan) == EXACTFL) {
 			data->start_class->flags |= ANYOF_LOCALE;
+		    }
+		    else {
+
+			/* Also set the other member of the fold pair.  Can't
+			 * do this for locale, because not known until runtime
+			 */
+			ANYOF_BITMAP_SET(data->start_class,
+					 (OP(scan) == EXACTFU)
+						    ? PL_fold_latin1[uc]
+						    : PL_fold[uc]);
+		    }
 		}
 	    }
 	    else if (flags & SCF_DO_STCLASS_OR) {
 		if (data->start_class->flags & ANYOF_FOLD) {
 		    /* false positive possible if the class is case-folded.
 		       Assume that the locale settings are the same... */
-		    if (uc < 0x100)
+		    if (uc < 0x100) {
 			ANYOF_BITMAP_SET(data->start_class, uc);
+                        if (OP(scan) != EXACTFL) {
+
+                            /* And set the other member of the fold pair, but
+                             * can't do that in locale because not known until
+                             * run-time */
+                            ANYOF_BITMAP_SET(data->start_class,
+                                            (OP(scan) == EXACTFU)
+                                                        ? PL_fold_latin1[uc]
+                                                        : PL_fold[uc]);
+                        }
+		    }
 		    data->start_class->flags &= ~ANYOF_EOS;
 		}
 		cl_and(data->start_class, and_withp);
@@ -4724,7 +4749,7 @@ reStudy:
 	if (PL_regkind[OP(first)] == EXACT) {
 	    if (OP(first) == EXACT)
 		NOOP;	/* Empty, get anchored substr later. */
-	    else if ((OP(first) == EXACTF || OP(first) == EXACTFL))
+	    else
 		ri->regstclass = first;
 	}
 #ifdef TRIE_STCLASS	
@@ -6978,8 +7003,12 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 	char *endchar;	    /* Points to '.' or '}' ending cur char in the input
 			       stream */
 
-	ret = reg_node(pRExC_state,
-			(U8)(FOLD ? (LOC ? EXACTFL : EXACTF) : EXACT));
+	ret = reg_node(pRExC_state, (U8) ((! FOLD) ? EXACT
+						   : (LOC)
+						      ? EXACTFL
+						      : UNI_SEMANTICS
+						        ? EXACTFU
+						        : EXACTF));
 	s= STRING(ret);
 
 	/* Exact nodes can hold only a U8 length's of text = 255.  Loop through
@@ -7585,7 +7614,13 @@ tryagain:
 	defchar:
 	    ender = 0;
 	    ret = reg_node(pRExC_state,
-			   (U8)(FOLD ? (LOC ? EXACTFL : EXACTF) : EXACT));
+			   (U8) ((! FOLD) ? EXACT
+					  : (LOC)
+					     ? EXACTFL
+					     : (UNI_SEMANTICS)
+					       ? EXACTFU
+					       : EXACTF)
+		    );
 	    s = STRING(ret);
 	    for (len = 0, p = RExC_parse - 1;
 	      len < 127 && p < RExC_end;
@@ -8184,7 +8219,7 @@ S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, regnode* node, const U8
     if (UNI_SEMANTICS && value == LATIN_SMALL_LETTER_SHARP_S) {
 	ANYOF_FLAGS(node) |= ANYOF_NONBITMAP_NON_UTF8;
     }
-    else if (_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C(value)
+    else if (_HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(value)
 	     || (! UNI_SEMANTICS
                  && ! isASCII(value)
                  && PL_fold_latin1[value] != value))
@@ -9264,6 +9299,7 @@ S_regtail_study(pTHX_ RExC_state_t *pRExC_state, regnode *p, const regnode *val,
             switch (OP(scan)) {
                 case EXACT:
                 case EXACTF:
+                case EXACTFU:
                 case EXACTFL:
                         if( exact == PSEUDO )
                             exact= OP(scan);
@@ -9645,7 +9681,7 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o)
 	    sv_catpvs(sv, "{unicode_all}");
 	else if (flags & ANYOF_UTF8)
 	    sv_catpvs(sv, "{unicode}");
-	else if (flags & ANYOF_NONBITMAP)
+	if (flags & ANYOF_NONBITMAP_NON_UTF8)
 	    sv_catpvs(sv, "{outside bitmap}");
 
 	{
@@ -10351,8 +10387,14 @@ S_put_byte(pTHX_ SV *sv, int c)
        ones (binary 1111 1111, hexadecimal FF). It is similar, but not
        identical, to the ASCII delete (DEL) or rubout control character.
        ) So the old condition can be simplified to !isPRINT(c)  */
-    if (!isPRINT(c))
-	Perl_sv_catpvf(aTHX_ sv, "\\%o", c);
+    if (!isPRINT(c)) {
+	if (c < 256) {
+	    Perl_sv_catpvf(aTHX_ sv, "\\x%02x", c);
+	}
+	else {
+	    Perl_sv_catpvf(aTHX_ sv, "\\x{%x}", c);
+	}
+    }
     else {
 	const char string = c;
 	if (c == '-' || c == ']' || c == '\\' || c == '^')
diff --git a/regcomp.sym b/regcomp.sym
index 63e66e0..a85d33f 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -192,6 +192,7 @@ HORIZWS     HORIZWS,    none 0 S  ; horizontal whitespace       (Perl 6)
 NHORIZWS    NHORIZWS,   none 0 S  ; not horizontal whitespace   (Perl 6)
 
 FOLDCHAR    FOLDCHAR,   codepoint 1 ; codepoint with tricky case folding properties.
+EXACTFU     EXACT,      str	    ; Match this string, folded, Unicode semantics for non-utf8 (prec. by length).
 
 # NEW STUFF ABOVE THIS LINE  
 
diff --git a/regexec.c b/regexec.c
index 375d4fd..129b297 100644
--- a/regexec.c
+++ b/regexec.c
@@ -297,12 +297,13 @@
 /* Currently these are only used when PL_regkind[OP(rn)] == EXACT so
    we don't need this definition. */
 #define IS_TEXT(rn)   ( OP(rn)==EXACT   || OP(rn)==REF   || OP(rn)==NREF   )
-#define IS_TEXTF(rn)  ( OP(rn)==EXACTF  || OP(rn)==REFF  || OP(rn)==NREFF  )
+#define IS_TEXTF(rn)  ( (OP(rn)==EXACTFU ||  OP(rn)==EXACTF)  || OP(rn)==REFF  || OP(rn)==NREFF )
 #define IS_TEXTFL(rn) ( OP(rn)==EXACTFL || OP(rn)==REFFL || OP(rn)==NREFFL )
 
 #else
 /* ... so we use this as its faster. */
 #define IS_TEXT(rn)   ( OP(rn)==EXACT   )
+#define IS_TEXTFU(rn)  ( OP(rn)==EXACTFU )
 #define IS_TEXTF(rn)  ( OP(rn)==EXACTF  )
 #define IS_TEXTFL(rn) ( OP(rn)==EXACTFL )
 
@@ -1054,7 +1055,7 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV *sv, char *strpos,
 	   even for \b or \B.  But (minlen? 1 : 0) below assumes that
 	   regstclass does not come from lookahead...  */
 	/* If regstclass takes bytelength more than 1: If charlength==1, OK.
-	   This leaves EXACTF only, which is dealt with in find_byclass().  */
+	   This leaves EXACTF, EXACTFU only, which are dealt with in find_byclass().  */
         const U8* const str = (U8*)STRING(progi->regstclass);
         const int cl_l = (PL_regkind[OP(progi->regstclass)] == EXACT
 		    ? CHR_DIST(str+STR_LEN(progi->regstclass), str)
@@ -1244,12 +1245,18 @@ s += len
 
 #define REXEC_FBC_EXACTISH_SCAN(CoNd)                     \
 STMT_START {                                              \
+    I32 (*folder)();                                      \
+    switch (OP(c)) {                                      \
+	case EXACTFU: folder = foldEQ_latin1; break;      \
+	case EXACTFL: folder = foldEQ_locale; break;      \
+	case EXACTF:  folder = foldEQ; break;             \
+	default:                                          \
+	    Perl_croak(aTHX_ "panic: Unexpected op %u", OP(c)); \
+    }                                                     \
     while (s <= e) {                                      \
 	if ( (CoNd)                                       \
-	     && (ln == 1 || (OP(c) == EXACTF             \
-			      ? foldEQ(s, m, ln)           \
-			      : foldEQ_locale(s, m, ln)))  \
-	     && (!reginfo || regtry(reginfo, &s)) )        \
+	     && (ln == 1 || folder(s, m, ln))             \
+	     && (!reginfo || regtry(reginfo, &s)) )       \
 	    goto got_it;                                  \
 	s++;                                              \
     }                                                     \
@@ -1392,6 +1399,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
 		    tmp = doevery;
 	    );
 	    break;
+	case EXACTFU:
 	case EXACTF:
 	    m   = STRING(c);
 	    ln  = STR_LEN(c);	/* length to match in octets/bytes */
@@ -1431,7 +1439,18 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
 	    }
 	    else {
 		c1 = *(U8*)m;
-		c2 = PL_fold[c1];
+		if (utf8_target || OP(c) == EXACTFU) {
+
+		    /* Micro sign folds to GREEK SMALL LETTER MU;
+		       LATIN_SMALL_LETTER_SHARP_S folds to 'ss', and this sets
+		       c2 to the first 's' of the pair, and the code below will
+		       look for others */
+		    c2 = (c1 == MICRO_SIGN)
+			? GREEK_SMALL_LETTER_MU
+			: (c1 == LATIN_SMALL_LETTER_SHARP_S)
+			   ? 's'
+			   : PL_fold_latin1[c1];
+		} else c2 = PL_fold[c1];
 	    }
 	    goto do_exactf;
 	case EXACTFL:
@@ -3538,11 +3557,27 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
 	    nextchr = UCHARAT(locinput);
 	    break;
 	    }
-	case EXACTFL:
+	case EXACTFL: {
+	    I32 (*folder)();                                      \
+	    const U8 * fold_array;
+	    const char * s;
+
 	    PL_reg_flags |= RF_tainted;
-	    /* FALL THROUGH */
-	case EXACTF: {
-	    char * const s = STRING(scan);
+	    folder = foldEQ_locale;
+	    fold_array = PL_fold_locale;
+	    goto do_exactf;
+
+	case EXACTFU:
+	    folder = foldEQ_latin1;
+	    fold_array = PL_fold_latin1;
+	    goto do_exactf;
+
+	case EXACTF:
+	    folder = foldEQ;
+	    fold_array = PL_fold;
+
+	  do_exactf:
+	    s = STRING(scan);
 	    ln = STR_LEN(scan);
 
 	    if (utf8_target || UTF_PATTERN) {
@@ -3575,19 +3610,18 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
 
 	    /* Inline the first character, for speed. */
 	    if (UCHARAT(s) != nextchr &&
-		UCHARAT(s) != ((OP(scan) == EXACTF)
-			       ? PL_fold : PL_fold_locale)[nextchr])
+		UCHARAT(s) != fold_array[nextchr])
+	    {
 		sayNO;
+	    }
 	    if (PL_regeol - locinput < ln)
 		sayNO;
-	    if (ln > 1 && (OP(scan) == EXACTF
-			   ? ! foldEQ(s, locinput, ln)
-			   : ! foldEQ_locale(s, locinput, ln)))
+	    if (ln > 1 && ! folder(s, locinput, ln))
 		sayNO;
 	    locinput += ln;
 	    nextchr = UCHARAT(locinput);
 	    break;
-	    }
+	}
 	case BOUNDL:
 	case NBOUNDL:
 	    PL_reg_flags |= RF_tainted;
@@ -4850,12 +4884,12 @@ NULL
 		    {
 		        
 			ST.c1 = (U8)*STRING(text_node);
-			ST.c2 =
-			    (IS_TEXTF(text_node))
-			    ? PL_fold[ST.c1]
-			    : (IS_TEXTFL(text_node))
-				? PL_fold_locale[ST.c1]
-				: ST.c1;
+			switch (OP(text_node)) {
+			    case EXACTF: ST.c2 = PL_fold[ST.c1]; break;
+			    case EXACTFU: ST.c2 = PL_fold_latin1[ST.c1]; break;
+			    case EXACTFL: ST.c2 = PL_fold_locale[ST.c1]; break;
+			    default: ST.c2 = ST.c1;
+			}
 		    }
 		}
 	    }
@@ -5002,14 +5036,16 @@ NULL
                         if this changes back then the macro for IS_TEXT and 
                         friends need to change. */
 		    if (!UTF_PATTERN) {
-			ST.c2 = ST.c1 = *s;
-			if (IS_TEXTF(text_node))
-			    ST.c2 = PL_fold[ST.c1];
-			else if (IS_TEXTFL(text_node))
-			    ST.c2 = PL_fold_locale[ST.c1];
+			ST.c1 = *s;
+			switch (OP(text_node)) {
+			    case EXACTF: ST.c2 = PL_fold[ST.c1]; break;
+			    case EXACTFU: ST.c2 = PL_fold_latin1[ST.c1]; break;
+			    case EXACTFL: ST.c2 = PL_fold_locale[ST.c1]; break;
+			    default: ST.c2 = ST.c1; break;
+			}
 		    }
 		    else { /* UTF_PATTERN */
-			if (IS_TEXTF(text_node)) {
+			if (IS_TEXTFU(text_node) || IS_TEXTF(text_node)) {
 			     STRLEN ulen1, ulen2;
 			     U8 tmpbuf1[UTF8_MAXBYTES_CASE+1];
 			     U8 tmpbuf2[UTF8_MAXBYTES_CASE+1];
@@ -5802,8 +5838,10 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
 	PL_reg_flags |= RF_tainted;
 	/* FALL THROUGH */
     case EXACTF:
+    case EXACTFU:
 
-	/* The comments for the EXACT case apply as well to these fold ones */
+	/* The comments for the EXACT case above apply as well to these fold
+	 * ones */
 
 	c = (U8)*STRING(p);
 	assert(! UTF_PATTERN || UNI_IS_INVARIANT(c));
@@ -5835,27 +5873,22 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth)
 	     * deferred */
 	}
 	else {
+	    U8 folded;
 
-	    /* Here, the string isn't utf8; and either the pattern isn't utf8
-	     * or c is an invariant, so its utf8ness doesn't affect c.  Can
-	     * just do simple comparisons for exact or fold matching. */
+	    /* Here, the string isn't utf8 and c is a single byte; and either
+	     * the pattern isn't utf8 or c is an invariant, so its utf8ness
+	     * doesn't affect c.  Can just do simple comparisons for exact or
+	     * fold matching. */
 	    switch (OP(p)) {
-	    case EXACTF:
-		while (scan < loceol &&
-		    (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c]))
-		{
-		    scan++;
-		}
-		break;
-	    case EXACTFL:
-		while (scan < loceol &&
-		    (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c]))
-		{
-		    scan++;
-		}
-		break;
-	    default:
-		Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
+		case EXACTF: folded = PL_fold[c]; break;
+		case EXACTFU: folded = PL_fold_latin1[c]; break;
+		case EXACTFL: folded = PL_fold_locale[c]; break;
+		default: Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p));
+	    }
+	    while (scan < loceol &&
+		   (UCHARAT(scan) == c || UCHARAT(scan) == folded))
+	    {
+		scan++;
 	    }
 	}
 	break;
@@ -6300,11 +6333,17 @@ S_reginclass(pTHX_ const regexp * const prog, register const regnode * const n,
     /* If the bitmap didn't (or couldn't) match, and something outside the
      * bitmap could match, try that */
     if (!match) {
-	if (utf8_target && (flags & ANYOF_UNICODE_ALL) && c >= 256) {
-	    match = TRUE;
+	if (utf8_target && (flags & ANYOF_UNICODE_ALL)) {
+	    if (c >= 256
+		|| ((flags & ANYOF_FOLD) /* Latin1 1 that has a non-Latin1 fold
+					    should match */
+		    && _HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(c)))
+	    {
+		match = TRUE;
+	    }
 	}
-	else if ((flags & ANYOF_NONBITMAP_NON_UTF8)
-		 || (utf8_target && flags & ANYOF_UTF8))
+	if (!match && ((flags & ANYOF_NONBITMAP_NON_UTF8)
+		       || (utf8_target && flags & ANYOF_UTF8)))
 	{
 	    AV *av;
 	    SV * const sw = regclass_swash(prog, n, TRUE, 0, (SV**)&av);
diff --git a/regnodes.h b/regnodes.h
index d3338c2..97ac607 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -6,8 +6,8 @@
 
 /* Regops and State definitions */
 
-#define REGNODE_MAX           	90
-#define REGMATCH_STATE_MAX    	130
+#define REGNODE_MAX           	91
+#define REGMATCH_STATE_MAX    	131
 
 #define	END                   	0	/* 0000 End of program. */
 #define	SUCCEED               	1	/* 0x01 Return from a subroutine, basically. */
@@ -98,8 +98,9 @@
 #define	HORIZWS               	86	/* 0x56 horizontal whitespace       (Perl 6) */
 #define	NHORIZWS              	87	/* 0x57 not horizontal whitespace   (Perl 6) */
 #define	FOLDCHAR              	88	/* 0x58 codepoint with tricky case folding properties. */
-#define	OPTIMIZED             	89	/* 0x59 Placeholder for dump. */
-#define	PSEUDO                	90	/* 0x5a Pseudo opcode for internal use. */
+#define	EXACTFU               	89	/* 0x59 Match this string, folded, Unicode semantics for non-utf8 (prec. by length). */
+#define	OPTIMIZED             	90	/* 0x5a Placeholder for dump. */
+#define	PSEUDO                	91	/* 0x5b Pseudo opcode for internal use. */
 	/* ------------ States ------------- */
 #define	TRIE_next             	(REGNODE_MAX + 1)	/* state for TRIE */
 #define	TRIE_next_fail        	(REGNODE_MAX + 2)	/* state for TRIE */
@@ -237,6 +238,7 @@ EXTCONST U8 PL_regkind[] = {
 	HORIZWS,  	/* HORIZWS                */
 	NHORIZWS, 	/* NHORIZWS               */
 	FOLDCHAR, 	/* FOLDCHAR               */
+	EXACT,    	/* EXACTFU                */
 	NOTHING,  	/* OPTIMIZED              */
 	PSEUDO,   	/* PSEUDO                 */
 	/* ------------ States ------------- */
@@ -376,6 +378,7 @@ static const U8 regarglen[] = {
 	0,                                   	/* HORIZWS      */
 	0,                                   	/* NHORIZWS     */
 	EXTRA_SIZE(struct regnode_1),        	/* FOLDCHAR     */
+	0,                                   	/* EXACTFU      */
 	0,                                   	/* OPTIMIZED    */
 	0,                                   	/* PSEUDO       */
 };
@@ -472,6 +475,7 @@ static const char reg_off_by_arg[] = {
 	0,	/* HORIZWS      */
 	0,	/* NHORIZWS     */
 	0,	/* FOLDCHAR     */
+	0,	/* EXACTFU      */
 	0,	/* OPTIMIZED    */
 	0,	/* PSEUDO       */
 };
@@ -573,8 +577,9 @@ EXTCONST char * const PL_reg_name[] = {
 	"HORIZWS",               	/* 0x56 */
 	"NHORIZWS",              	/* 0x57 */
 	"FOLDCHAR",              	/* 0x58 */
-	"OPTIMIZED",             	/* 0x59 */
-	"PSEUDO",                	/* 0x5a */
+	"EXACTFU",               	/* 0x59 */
+	"OPTIMIZED",             	/* 0x5a */
+	"PSEUDO",                	/* 0x5b */
 	/* ------------ States ------------- */
 	"TRIE_next",             	/* REGNODE_MAX +0x01 */
 	"TRIE_next_fail",        	/* REGNODE_MAX +0x02 */
diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index 07b0a1b..13fdd3c 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -249,16 +249,16 @@ foreach my $test (sort { numerically } keys %tests) {
         next if $target_above_latin1 && ! $utf8_target;
         $upgrade_target = '; utf8::upgrade($c)' if ! $target_above_latin1 && $utf8_target;
 
-        foreach my $utf8_pattern (0, 1) {
-          next if $pattern_above_latin1 && ! $utf8_pattern;
+        foreach my $uni_pattern (0, 1) {
+          next if $pattern_above_latin1 && ! $uni_pattern;
           my $upgrade_pattern = "";
-          $upgrade_pattern = '; utf8::upgrade($p)' if ! $pattern_above_latin1 && $utf8_pattern;
+          $upgrade_pattern = '; use re "/u"' if ! $pattern_above_latin1 && $uni_pattern;
 
           my $lhs = join "", @x_target;
           my @rhs = @x_pattern;
           #print "$lhs: ", "/@rhs/\n";
 
-          foreach my $bracketed (1) {   # Put rhs in [...], or not
+          foreach my $bracketed (0, 1) {   # Put rhs in [...], or not
             foreach my $inverted (0,1) {
                 next if $inverted && ! $bracketed;
 
@@ -316,7 +316,7 @@ foreach my $test (sort { numerically } keys %tests) {
                           #next unless $must_match;
                           my $quantified = "(?$uni_semantics:$l_anchor$prepend$interior${quantifier}$append$r_anchor)";
                           my $op;
-                          if ($must_match && ! $utf8_target && ! $utf8_pattern && ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self)  {
+                          if ($must_match && ! $utf8_target && ! $uni_pattern && ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self)  {
                               $op = 0;
                           } else {
                               $op = 1;
@@ -324,8 +324,8 @@ foreach my $test (sort { numerically } keys %tests) {
                           $op = ! $op if $must_match && $inverted;
                           $op = ($op) ? '=~' : '!~';
 
-                          my $stuff .= " utf8_target=$utf8_target, uni_semantics=$uni_semantics, utf8_pattern=$utf8_pattern, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, qu ... [62 chars truncated]
-                          my $eval = "my \$c = \"$prepend$lhs$append\"$upgrade_target; my \$p = qr/$quantified/i$upgrade_pattern; \$c $op \$p";
+                          my $stuff .= " utf8_target=$utf8_target, uni_semantics=$uni_semantics, uni_pattern=$uni_pattern, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, quan ... [60 chars truncated]
+                          my $eval = "my \$c = \"$prepend$lhs$append\"$upgrade_target; $upgrade_pattern; \$c $op /$quantified/i;";
 
                           # XXX Doesn't currently test multi-char folds
                           next if @pattern != 1;
diff --git a/t/re/reg_fold.t b/t/re/reg_fold.t
index 135b40e..ce84960 100644
--- a/t/re/reg_fold.t
+++ b/t/re/reg_fold.t
@@ -21,6 +21,7 @@ while (<$fh>) {
     my ($line,$comment)= split/\s+#\s+/, $_;
     my ($cp,$type,@folded)=split/[\s;]+/,$line||'';
     next unless $type and ($type eq 'F' or $type eq 'C');
+    next if $type eq 'C';   # 'C' tests now done by fold_grind.t
     my $fold_above_latin1 = grep { hex("0x$_") > 255 } @folded;
     $_="\\x{$_}" for @folded;
     my $cpv=hex("0x$cp");
@@ -73,6 +74,54 @@ while (<$fh>) {
     }
 }
 
+# Now verify the case folding tables.  First compute the mappings without
+# resorting to the functions we're testing.
+
+# Initialize the array so each $i maps to itself.
+my @fold_ascii;
+for my $i (0 .. 255) {
+    $fold_ascii[$i] = $i;
+}
+my @fold_latin1 = @fold_ascii;
+
+# Override the uppercase elements to fold to their lower case equivalents,
+# using the fact that 'A' in ASCII is 0x41, 'a' is 0x41+32, 'B' is 0x42, and
+# so on.  The same paradigm applies for most of the Latin1 range cased
+# characters, but in posix anything outside ASCII maps to itself, as we've
+# already set up.
+for my $i (0x41 .. 0x5A, 0xC0 .. 0xD6, 0xD8 .. 0xDE) {
+    my $upper_ord = ord_latin1_to_native($i);
+    my $lower_ord = ord_latin1_to_native($i + 32);
+
+    $fold_latin1[$upper_ord] = $lower_ord;
+
+    next if $i > 127;
+    $fold_ascii[$upper_ord] = $lower_ord;
+}
+
+# Same for folding lower to the upper equivalents
+for my $i (0x61 .. 0x7A, 0xE0 .. 0xF6, 0xF8 .. 0xFE) {
+    my $lower_ord = ord_latin1_to_native($i);
+    my $upper_ord = ord_latin1_to_native($i - 32);
+
+    $fold_latin1[$lower_ord] = $upper_ord;
+
+    next if $i > 127;
+    $fold_ascii[$lower_ord] = $upper_ord;
+}
+
+# Test every latin1 character that the correct values in both /u and /d
+for my $i (0 .. 255) {
+    my $chr = sprintf "\\x%02X", $i;
+    my $hex_fold_ascii = sprintf "0x%02X", $fold_ascii[$i];
+    my $hex_fold_latin1 = sprintf "0x%02X", $fold_latin1[$i];
+    push @tests, qq[like chr($hex_fold_ascii), qr/(?d:$chr)/i, 'chr($hex_fold_ascii) =~ qr/(?d:$chr)/i'];
+    $count++;
+    push @tests, qq[like chr($hex_fold_latin1), qr/(?u:$chr)/i, 'chr($hex_fold_latin1) =~ qr/(?u:$chr)/i'];
+    $count++;
+}
+
+
 push @tests, qq[like chr(0x0430), qr/[=\x{0410}-\x{0411}]/i, 'Bug #71752 Unicode /i char in a range'];
 $count++;
 push @tests, qq[like 'a', qr/\\p{Upper}/i, "'a' =~ /\\\\p{Upper}/i"];
diff --git a/t/test.pl b/t/test.pl
index 42b84cd..bfda110 100644
--- a/t/test.pl
+++ b/t/test.pl
@@ -1129,4 +1129,17 @@ sub latin1_to_native($) {
     return $string;
 }
 
+sub ord_latin1_to_native {
+    # given an input latin1 code point, return the platform's native
+    # equivalent value
+
+    return ord latin1_to_native(chr $_[0]);
+}
+
+sub ord_native_to_latin1 {
+    # given an input platform code point, return the latin1 equivalent value
+
+    return ord native_to_latin1(chr $_[0]);
+}
+
 1;
diff --git a/utf8.h b/utf8.h
index ef5fecc..6864b3a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -275,6 +275,7 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
 #define UNICODE_GREEK_CAPITAL_LETTER_SIGMA	0x03A3
 #define UNICODE_GREEK_SMALL_LETTER_FINAL_SIGMA	0x03C2
 #define UNICODE_GREEK_SMALL_LETTER_SIGMA	0x03C3
+#define GREEK_SMALL_LETTER_MU                   0x03BC
 
 #define UNI_DISPLAY_ISPRINT	0x0001
 #define UNI_DISPLAY_BACKSLASH	0x0002
diff --git a/util.c b/util.c
index f3c27f9..02861f0 100644
--- a/util.c
+++ b/util.c
@@ -930,6 +930,27 @@ Perl_foldEQ(const char *s1, const char *s2, register I32 len)
     }
     return 1;
 }
+I32
+Perl_foldEQ_latin1(const char *s1, const char *s2, register I32 len)
+{
+    /* Compare non-utf8 using Unicode (Latin1) semantics.  Does not work on
+     * MICRO_SIGN, LATIN_SMALL_LETTER_SHARP_S, nor
+     * LATIN_SMALL_LETTER_Y_WITH_DIAERESIS, and does not check for these.  Nor
+     * does it check that the strings each have at least 'len' characters */
+
+    register const U8 *a = (const U8 *)s1;
+    register const U8 *b = (const U8 *)s2;
+
+    PERL_ARGS_ASSERT_FOLDEQ_LATIN1;
+
+    while (len--) {
+	if (*a != *b && *a != PL_fold_latin1[*b]) {
+	    return 0;
+	}
+	a++, b++;
+    }
+    return 1;
+}
 
 /*
 =for apidoc foldEQ_locale

--
Perl5 Master Repository



nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About