develooper Front page | perl.perl5.porters | Postings from March 2012

Re: [perl #107008] UTF8 patches for 5.16

Thread Previous | Thread Next
From:
Brian Fraser
Date:
March 21, 2012 18:48
Subject:
Re: [perl #107008] UTF8 patches for 5.16
Message ID:
CA+nL+nb_zJkqz8h+kDUOnOVMwhbrdQMKQq24MQimr0L-Q-Z8HQ@mail.gmail.com
On Wed, Mar 21, 2012 at 10:04 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:

> On Sun Dec 25 14:33:33 2011, sprout wrote:
> > Brian Fraser’s work on UTF8 support was never fully integrated into
> >    blead.  I’m creating this ticket, so I can make it a blocker for
> >    5.16.
> >
> > I really do think it should be a blocker, because without the last few
> >    patches it becomes painful to list the exceptions in perldelta.
> >
> > The patches can be found here: https://github.com/Hugmeir/gsoc-pad-
> >    utf8-safety/commits/tokemess
> >
> > Everything before ‘Label cleanup’ has been applied already.
> >
> > The label patch has been modified to account for the unicode_eval
> >    feature and is on the sprout/tokemess branch at perl5.git.perl.org.
> >    It is producing one failure that neither Brian Fraser nor I can
> >    fully understand.  We are awaiting feedback from Zefram, since it
> >    is in code that he wrote.  (Why is parse_label trying to take
> >    ownership of the string passed to it?  Why is it assuming it occurs
> >    at the start of a mallocked block?)
>
> I’m sorry I’ve done nothing about this lately.  I was planning to work
> on it the past two months, but various things in Real Life got in the
> way, and when one thing finished and I thought I was going to get time,
> something else came up.
>

It's like we are living the same life : O


>
> I still don’t understand why parse_label is causing problems.  Zefram,
> can you look into that?
>

For a quick recap, this is how Father C spotted this:

$ PERL_DESTRUCT_LEVEL=2 ./perl -Ilib ext/XS-APItest/t/swaplabel.t
1..56
ok 1
... truncated ...
ok 56
panic: free from wrong pool during global destruction.

The problem is in these lines of parse_label:

lsv = newSV_type(SVt_PV);
SvPV_set(lsv, lpv);
SvCUR_set(lsv, llen);

And it goes away if you do this:

lsv = newSV_type(SVt_PV);
SvPV_set(lsv, pl_yylval.opval->op_type == OP_CONST ? savepv(lpv) : lpv);
SvCUR_set(lsv, llen);

But I don't have a clue why! So that's definitely not ready for inclusion.
I dimly recall that the normal label parsing (with normal, I mean the one
that doesn't use parse_label) takes ownership of the SV's PV slot, but I'm
nowhere near knowledgeable enoughto figure out if that applies to
parse_label too, or why it only shows up with PERL_DESTRUCT_LEVEL=2.


> I’ve begun work on the patches following that.  I need to ask Brian
> Fraser about
> <
> https://github.com/Hugmeir/gsoc-pad-utf8-safety/commit/4929775f6218457e97e9c11e8b1fcfce20b6316f
> >.
>  Unicode delimiters are not going to go into 5.16, so this patch is
> unnecessary for now, right?
>

I would say throw it out, yes. The one thing it might have fixed
(unintendedly, and definitely without tests) is the error message in

use open qw( :utf8 :std );
use utf8;
use 5.014;

eval "q\x{FF01}asdasd\n\x{FF01}say 1";
say $@;

But that can be properly addressed later.


> I’m attaching the patch here, so it isn’t lost.
>
> --
>
> Father Chrysostomos
>
>
> ---
> via perlbug:  queue: perl5 status: new
> https://rt.perl.org:443/rt3/Ticket/Display.html?id=107008
>
> From 4929775f6218457e97e9c11e8b1fcfce20b6316f Mon Sep 17 00:00:00 2001
> From: Brian Fraser <fraserbn@gmail.com>
> Date: Sat, 20 Aug 2011 10:30:32 +0100
> Subject: [PATCH] toke.c and parser.h: Make multi_(open|close) UVs instead
> of
>  chars.
>
> This changes the type of PL_parser->multi_(open|close) to UV from
> char; Currently this does nothing much, but if needed be,
> it allows the implementation of UTF-8 paired delimiters.
> ---
>  parser.h |    4 +-
>  toke.c   |  115
> ++++++++++++++++++++++++++++++++++++++++++++------------------
>  2 files changed, 84 insertions(+), 35 deletions(-)
>
> diff --git a/parser.h b/parser.h
> index bbf3bf3..cdef065 100644
> --- a/parser.h
> +++ b/parser.h
> @@ -57,8 +57,8 @@
>     SV         *lex_stuff;     /* runtime pattern from m// or s/// */
>     I32                multi_start;    /* 1st line of multi-line string */
>     I32                multi_end;      /* last line of multi-line string */
> -    char       multi_open;     /* delimiter of said string */
> -    char       multi_close;    /* delimiter of said string */
> +    UV         multi_open;     /* delimiter of said string */
> +    UV         multi_close;    /* delimiter of said string */
>     char       pending_ident;  /* pending identifier lookup */
>     bool       preambled;
>     I32                lex_allbrackets;/* (), [], {}, ?: bracket count */
> diff --git a/toke.c b/toke.c
> index 5ce02f8..aaac904 100644
> --- a/toke.c
> +++ b/toke.c
> @@ -570,23 +570,27 @@ enum token_type {
>  S_missingterm(pTHX_ char *s, STRLEN len)
>  {
>     dVAR;
> -    char tmpbuf[3];
> +    char tmpbuf[UTF8_MAXBYTES + 1];
>     char q;
>     if (s) {
>        char * const nl = strrchr(s,'\n');
>        if (nl)
>            *nl = '\0';
>     }
> -    else if (isCNTRL(PL_multi_close)) {
> +    else if (UNI_IS_INVARIANT(PL_multi_close) &&
> isCNTRL((char)PL_multi_close)) {
>        *tmpbuf = '^';
> -       tmpbuf[1] = (char)toCTRL(PL_multi_close);
> -       tmpbuf[2] = '\0';
> +       tmpbuf[1] = (char)toCTRL((char)PL_multi_close);
> +       len = 2;
>        s = tmpbuf;
>     }
>     else {
> -       *tmpbuf = (char)PL_multi_close;
> -       tmpbuf[1] = '\0';
> -       s = tmpbuf;
> +        if (UTF)
> +          len = uvchr_to_utf8((U8*)tmpbuf, PL_multi_close) - (U8*)tmpbuf;
> +        else {
> +            *tmpbuf = (char)PL_multi_close;
> +            len = 1;
> +        }
> +        s = tmpbuf;
>     }
>     q = strchr(s,'"') ? '\'' : '"';
>     Perl_croak(aTHX_ "Can't find string terminator %c%"SVf"%c anywhere
> before EOF",
> @@ -9001,7 +9005,7 @@ If the next character is in (or extends into) the
> next chunk of input
>     }
>
>     pm = (PMOP*)newPMOP(type, 0);
> -    if (PL_multi_open == '?') {
> +    if (UNI_IS_INVARIANT(PL_multi_open) && (char)PL_multi_open == '?') {
>        /* This is the only point in the code that sets PMf_ONCE:  */
>        pm->op_pmflags |= PMf_ONCE;
>
> @@ -9067,8 +9071,16 @@ If the next character is in (or extends into) the
> next chunk of input
>     if (!s)
>        Perl_croak(aTHX_ "Substitution pattern not terminated");
>
> -    if (s[-1] == PL_multi_open)
> -       s--;
> +    if ( UNI_IS_INVARIANT(PL_multi_open) ) {
> +        if (s[-1] == (char)PL_multi_open)
> +            s--;
> +    }
> +    else {
> +        char tmpbuf[UTF8_MAXBYTES + 1];
> +        STRLEN tmplen = uvchr_to_utf8((U8*)tmpbuf, PL_multi_open) -
> (U8*)tmpbuf;
> +        if ( memEQ(s-tmplen, tmpbuf, tmplen) )
> +            s -= tmplen;
> +    }
>  #ifdef PERL_MAD
>     if (PL_madskills) {
>        CURMAD('q', PL_thisopen);
> @@ -9175,8 +9187,16 @@ If the next character is in (or extends into) the
> next chunk of input
>     if (!s)
>        Perl_croak(aTHX_ "Transliteration pattern not terminated");
>
> -    if (s[-1] == PL_multi_open)
> -       s--;
> +    if ( UNI_IS_INVARIANT(PL_multi_open) ) {
> +        if (s[-1] == (char)PL_multi_open)
> +            s--;
> +    }
> +    else {
> +        char tmpbuf[UTF8_MAXBYTES + 1];
> +        STRLEN tmplen = uvchr_to_utf8((U8*)tmpbuf, PL_multi_open) -
> (U8*)tmpbuf;
> +        if ( memEQ(s-tmplen, tmpbuf, tmplen) )
> +            s -= tmplen;
> +    }
>  #ifdef PERL_MAD
>     if (PL_madskills) {
>        CURMAD('q', PL_thisopen);
> @@ -9729,12 +9749,11 @@ If the next character is in (or extends into) the
> next chunk of input
>     SV *sv;                            /* scalar value: string */
>     const char *tmps;                  /* temp string, used for delimiter
> matching */
>     register char *s = start;          /* current position in the buffer */
> -    register char term;                        /* terminating character */
>     register char *to;                 /* current position in the sv's
> data */
> +    register UV term;                  /* terminating character */
>     I32 brackets = 1;                  /* bracket nesting level */
>     bool has_utf8 = FALSE;             /* is there any utf8 content? */
> -    I32 termcode;                      /* terminating char. code */
> -    U8 termstr[UTF8_MAXBYTES];         /* terminating string */
> +    U8 termstr[UTF8_MAXBYTES + 1];     /* terminating string */
>     STRLEN termlen;                    /* length of terminating string */
>     int last_off = 0;                  /* last position for nesting
> bracket */
>  #ifdef PERL_MAD
> @@ -9761,15 +9780,14 @@ If the next character is in (or extends into) the
> next chunk of input
>     CLINE;
>
>     /* after skipping whitespace, the next character is the terminator */
> -    term = *s;
>     if (!UTF) {
> -       termcode = termstr[0] = term;
> +       termstr[0] = term = *s;
>        termlen = 1;
>     }
>     else {
> -       termcode = utf8_to_uvchr((U8*)s, &termlen);
> +       term = utf8_to_uvchr((U8*)s, &termlen);
>        Copy(s, termstr, termlen, U8);
> -       if (!UTF8_IS_INVARIANT(term))
> +       if (!UNI_IS_INVARIANT(term))
>            has_utf8 = TRUE;
>     }
>
> @@ -9778,8 +9796,15 @@ If the next character is in (or extends into) the
> next chunk of input
>     PL_multi_open = term;
>
>     /* find corresponding closing delimiter */
> -    if (term && (tmps = strchr("([{< )]}> )]}>",term)))
> -       termcode = termstr[0] = term = tmps[5];
> +    /* XXX: this strchr() and the assignment to term would have to be
> +       replaced by a function to get UTF-8 delimiters */
> +    if (term && !has_utf8 && (tmps = strchr("([{< )]}> )]}>",term))) {
> +       term = tmps[5];
> +        if ( has_utf8 )
> +            uvchr_to_utf8((U8*)termstr, term);
> +        else
> +            termstr[0] = term;
> +    }
>
>     PL_multi_close = term;
>
> @@ -9787,7 +9812,7 @@ If the next character is in (or extends into) the
> next chunk of input
>        What a random number. */
>     sv = newSV_type(SVt_PVIV);
>     SvGROW(sv, 80);
> -    SvIV_set(sv, termcode);
> +    SvIV_set(sv, term);
>     (void)SvPOK_only(sv);              /* validate pointer */
>
>     /* move past delimiter and try to read a complete string */
> @@ -9843,12 +9868,12 @@ If the next character is in (or extends into) the
> next chunk of input
>                            /* At here, all closes are "was quoted" one,
>                               so we don't check PL_multi_close. */
>                            if (*t == '\\') {
> -                               if (!keep_quoted && *(t+1) ==
> PL_multi_open)
> +                               if (!keep_quoted && *(t+1) ==
> (char)PL_multi_open)
>                                    t++;
>                                else
>                                    *w++ = *t++;
>                            }
> -                           else if (*t == PL_multi_open)
> +                           else if (*t == (char)PL_multi_open)
>                                brackets++;
>
>                            *w = *t;
> @@ -9878,13 +9903,13 @@ If the next character is in (or extends into) the
> next chunk of input
>
>        /* if open delimiter is the close delimiter read unbridle */
>        if (PL_multi_open == PL_multi_close) {
> -           for (; s < PL_bufend; s++,to++) {
> +           while (s < PL_bufend ) {
>                /* embedded newlines increment the current line number */
>                if (*s == '\n' && !PL_rsfp)
>                    CopLINE_inc(PL_curcop);
>                /* handle quoted delimiters */
>                if (*s == '\\' && s+1 < PL_bufend && term != '\\') {
> -                   if (!keep_quoted && s[1] == term)
> +                   if (!keep_quoted && s[1] == (char)*termstr)
>                        s++;
>                /* any other quotes are simply copied straight through */
>                    else
> @@ -9892,7 +9917,7 @@ If the next character is in (or extends into) the
> next chunk of input
>                }
>                /* terminate when run out of buffer (the for() condition),
> or
>                   have found the terminator */
> -               else if (*s == term) {
> +               else if (*s == (char)*termstr) {
>                    if (termlen == 1)
>                        break;
>                    if (s+termlen <= PL_bufend && memEQ(s, (char*)termstr,
> termlen))
> @@ -9900,6 +9925,14 @@ If the next character is in (or extends into) the
> next chunk of input
>                }
>                else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF)
>                    has_utf8 = TRUE;
> +
> +                if (has_utf8) {
> +                    STRLEN skiplen = UTF8SKIP(s);
> +                    while (skiplen--)
> +                         *to++ = *s++;
> +                }
> +                else
> +                    *to++ = *s++;
>                *to = *s;
>            }
>        }
> @@ -9910,25 +9943,36 @@ If the next character is in (or extends into) the
> next chunk of input
>        */
>        else {
>            /* read until we run out of string, or we find the terminator */
> -           for (; s < PL_bufend; s++,to++) {
> +           while ( s < PL_bufend ) {
>                /* embedded newlines increment the line count */
>                if (*s == '\n' && !PL_rsfp)
>                    CopLINE_inc(PL_curcop);
>                /* backslashes can escape the open or closing characters */
>                if (*s == '\\' && s+1 < PL_bufend) {
> +                    const UV next_chr = has_utf8
> +                                            ? utf8_to_uvchr((U8*)(s+1),
> NULL)
> +                                            : (UV)(s[1]);
>                    if (!keep_quoted &&
> -                       ((s[1] == PL_multi_open) || (s[1] ==
> PL_multi_close)))
> +                            (next_chr == PL_multi_open || next_chr ==
> PL_multi_close) )
>                        s++;
>                    else
>                        *to++ = *s++;
>                }
>                /* allow nested opens and closes */
> -               else if (*s == PL_multi_close && --brackets <= 0)
> +               else if ((has_utf8 ? utf8_to_uvchr((U8*)s, NULL) ==
> PL_multi_close : (char)PL_multi_close == *s) && --brackets <= 0)
>                    break;
> -               else if (*s == PL_multi_open)
> +               else if (has_utf8 ? utf8_to_uvchr((U8*)s, NULL) ==
> PL_multi_open : PL_multi_open == (UV)*s)
>                    brackets++;
>                else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF)
>                    has_utf8 = TRUE;
> +
> +                if (has_utf8) {
> +                    STRLEN skiplen = UTF8SKIP(s);
> +                    while (skiplen--)
> +                         *to++ = *s++;
> +                }
> +                else
> +                    *to++ = *s++;
>                *to = *s;
>            }
>        }
> @@ -10716,8 +10760,13 @@ If the next character is in (or extends into) the
> next chunk of input
>        Perl_sv_catpvf(aTHX_ msg, "%"SVf"\n", SVfARG(where_sv));
>     if (PL_multi_start < PL_multi_end && (U32)(CopLINE(PL_curcop) -
> PL_multi_end) <= 1) {
>         Perl_sv_catpvf(aTHX_ msg,
> -        "  (Might be a runaway multi-line %c%c string starting on line
> %"IVdf")\n",
> -
>  (int)PL_multi_open,(int)PL_multi_close,(IV)PL_multi_start);
> +            "  (Might be a runaway multi-line %s string starting on line
> %"IVdf")\n",
> +                ((UNI_IS_INVARIANT(PL_multi_open) &&
> UNI_IS_INVARIANT(PL_multi_close))
> +                    ? Perl_form(aTHX_ "%c%c",
> +                        (char)PL_multi_open, (char)PL_multi_close)
> +                    : Perl_form(aTHX_ "\\x{%02X}\\x{%02X}",
> +                        (unsigned int)PL_multi_open, (unsigned
> int)PL_multi_close)),
> +                (IV)PL_multi_start);
>         PL_multi_end = 0;
>     }
>     if (PL_in_eval & EVAL_WARNONLY) {
> --
> 1.7.5.4
>
>
>

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About