Front page | perl.perl5.porters |
Postings from March 2012
Re: [perl #107008] UTF8 patches for 5.16
Thread Previous
|
Thread Next
From:
Brian Fraser
Date:
March 21, 2012 18:48
Subject:
Re: [perl #107008] UTF8 patches for 5.16
Message ID:
CA+nL+nb_zJkqz8h+kDUOnOVMwhbrdQMKQq24MQimr0L-Q-Z8HQ@mail.gmail.com
On Wed, Mar 21, 2012 at 10:04 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:
> On Sun Dec 25 14:33:33 2011, sprout wrote:
> > Brian Fraser’s work on UTF8 support was never fully integrated into
> > blead. I’m creating this ticket, so I can make it a blocker for
> > 5.16.
> >
> > I really do think it should be a blocker, because without the last few
> > patches it becomes painful to list the exceptions in perldelta.
> >
> > The patches can be found here: https://github.com/Hugmeir/gsoc-pad-
> > utf8-safety/commits/tokemess
> >
> > Everything before ‘Label cleanup’ has been applied already.
> >
> > The label patch has been modified to account for the unicode_eval
> > feature and is on the sprout/tokemess branch at perl5.git.perl.org.
> > It is producing one failure that neither Brian Fraser nor I can
> > fully understand. We are awaiting feedback from Zefram, since it
> > is in code that he wrote. (Why is parse_label trying to take
> > ownership of the string passed to it? Why is it assuming it occurs
> > at the start of a mallocked block?)
>
> I’m sorry I’ve done nothing about this lately. I was planning to work
> on it the past two months, but various things in Real Life got in the
> way, and when one thing finished and I thought I was going to get time,
> something else came up.
>
It's like we are living the same life : O
>
> I still don’t understand why parse_label is causing problems. Zefram,
> can you look into that?
>
For a quick recap, this is how Father C spotted this:
$ PERL_DESTRUCT_LEVEL=2 ./perl -Ilib ext/XS-APItest/t/swaplabel.t
1..56
ok 1
... truncated ...
ok 56
panic: free from wrong pool during global destruction.
The problem is in these lines of parse_label:
lsv = newSV_type(SVt_PV);
SvPV_set(lsv, lpv);
SvCUR_set(lsv, llen);
And it goes away if you do this:
lsv = newSV_type(SVt_PV);
SvPV_set(lsv, pl_yylval.opval->op_type == OP_CONST ? savepv(lpv) : lpv);
SvCUR_set(lsv, llen);
But I don't have a clue why! So that's definitely not ready for inclusion.
I dimly recall that the normal label parsing (with normal, I mean the one
that doesn't use parse_label) takes ownership of the SV's PV slot, but I'm
nowhere near knowledgeable enoughto figure out if that applies to
parse_label too, or why it only shows up with PERL_DESTRUCT_LEVEL=2.
> I’ve begun work on the patches following that. I need to ask Brian
> Fraser about
> <
> https://github.com/Hugmeir/gsoc-pad-utf8-safety/commit/4929775f6218457e97e9c11e8b1fcfce20b6316f
> >.
> Unicode delimiters are not going to go into 5.16, so this patch is
> unnecessary for now, right?
>
I would say throw it out, yes. The one thing it might have fixed
(unintendedly, and definitely without tests) is the error message in
use open qw( :utf8 :std );
use utf8;
use 5.014;
eval "q\x{FF01}asdasd\n\x{FF01}say 1";
say $@;
But that can be properly addressed later.
> I’m attaching the patch here, so it isn’t lost.
>
> --
>
> Father Chrysostomos
>
>
> ---
> via perlbug: queue: perl5 status: new
> https://rt.perl.org:443/rt3/Ticket/Display.html?id=107008
>
> From 4929775f6218457e97e9c11e8b1fcfce20b6316f Mon Sep 17 00:00:00 2001
> From: Brian Fraser <fraserbn@gmail.com>
> Date: Sat, 20 Aug 2011 10:30:32 +0100
> Subject: [PATCH] toke.c and parser.h: Make multi_(open|close) UVs instead
> of
> chars.
>
> This changes the type of PL_parser->multi_(open|close) to UV from
> char; Currently this does nothing much, but if needed be,
> it allows the implementation of UTF-8 paired delimiters.
> ---
> parser.h | 4 +-
> toke.c | 115
> ++++++++++++++++++++++++++++++++++++++++++++------------------
> 2 files changed, 84 insertions(+), 35 deletions(-)
>
> diff --git a/parser.h b/parser.h
> index bbf3bf3..cdef065 100644
> --- a/parser.h
> +++ b/parser.h
> @@ -57,8 +57,8 @@
> SV *lex_stuff; /* runtime pattern from m// or s/// */
> I32 multi_start; /* 1st line of multi-line string */
> I32 multi_end; /* last line of multi-line string */
> - char multi_open; /* delimiter of said string */
> - char multi_close; /* delimiter of said string */
> + UV multi_open; /* delimiter of said string */
> + UV multi_close; /* delimiter of said string */
> char pending_ident; /* pending identifier lookup */
> bool preambled;
> I32 lex_allbrackets;/* (), [], {}, ?: bracket count */
> diff --git a/toke.c b/toke.c
> index 5ce02f8..aaac904 100644
> --- a/toke.c
> +++ b/toke.c
> @@ -570,23 +570,27 @@ enum token_type {
> S_missingterm(pTHX_ char *s, STRLEN len)
> {
> dVAR;
> - char tmpbuf[3];
> + char tmpbuf[UTF8_MAXBYTES + 1];
> char q;
> if (s) {
> char * const nl = strrchr(s,'\n');
> if (nl)
> *nl = '\0';
> }
> - else if (isCNTRL(PL_multi_close)) {
> + else if (UNI_IS_INVARIANT(PL_multi_close) &&
> isCNTRL((char)PL_multi_close)) {
> *tmpbuf = '^';
> - tmpbuf[1] = (char)toCTRL(PL_multi_close);
> - tmpbuf[2] = '\0';
> + tmpbuf[1] = (char)toCTRL((char)PL_multi_close);
> + len = 2;
> s = tmpbuf;
> }
> else {
> - *tmpbuf = (char)PL_multi_close;
> - tmpbuf[1] = '\0';
> - s = tmpbuf;
> + if (UTF)
> + len = uvchr_to_utf8((U8*)tmpbuf, PL_multi_close) - (U8*)tmpbuf;
> + else {
> + *tmpbuf = (char)PL_multi_close;
> + len = 1;
> + }
> + s = tmpbuf;
> }
> q = strchr(s,'"') ? '\'' : '"';
> Perl_croak(aTHX_ "Can't find string terminator %c%"SVf"%c anywhere
> before EOF",
> @@ -9001,7 +9005,7 @@ If the next character is in (or extends into) the
> next chunk of input
> }
>
> pm = (PMOP*)newPMOP(type, 0);
> - if (PL_multi_open == '?') {
> + if (UNI_IS_INVARIANT(PL_multi_open) && (char)PL_multi_open == '?') {
> /* This is the only point in the code that sets PMf_ONCE: */
> pm->op_pmflags |= PMf_ONCE;
>
> @@ -9067,8 +9071,16 @@ If the next character is in (or extends into) the
> next chunk of input
> if (!s)
> Perl_croak(aTHX_ "Substitution pattern not terminated");
>
> - if (s[-1] == PL_multi_open)
> - s--;
> + if ( UNI_IS_INVARIANT(PL_multi_open) ) {
> + if (s[-1] == (char)PL_multi_open)
> + s--;
> + }
> + else {
> + char tmpbuf[UTF8_MAXBYTES + 1];
> + STRLEN tmplen = uvchr_to_utf8((U8*)tmpbuf, PL_multi_open) -
> (U8*)tmpbuf;
> + if ( memEQ(s-tmplen, tmpbuf, tmplen) )
> + s -= tmplen;
> + }
> #ifdef PERL_MAD
> if (PL_madskills) {
> CURMAD('q', PL_thisopen);
> @@ -9175,8 +9187,16 @@ If the next character is in (or extends into) the
> next chunk of input
> if (!s)
> Perl_croak(aTHX_ "Transliteration pattern not terminated");
>
> - if (s[-1] == PL_multi_open)
> - s--;
> + if ( UNI_IS_INVARIANT(PL_multi_open) ) {
> + if (s[-1] == (char)PL_multi_open)
> + s--;
> + }
> + else {
> + char tmpbuf[UTF8_MAXBYTES + 1];
> + STRLEN tmplen = uvchr_to_utf8((U8*)tmpbuf, PL_multi_open) -
> (U8*)tmpbuf;
> + if ( memEQ(s-tmplen, tmpbuf, tmplen) )
> + s -= tmplen;
> + }
> #ifdef PERL_MAD
> if (PL_madskills) {
> CURMAD('q', PL_thisopen);
> @@ -9729,12 +9749,11 @@ If the next character is in (or extends into) the
> next chunk of input
> SV *sv; /* scalar value: string */
> const char *tmps; /* temp string, used for delimiter
> matching */
> register char *s = start; /* current position in the buffer */
> - register char term; /* terminating character */
> register char *to; /* current position in the sv's
> data */
> + register UV term; /* terminating character */
> I32 brackets = 1; /* bracket nesting level */
> bool has_utf8 = FALSE; /* is there any utf8 content? */
> - I32 termcode; /* terminating char. code */
> - U8 termstr[UTF8_MAXBYTES]; /* terminating string */
> + U8 termstr[UTF8_MAXBYTES + 1]; /* terminating string */
> STRLEN termlen; /* length of terminating string */
> int last_off = 0; /* last position for nesting
> bracket */
> #ifdef PERL_MAD
> @@ -9761,15 +9780,14 @@ If the next character is in (or extends into) the
> next chunk of input
> CLINE;
>
> /* after skipping whitespace, the next character is the terminator */
> - term = *s;
> if (!UTF) {
> - termcode = termstr[0] = term;
> + termstr[0] = term = *s;
> termlen = 1;
> }
> else {
> - termcode = utf8_to_uvchr((U8*)s, &termlen);
> + term = utf8_to_uvchr((U8*)s, &termlen);
> Copy(s, termstr, termlen, U8);
> - if (!UTF8_IS_INVARIANT(term))
> + if (!UNI_IS_INVARIANT(term))
> has_utf8 = TRUE;
> }
>
> @@ -9778,8 +9796,15 @@ If the next character is in (or extends into) the
> next chunk of input
> PL_multi_open = term;
>
> /* find corresponding closing delimiter */
> - if (term && (tmps = strchr("([{< )]}> )]}>",term)))
> - termcode = termstr[0] = term = tmps[5];
> + /* XXX: this strchr() and the assignment to term would have to be
> + replaced by a function to get UTF-8 delimiters */
> + if (term && !has_utf8 && (tmps = strchr("([{< )]}> )]}>",term))) {
> + term = tmps[5];
> + if ( has_utf8 )
> + uvchr_to_utf8((U8*)termstr, term);
> + else
> + termstr[0] = term;
> + }
>
> PL_multi_close = term;
>
> @@ -9787,7 +9812,7 @@ If the next character is in (or extends into) the
> next chunk of input
> What a random number. */
> sv = newSV_type(SVt_PVIV);
> SvGROW(sv, 80);
> - SvIV_set(sv, termcode);
> + SvIV_set(sv, term);
> (void)SvPOK_only(sv); /* validate pointer */
>
> /* move past delimiter and try to read a complete string */
> @@ -9843,12 +9868,12 @@ If the next character is in (or extends into) the
> next chunk of input
> /* At here, all closes are "was quoted" one,
> so we don't check PL_multi_close. */
> if (*t == '\\') {
> - if (!keep_quoted && *(t+1) ==
> PL_multi_open)
> + if (!keep_quoted && *(t+1) ==
> (char)PL_multi_open)
> t++;
> else
> *w++ = *t++;
> }
> - else if (*t == PL_multi_open)
> + else if (*t == (char)PL_multi_open)
> brackets++;
>
> *w = *t;
> @@ -9878,13 +9903,13 @@ If the next character is in (or extends into) the
> next chunk of input
>
> /* if open delimiter is the close delimiter read unbridle */
> if (PL_multi_open == PL_multi_close) {
> - for (; s < PL_bufend; s++,to++) {
> + while (s < PL_bufend ) {
> /* embedded newlines increment the current line number */
> if (*s == '\n' && !PL_rsfp)
> CopLINE_inc(PL_curcop);
> /* handle quoted delimiters */
> if (*s == '\\' && s+1 < PL_bufend && term != '\\') {
> - if (!keep_quoted && s[1] == term)
> + if (!keep_quoted && s[1] == (char)*termstr)
> s++;
> /* any other quotes are simply copied straight through */
> else
> @@ -9892,7 +9917,7 @@ If the next character is in (or extends into) the
> next chunk of input
> }
> /* terminate when run out of buffer (the for() condition),
> or
> have found the terminator */
> - else if (*s == term) {
> + else if (*s == (char)*termstr) {
> if (termlen == 1)
> break;
> if (s+termlen <= PL_bufend && memEQ(s, (char*)termstr,
> termlen))
> @@ -9900,6 +9925,14 @@ If the next character is in (or extends into) the
> next chunk of input
> }
> else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF)
> has_utf8 = TRUE;
> +
> + if (has_utf8) {
> + STRLEN skiplen = UTF8SKIP(s);
> + while (skiplen--)
> + *to++ = *s++;
> + }
> + else
> + *to++ = *s++;
> *to = *s;
> }
> }
> @@ -9910,25 +9943,36 @@ If the next character is in (or extends into) the
> next chunk of input
> */
> else {
> /* read until we run out of string, or we find the terminator */
> - for (; s < PL_bufend; s++,to++) {
> + while ( s < PL_bufend ) {
> /* embedded newlines increment the line count */
> if (*s == '\n' && !PL_rsfp)
> CopLINE_inc(PL_curcop);
> /* backslashes can escape the open or closing characters */
> if (*s == '\\' && s+1 < PL_bufend) {
> + const UV next_chr = has_utf8
> + ? utf8_to_uvchr((U8*)(s+1),
> NULL)
> + : (UV)(s[1]);
> if (!keep_quoted &&
> - ((s[1] == PL_multi_open) || (s[1] ==
> PL_multi_close)))
> + (next_chr == PL_multi_open || next_chr ==
> PL_multi_close) )
> s++;
> else
> *to++ = *s++;
> }
> /* allow nested opens and closes */
> - else if (*s == PL_multi_close && --brackets <= 0)
> + else if ((has_utf8 ? utf8_to_uvchr((U8*)s, NULL) ==
> PL_multi_close : (char)PL_multi_close == *s) && --brackets <= 0)
> break;
> - else if (*s == PL_multi_open)
> + else if (has_utf8 ? utf8_to_uvchr((U8*)s, NULL) ==
> PL_multi_open : PL_multi_open == (UV)*s)
> brackets++;
> else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF)
> has_utf8 = TRUE;
> +
> + if (has_utf8) {
> + STRLEN skiplen = UTF8SKIP(s);
> + while (skiplen--)
> + *to++ = *s++;
> + }
> + else
> + *to++ = *s++;
> *to = *s;
> }
> }
> @@ -10716,8 +10760,13 @@ If the next character is in (or extends into) the
> next chunk of input
> Perl_sv_catpvf(aTHX_ msg, "%"SVf"\n", SVfARG(where_sv));
> if (PL_multi_start < PL_multi_end && (U32)(CopLINE(PL_curcop) -
> PL_multi_end) <= 1) {
> Perl_sv_catpvf(aTHX_ msg,
> - " (Might be a runaway multi-line %c%c string starting on line
> %"IVdf")\n",
> -
> (int)PL_multi_open,(int)PL_multi_close,(IV)PL_multi_start);
> + " (Might be a runaway multi-line %s string starting on line
> %"IVdf")\n",
> + ((UNI_IS_INVARIANT(PL_multi_open) &&
> UNI_IS_INVARIANT(PL_multi_close))
> + ? Perl_form(aTHX_ "%c%c",
> + (char)PL_multi_open, (char)PL_multi_close)
> + : Perl_form(aTHX_ "\\x{%02X}\\x{%02X}",
> + (unsigned int)PL_multi_open, (unsigned
> int)PL_multi_close)),
> + (IV)PL_multi_start);
> PL_multi_end = 0;
> }
> if (PL_in_eval & EVAL_WARNONLY) {
> --
> 1.7.5.4
>
>
>
Thread Previous
|
Thread Next