
[perl #133535] B API for aux_list/OP_MULTICONCAT does not return the last segment when plain & utf8 representations are different

From: Tony Cook via RT
Date: October 11, 2018 02:55
Subject: [perl #133535] B API for aux_list/OP_MULTICONCAT does not return the last segment when plain & utf8 representations are different
Message ID: rt-4.0.24-4910-1539226502-579.133535-15-0@perl.org

On Thu, 20 Sep 2018 09:57:52 -0700, atoomic wrote:
> I noticed this while using the B API to compile op/substr.t with B::C
> under Perl 5.28.0.
>
> The comment in pp_hot.c says that in some cases there are two sets of
> segment lengths:
> 
> * If the string has different plain and utf8 representations
> * (e.g. "\x80"), then aux[PERL_MULTICONCAT_IX_PLAIN_PV/LEN]
> * holds the plain rep, while aux[PERL_MULTICONCAT_IX_UTF8_PV/LEN]
> * holds the utf8 rep, and there are 2 sets of segment lengths,
> * with the utf8 set following after the plain set.
> 
> I have the feeling that the B API's aux_list for multiconcat fails to
> read the last segment in that scenario.
>
> This simplified version of op/substr.t is easier to debug, as it has a
> single multiconcat op.
> ________________________________________________________________________________
> #!./perl
> 
> print "1..1\n";
> 
> use utf8;
> my $refee = bless [], "\x{100}a";
> my $string = $refee;
> $string = "$string";
> substr $refee, 0, 0, "\xff";
> my $expect = "\xff$string"; # <---- multiconcat
> print "$refee" eq $expect ? "ok 1\n" : "not ok 1\n";
> ________________________________________________________________________________
> 
> 
> While running the program we go through this code with nargs=1, so we
> are clearly using the second set of segment lengths, not the first.
> 
> Perl_pp_multiconcat:
>
>     const_lens = aux + PERL_MULTICONCAT_IX_LENGTHS;
>
>     if (dst_utf8) {
>         const_pv = aux[PERL_MULTICONCAT_IX_UTF8_PV].pv;
>         if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
>             && const_pv != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv)
>             /* separate sets of lengths for plain and utf8 */
>             const_lens += nargs + 1;    /* <---- current line */
> 
> Here is a look at aux:
>
> # ----- dump of aux from Perl_pp_multiconcat
> # header
> aux[0] = 1
> aux[1] = \377
> aux[2] = 1
> aux[3] = "ÿ"
> aux[4] = 2
>
> # first set of segment lengths
> aux[5] = 1    # <---- const_lens
> aux[6] = -1
> # second set of segment lengths, which was not returned by the B API
> aux[7] = 2
> aux[8] = -1
> 
> 
> I'm not sure that adding such a rule is good enough, but it fixes the
> cases where we would previously read only the first segment.
> 
> # Suggested patch to B API for aux_list/OP_MULTICONCAT
> if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
>     && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv
>     && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv
>            != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv) {
>     /* read the additional segment */
>     nargs += 2;
> }
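
For reference, the aux layout in the dump and the quoted test from Perl_pp_multiconcat can be modelled in standalone C. This is only a sketch under stated assumptions: aux_item, the IX_* names, example_aux and select_lens() are hypothetical stand-ins rather than perl's own definitions, and the index values are simply read off the dump.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for perl's UNOP_AUX_item (assumption). */
typedef union {
    const char *pv;     /* pointer to constant string data */
    ptrdiff_t   ssize;  /* signed length or count */
} aux_item;

/* Index names mirror PERL_MULTICONCAT_IX_*; the values are inferred
 * from the dump (five header slots, then the segment lengths). */
enum {
    IX_NARGS     = 0,
    IX_PLAIN_PV  = 1,
    IX_PLAIN_LEN = 2,
    IX_UTF8_PV   = 3,
    IX_UTF8_LEN  = 4,
    IX_LENGTHS   = 5
};

/* The aux array from the dump: nargs = 1, plain "\xff" (1 byte),
 * utf8 "\xc3\xbf" (2 bytes), then two sets of nargs + 1 lengths. */
static const aux_item example_aux[] = {
    { .ssize = 1 },           /* [0] nargs               */
    { .pv    = "\377" },      /* [1] plain pv            */
    { .ssize = 1 },           /* [2] plain len           */
    { .pv    = "\xc3\xbf" },  /* [3] utf8 pv             */
    { .ssize = 2 },           /* [4] utf8 len            */
    { .ssize = 1 },           /* [5] plain set, length 0 */
    { .ssize = -1 },          /* [6] plain set, length 1 */
    { .ssize = 2 },           /* [7] utf8 set, length 0  */
    { .ssize = -1 }           /* [8] utf8 set, length 1  */
};

/* Pick the set of segment lengths the same way the quoted code does:
 * start at the plain set and skip nargs + 1 entries when the
 * destination is utf8 and the two representations differ. */
static const aux_item *
select_lens(const aux_item *aux, int dst_utf8)
{
    const aux_item *lens;

    assert(aux != NULL);
    lens = aux + IX_LENGTHS;
    if (dst_utf8
        && aux[IX_PLAIN_PV].pv
        && aux[IX_UTF8_PV].pv != aux[IX_PLAIN_PV].pv)
        lens += aux[IX_NARGS].ssize + 1;
    return lens;
}
```

With this model, select_lens(example_aux, 1) points at slot 7 (the utf8 set) and select_lens(example_aux, 0) points at slot 5 (the plain set).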

Considering that the aux_list() code for OP_MULTICONCAT turns the offsets into character offsets rather than byte offsets, won't the 2 from:

> aux[7] 2
> aux[8] -1

be converted into a 1, making it the same as the first segment?

I don't know what extra useful information you would get from this change.
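
As a rough illustration of that byte-to-character conversion, here is a standalone sketch in C. It is not the B.xs code: utf8_char_count() and segment_chars() are hypothetical helpers, the input is assumed to be well-formed UTF-8, and passing non-positive lengths such as the -1 entries through unchanged is an assumption based on the dump.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a well-formed UTF-8 byte range by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Hypothetical helper, not a perl API function. */
static size_t
utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t chars = 0;
    size_t i;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Convert one segment length from bytes to characters.
 * Assumption: values <= 0 (such as -1) are not byte counts and are
 * returned unchanged. */
static ptrdiff_t
segment_chars(const unsigned char *seg, ptrdiff_t bytes)
{
    if (bytes <= 0)
        return bytes;
    return (ptrdiff_t)utf8_char_count(seg, (size_t)bytes);
}
```

For the 2-byte UTF-8 encoding of "ÿ" (0xC3 0xBF), segment_chars() gives 1, which is the same value as the plain-set length in aux[5].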

Tony

---
via perlbug:  queue: perl5 status: new
https://rt.perl.org/Ticket/Display.html?id=133535


