Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings
From:
demerphq
Date:
March 20, 2007 03:53
Subject:
Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings
Message ID:
9b18b3110703200353k2e84101h80ed842a3602b049@mail.gmail.com
Hello Gentlemen,
Was wondering if either of you had any comments or thoughts on the
attached patches and test files. This matter seems to be warnocked
until one or both of you utf8/unicode/encoding experts give your
feedback...
Cheers,
Yves
On 2/17/07, demerphq <demerphq@gmail.com> wrote:
> On 2/17/07, via RT John Berthels <perlbug-followup@perl.org> wrote:
> > # New Ticket Created by "John Berthels"
> > # Please include the string: [perl #41527]
> > # in the subject line of all future correspondence about this issue.
> > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41527 >
> >
> >
> > This is a bug report for perl from jjberthels@gmail.com,
> > generated with the help of perlbug 1.35 running under perl v5.8.8.
> >
> >
> > -----------------------------------------------------------------
> > [Please enter your report here]
> >
> > Hi.
> >
> > The documentation for the 'decode' function in Encode.pm states:
> >
> > ...the utf8 flag for $string is on unless $octets entirely
> > consists of ASCII data...
> >
> > but it appears that decode turns on the flag even if the input string is
> > plain ASCII. A test case demonstrating this is appended below.
> >
> > I understand this doesn't make a difference from a correctness point of
> > view, but it does change the performance characteristics, presumably due
> > to the use of the unicode regex engine (the profile showed something like
> > SWASHNEW taking a lot of time).
> >
> > An older version of Encode (I believe v 2.01) had the behaviour described
> > in the docs, and would have passed the test case below.
> >
> > In my case, the application is required to process utf8 data correctly,
> > but the vast majority of data is plain ascii. This change in behaviour from
> > 2.01 is causing a noticeable increase in CPU usage.
> >
> > I'm currently working around this with a regexp test /[\x80-\xff]/ on the
> > byte string and avoiding calling Encode::decode in this case, but a quick
> > check on perlmonks led to a suggestion that I raise this as a perlbug:
> > http://perlmonks.org/?node_id=600050 (although opinion was divided on
> > whether this was a bug).
> >
> > I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst)
> > on line 453. I don't know whether a good fix would be to add an additional
> > loop over the string to check the flag there or keep the 'only loop over
> > the string once' behaviour by passing a "was the string plain ascii" flag
> > back from process_utf8().
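
For illustration, the two options described above can be sketched in standalone C. This is not code from Encode.xs: the names bytes_are_ascii and copy_and_note_ascii are hypothetical, and the copy loop only stands in for whatever per-byte work process_utf8() really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Option 1 (hypothetical helper): a separate pass over the buffer.
 * Returns true when every byte in buf[0..len) is below 0x80. */
static bool bytes_are_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80) {
            return false;   /* the first high byte settles the answer */
        }
    }
    return true;
}

/* Option 2 (hypothetical helper): fold the check into a loop that
 * already visits every byte. The copy is a stand-in for the decode
 * work; the return value is the "was the string plain ascii" flag. */
static bool copy_and_note_ascii(const unsigned char *src, size_t len,
                                unsigned char *dst)
{
    unsigned char seen = 0;
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
        seen |= src[i];     /* accumulate the high bit of every byte */
    }
    return (seen & 0x80) == 0;
}
```

With either helper, the caller would set the flag only when the result is false, along the lines of `if (!ascii) SvUTF8_on(dst);`. The first option costs an extra scan but can stop at the first high byte; the second adds one OR per byte to a loop that runs anyway.
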
>
> I looked into more or less this strategy but, well, I'm not sure it works out.
>
> I have to say the code in Encode.* is kinda confusing to this ascii
> type programmer.
>
> > I'll happily try to whip up a patch of either solution if you agree this
> > needs changing and let me know which approach you prefer.
> >
> > regards,
> >
> > jb
> >
> >
> > #!/usr/bin/perl
> > use warnings;
> > use strict;
> > use Test::More (tests => 2);
> >
> > use Encode;
> >
> > my $ascii_bytes = "l\xf8\xf8k - a latin1 string";
> > my $latin1_bytes = "this is plain ascii";
>
> Er, aren't these backwards? \xf8 isn't in ASCII, it's in Latin-1. ASCII is
> a 7-bit encoding.
>
> > my $encoded_str = Encode::decode_utf8($latin1_bytes);
> > ok(Encode::is_utf8($encoded_str),
> > "(check encode is working) non-ascii latin-1 byte string becomes char str");
> >
> > $encoded_str = Encode::decode_utf8($ascii_bytes);
> > ok(! Encode::is_utf8($encoded_str),
> > "but ascii byte string untagged after decode");
>
> I changed the code to the attached perl script, encode.pl, and I get
> the following output with perl 5.8.6, Encode version 2.09:
>
> D:\dev\perl\ver\zoro\win32>perl encode.pl
> 1..2
> SV = PV(0x15d5914) at 0x1a6c864
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> PV = 0x15dd674 "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 27
> not ok 1 - (check encode is working) non-ascii latin-1 byte string
> becomes char str
> # Failed test '(check encode is working) non-ascii latin-1 byte
> string becomes char str'
> # in encode.pl at line 13.
> SV = NULL(0x0) at 0x1a6c720
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY)
> ----------
> SV = PV(0x1bdb7c4) at 0x1bde2f4
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> PV = 0x1bf38f4 "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 22
> ok 2 - but ascii byte string untagged after decode
> SV = PVMG(0x1bda7b4) at 0x1bd7d74
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> IV = 0
> NV = 0
> PV = 0x1bf095c "this is plain ascii"\0
> CUR = 19
> LEN = 20
> # Looks like you failed 1 test of 2.
>
> Note the null return for the unicode string with high byte chars in it.
>
> Now here it is with a blead perl with the attached patch applied; notice
> it has correct output for both strings and passes the tests:
>
> D:\dev\perl\ver\zoro\win32>..\perl encode.pl
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PV(0x1b400bc) at 0x1b3bf94
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1b9cdfc "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 28
> ----------
> SV = PV(0x1bb6f1c) at 0x1b68ccc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 24
> ok 2 - but ascii byte string untagged after decode
> SV = PV(0x1bb6f1c) at 0x1b68bdc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK)
> PV = 0x1b2985c "this is plain ascii"\0
> CUR = 19
> LEN = 20
>
>
> Now here it is with an unpatched blead:
>
> Everything is up to date. 'nmake test' to run test suite.
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PVMG(0x1b60b5c) at 0x1b3bf94
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x1b4aa14 "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 28
> MAGIC = 0x1b4b544
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 22
> ----------
> SV = PV(0x1bb7084) at 0x1b68ccc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1bbf8bc "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 24
> not ok 2 - but ascii byte string untagged after decode
> # Failed test 'but ascii byte string untagged after decode'
> # at encode.pl line 21.
> SV = PVMG(0x1b60b9c) at 0x1b68bdc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x1a8d404 "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 20
> MAGIC = 0x1bb9d94
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 19
> # Looks like you failed 1 test of 2.
>
> Guess why we get this output? Because current decode_utf8 no-ops when
> the input string is already utf8 (contrary to the docs). Remove that
> noop line (line 196 in Encode.pm) and here is what happens:
>
> Everything is up to date. 'nmake test' to run test suite.
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8
> "l\x{f8}\x{f8}k - a latin1 string"]
> CUR = 24
> LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PV(0x1b400bc) at 0x1b3bf94
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1b9cdfc "l\357\277\275\357\277\275k - a latin1 string"\0
> [UTF8 "l\x{fffd}\x{fffd}k - a latin1 string"]
> CUR = 26
> LEN = 28
> ----------
> SV = PV(0x1bb6f1c) at 0x1b68ccc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 24
> not ok 2 - but ascii byte string untagged after decode
> # Failed test 'but ascii byte string untagged after decode'
> # at encode.pl line 21.
> SV = PV(0x1bb6f1c) at 0x1b68bdc
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1b2985c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
> CUR = 19
> LEN = 20
> # Looks like you failed 1 test of 2.
>
> Notice the \x{fffd}\x{fffd}, which come from this code (line 431 in Encode.xs):
>
>     if (SvUTF8(src)) {
>         s = utf8_to_bytes(s,&slen);
>         if (s) {
>             SvCUR_set(src,slen);
>             SvUTF8_off(src);
>             e = s+slen;
>         }
>         else {
>             croak("Cannot decode string with wide characters");
>         }
>     }
>
> This doesn't seem logical, and tracing the code shows it doesn't work. The
> valid utf8 sequence gets converted to its byte form and then passed to
> utf8n_to_uvuni(), which naturally fails to decode it.
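
The failure described above can be reproduced outside perl with a small standalone C sketch. Both functions are simplified, hypothetical stand-ins (downgrade_to_bytes for utf8_to_bytes(), decode_utf8_lossy for the utf8n_to_uvuni() loop) and handle only one- and two-byte sequences, which is enough for the Latin-1 string in the test script. The real perl routines accept longer sequences and reject the lone 0xF8 bytes for their own reasons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Stand-in for utf8_to_bytes(): rewrites UTF-8 in place as one byte per
 * character. Returns the new length, or (size_t)-1 on a character above
 * U+00FF or malformed input (the buffer may then be partly rewritten). */
static size_t downgrade_to_bytes(unsigned char *s, size_t len)
{
    size_t in = 0, out = 0;
    assert(s != NULL || len == 0);
    while (in < len) {
        unsigned char c = s[in];
        if (c < 0x80) {
            s[out++] = c;
            in += 1;
        } else if ((c == 0xC2 || c == 0xC3) && in + 1 < len
                   && (s[in + 1] & 0xC0) == 0x80) {
            s[out++] = (unsigned char)(((c & 0x03) << 6) | (s[in + 1] & 0x3F));
            in += 2;
        } else {
            return (size_t)-1;
        }
    }
    return out;
}

/* Stand-in for the decode loop: emits U+FFFD for every byte it cannot
 * interpret. Returns the number of code points written to out. */
static size_t decode_utf8_lossy(const unsigned char *s, size_t len,
                                uint32_t *out, size_t out_cap)
{
    size_t in = 0, n = 0;
    assert(out != NULL || out_cap == 0);
    while (in < len && n < out_cap) {
        unsigned char c = s[in];
        if (c < 0x80) {
            out[n++] = c;
            in += 1;
        } else if (c >= 0xC2 && c <= 0xDF && in + 1 < len
                   && (s[in + 1] & 0xC0) == 0x80) {
            out[n++] = ((uint32_t)(c & 0x1F) << 6) | (s[in + 1] & 0x3F);
            in += 2;
        } else {
            out[n++] = REPLACEMENT;   /* malformed for this sketch */
            in += 1;
        }
    }
    return n;
}
```

Decoding the bytes 'l' C3 B8 C3 B8 'k' directly gives 'l', U+00F8, U+00F8, 'k'. Running downgrade_to_bytes() first leaves 'l' F8 F8 'k', and decoding that gives 'l', U+FFFD, U+FFFD, 'k', which matches the \x{fffd}\x{fffd} in the dump.
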
>
> I hope this analysis is useful to someone. It seems to me that the
> current behaviour is wrong, but I don't understand it well enough to
> say for sure.
>
> Cheers,
> Yves
>
> --
> perl -Mre=debug -e "/just|another|perl|hacker/"
>
>
--
perl -Mre=debug -e "/just|another|perl|hacker/"