develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

March 30, 2007 17:01


On Friday 30 March 2007 23:06:47 Marvin Humphrey wrote:
> On Mar 30, 2007, at 2:25 PM, Juerd Waalboer wrote:
> >> That so many users, including those as expert as Marc, possess a
> >> "broken" understanding of Perl's Unicode model suggests a flawed
> >> design.
> > I think the design is solid, but the implementation (see regex)
> > slightly broken and documentation wildly misleading.
> I strongly disagree with this assessment.  In particular, I think
> insisting that the user be responsible for manually segregating
> character and byte-oriented data without any help from Perl is
> totally unreasonable.
> Look at how easily Marc made the "mistake" of commingling the two
> types of data.  It's debatable whether the fact that Perl allowed him
> to do that without complaint is a flaw with the design or the
> implementation, but it's one or the other and it's serious.
> Additionally, as Marc points out, there are lots of broken XS modules
> out there -- including one of mine. (KinoSearch 0.15 -- Unicode
> support is fixed as of 0.20_01, which breaks backwards
> compatibility.)  Few or none of them would be broken if Perl made it
> more difficult to move between character data and byte-oriented data
> -- errors would be flying right and left and the broken modules would
> get fixed right away.
> Of course I understand why that cannot be the case, but it's
> astonishing to me that you see this as a problem which can be solved
> via documentation.

I think documentation alone isn't enough. We already have things like
"strict", so if the current Perl model doesn't even let you detect when you
mix the wrong kinds of data, then we need a module or pragma that catches
these errors.

Of course encoding::warnings exists, but it does not seem able to
distinguish between "untagged" data and real ISO-8859-1 strings, as Perl
itself doesn't make this distinction.
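
To make that concrete, here is a minimal sketch, assuming nothing beyond the core Encode module. The variable names and byte values are made up for illustration. It shows that a string of raw octets and the same octets explicitly decoded as ISO-8859-1 are the same string as far as Perl's string operations are concerned, and that both mix with a wide character without complaint:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw octets with no declared meaning ("untagged" binary data).
my $octets = "\xE9\x01\xFF";
my $binary = $octets;

# The same octets, explicitly decoded as ISO-8859-1 text.
my $latin1_text = decode('ISO-8859-1', $octets);

# Mixing either one with a wide character succeeds silently.
my $wide         = "\x{263A}";
my $mixed_binary = $binary . $wide;
my $mixed_text   = $latin1_text . $wide;

# Perl's string semantics see no difference between the two cases.
print "plain strings equal: ", ($binary eq $latin1_text ? "yes" : "no"), "\n";
print "mixed strings equal: ", ($mixed_binary eq $mixed_text ? "yes" : "no"), "\n";
print "mixed length: ", length($mixed_binary), "\n";
```

Since the two values compare equal and behave identically under concatenation, a checker working only from what Perl exposes at this level has nothing to tell "binary" apart from "really ISO-8859-1".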

> How about encouraging the use of encoding::warnings in perlunitut?
> How about adding it to core and having 'use 5.10;' turn it on?

If I understand correctly, that would not be enough due to the "is this 
binary or really iso-8859-1 encoded data" problem mentioned above.

all the best,





