perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Marvin Humphrey
Date: March 30, 2007 16:07
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 304BBC5F-D8A8-4847-AF2E-E3677EA2B7CD@rectangular.com

On Mar 30, 2007, at 2:25 PM, Juerd Waalboer wrote:
>> That so many users, including those as expert as Marc, possess a
>> "broken" understanding of Perl's Unicode model suggests a flawed
>> design.
>
> I think the design is solid, but the implementation (see regex) slightly
> broken and documentation wildly misleading.

I strongly disagree with this assessment.  In particular, I think  
insisting that the user be responsible for manually segregating  
character and byte-oriented data without any help from Perl is  
totally unreasonable.

Look at how easily Marc made the "mistake" of commingling the two  
types of data.  It's debatable whether the fact that Perl allowed him  
to do that without complaint is a flaw with the design or the  
implementation, but it's one or the other and it's serious.
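To make the commingling concrete, here is a minimal sketch using only the core Encode module. The variable names and sample strings are my own illustration, not code from Marc's report. It joins a string of raw UTF-8 bytes to a decoded character string; Perl raises no error or warning and treats each byte as a Latin-1 character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte-oriented data: "e with acute accent" as two raw UTF-8 bytes,
# the way it would arrive from a file or socket with no decoding layer.
my $bytes = "\xC3\xA9";

# Character data: the same text decoded to one character, plus U+263A.
my $chars = decode('UTF-8', $bytes) . "\x{263A}";

# Perl concatenates the two without complaint. Each raw byte is
# treated as a Latin-1 character in the combined string.
my $mixed = $bytes . $chars;

printf "length(bytes) = %d\n", length $bytes;
printf "length(chars) = %d\n", length $chars;
printf "length(mixed) = %d\n", length $mixed;

# Encoding the result double-encodes the original bytes:
# C3 A9 becomes C3 83 C2 A9 on output.
my $out = encode('UTF-8', $mixed);
printf "encoded length = %d\n", length $out;
```

Nothing in the run flags the point where the two kinds of data met; the damage only shows up later as mojibake in the output.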

Additionally, as Marc points out, there are lots of broken XS modules  
out there -- including one of mine. (KinoSearch 0.15 -- Unicode  
support is fixed as of 0.20_01, which breaks backwards  
compatibility.)  Few or none of them would be broken if Perl made it  
more difficult to move between character data and byte-oriented data  
-- errors would be flying right and left and the broken modules would  
get fixed right away.

Of course I understand why that cannot be the case, but it's  
astonishing to me that you see this as a problem which can be solved  
via documentation.

I hope that Perl 6 does not opt to replicate Perl 5's behavior in  
this area (my understanding is that it will not, but I'm not  
following development closely).  I hope that project is taking into  
account the lessons we have learned in the wake of very difficult  
compromises about how to balance the addition of Unicode with  
preserving backwards compatibility.

> Surely you must know a way in which Perl's unicode support can be
> improved, or accidents avoided, without trying to change all of Perl,
> CPAN, and a gazillion lines of code that we can't even reach. Let's hear
> it! :)

How about encouraging the use of encoding::warnings in perlunitut?

How about adding it to core and having 'use 5.010;' turn it on?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




