perl.perl5.porters | Postings from October 2011

Re: The Unicode Bug, unicode_strings and utf8

From: Karl Williamson
Date: October 26, 2011 19:11
Subject: Re: The Unicode Bug, unicode_strings and utf8
Message ID: 4EA8BDCB.4050705@khwilliamson.com

On 10/23/2011 03:14 PM, Mons Anderson wrote:
> I've looked through recent changes and discussions about unicode_strings
> and "The Unicode Bug". I've also looked through the history of
> introduction of UTF8 flag.
>
> Let's compare feature unicode_strings with use utf8
>
> When we use pragma utf8, we implicitly raise the UTF8 flag on strings,
> that are not written as "\x{...}"
> Except those in range 7f-ff.
> I.e. it's like utf8::decode on source text.

>
> When we use unicode_strings we imply, that bytes in range 7f-ff should
> behave like if they were utf8::upgrade'd

'use utf8' says that the source code text is encoded in utf8.
'unicode_strings', as you said, gives the characters 80-ff the same
semantics they would have if they were stored in utf8.
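To make the distinction concrete, here is a minimal sketch, assuming Perl 5.12 or later on an ASCII platform with no locale in effect. The variable names are only illustrative, and utf8::decode is used as a rough stand-in for what 'use utf8' does to a non-ASCII literal, so that the example does not depend on how the file itself is saved.

```perl
use strict;
use warnings;

# What 'use utf8' is about: how source bytes become characters.
# 0xC3 0xA9 is the UTF-8 encoding of U+00E9. Left alone, the literal
# is a 2-character string; decoded, it is a single character.
my $source_bytes = "\xC3\xA9";
my $decoded      = $source_bytes;
utf8::decode($decoded);
print length($source_bytes), " ", length($decoded), "\n";   # 2 1

# What 'unicode_strings' is about: the semantics of 80-ff in a
# string that is NOT stored as utf8 internally.
my $s = "\xE9";                  # U+00E9, held as the single byte 0xE9

my $plain_uc = uc $s;            # no feature: case mapping is ASCII-only

my $upgraded = $s;
utf8::upgrade($upgraded);        # same character, now stored as utf8
my $upgraded_uc = uc $upgraded;  # Unicode rules apply

my $feature_uc;
{
    use feature 'unicode_strings';
    $feature_uc = uc $s;         # Unicode rules, no upgrade needed
}

printf "plain:    %02X\n", ord $plain_uc;      # E9
printf "upgraded: %02X\n", ord $upgraded_uc;   # C9
printf "feature:  %02X\n", ord $feature_uc;    # C9
```

The first half changes which characters the string contains; the second half changes only how an already existing character in the 80-ff range behaves.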
>
> But why are there so many workarounds throughout the code to support
> byte strings as utf8 strings if we could just implicitly upgrade the
> source string and have everything work?

As I understand it, a fork of Perl called Kurila does exactly that.
This happened before I was involved, but I understand that the Perl
community decided that they did not want to go this route.
Operating on a utf8 string is much slower than on a non-utf8 string.
There are things that could speed this up, but in the meantime, this is
a big obstacle.
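For anyone who wants to measure the difference on their own build, a rough sketch using the core Benchmark module follows. The choice of operation (a scalar reverse over a 100,000-character string) and the labels are my own assumptions for illustration; the gap depends heavily on the operation and the Perl version, so no figures are given here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The same 100,000 characters, stored two ways.
my $bytes    = "\xE9" x 100_000;   # one byte per character
my $upgraded = $bytes;
utf8::upgrade($upgraded);          # two bytes per character internally

# Run each case for at least one CPU second and print a comparison table.
cmpthese(-1, {
    non_utf8 => sub { my $r = reverse $bytes },
    utf8     => sub { my $r = reverse $upgraded },
});
```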

> If we do so, then other modules should not know about feature unicode
> strings, they just receive unicode string and work with it.

This would break modules that don't handle utf8, as well as those that
expect 80-ff not to behave as Latin1 characters.  As someone once
pointed out on this list, it might make 80% of them work with fewer
bugs, but it might break others.
>
> With current solution, as I understand it, we got a situation, when
> almost all old code (esp XS) become inconsistent with current native
> perl behavior ("\262" == UTF8+"\302\262").

I don't understand your point.  If you don't use 'unicode_strings', your 
code will work as it used to before Unicode came along.
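For readers unfamiliar with the "\262" == UTF8+"\302\262" notation above, a small sketch of what it refers to, using only the utf8:: functions that are always available. The variable names are illustrative.

```perl
use strict;
use warnings;

my $native   = "\262";         # U+00B2, stored as the single byte 0xB2
my $upgraded = $native;
utf8::upgrade($upgraded);      # same character, stored as 0xC2 0xB2

# At the Perl level the two are the same string.
print "equal\n" if $native eq $upgraded;                   # equal
print length($native), " ", length($upgraded), "\n";       # 1 1

# Only the internal flag differs.
print utf8::is_utf8($native)   ? 1 : 0, "\n";              # 0
print utf8::is_utf8($upgraded) ? 1 : 0, "\n";              # 1

# The bytes held in the upgraded string's buffer, made visible by
# encoding a copy: this is the "\302\262" in the notation.
my $internal = $upgraded;
utf8::encode($internal);
printf "%vo\n", $internal;                                 # 302.262
```

Code that reads the raw buffer without consulting the flag sees one byte in the first case and two in the second, even though Perl considers the strings equal.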
>
> It would also be good to hear the opinion of Karl Williamson, since he
> didn't post anything on my previous message about "the unicode bug"

I agreed with what others had said.
>
> And if this is not a possible solution (for some reason), then it's worth
> writing a short guide about the interaction between bytes, locale, utf8,
> unicode_strings, utf8::* and especially how XS works with all this stuff.

That's a good idea, and I believe Chip was working on something like
that.  I myself have no expertise in the XS arena, but I could
contribute to other parts.  Recently published and forthcoming revised
editions of books on Perl may address some of your concerns.
>
> Perl is changing too fast, I'm very happy with it, but I think we need
> better docs.
>
> --
> Best wishes,
> Vladimir V. Perepelitsa aka Mons Anderson
> <inthrax@gmail.com <mailto:inthrax@gmail.com>>, <mons@cpan.org
> <mailto:mons@cpan.org>>
> http://github.com/Mons

