
The Unicode Bug, unicode_strings and utf8

From: Mons Anderson
Date: October 23, 2011 14:15
Subject: The Unicode Bug, unicode_strings and utf8
Message ID: CAOgz_58W4taNb3WRFdDFsZV1kvH56kvo4_EhrtV7fJ8TMWn=gw@mail.gmail.com

I've looked through the recent changes and discussions about unicode_strings
and "The Unicode Bug". I've also looked through the history of the
introduction of the UTF8 flag.

Let's compare the unicode_strings feature with use utf8.

When we use the utf8 pragma, we implicitly raise the UTF8 flag on string
literals that are written as literal characters rather than as "\x{...}"
escapes (escapes in the range 80-ff stay plain byte strings).
I.e. it works like utf8::decode on the source text.
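
To make that concrete, here is a small sketch of what I mean. It assumes the script itself is saved as UTF-8 and that no locale or other pragma is in effect; the variable names are only for illustration.

```perl
use strict;
use warnings;

# This file must be saved as UTF-8: the two literals below contain the
# raw bytes \xC2\xB2 (U+00B2 SUPERSCRIPT TWO) in the source text.
my $with_pragma    = do { use utf8; "²" };
my $without_pragma = do { no utf8;  "²" };

# An escape below 0x100 is left alone by the pragma.
my $escaped = do { use utf8; "\x{b2}" };

# Decoding the raw bytes by hand gives the same string the pragma produced.
my $decoded = $without_pragma;
utf8::decode($decoded);

for my $case (
    [ 'use utf8'     => $with_pragma ],
    [ 'no utf8'      => $without_pragma ],
    [ '"\x{b2}"'     => $escaped ],
    [ 'utf8::decode' => $decoded ],
) {
    my ($label, $str) = @$case;
    printf "%-13s length=%d UTF8 flag=%s\n",
        $label, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```

The first and last lines should report the same thing (one character, flag on), while the literal compiled without the pragma stays a two-byte string.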

When we use unicode_strings we imply that bytes in the range 80-ff should
behave as if they were utf8::upgrade'd.
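
A minimal sketch of the comparison, using uc() as the test operation and assuming perl 5.12 or later with no locale in effect (the variable names are mine):

```perl
use strict;
use warnings;

my $byte = "\xe9";    # one character, U+00E9, stored as a single byte

# Without the feature, uc() applies ASCII-only rules to a byte string.
my $plain = uc $byte;

# With the feature, the same byte string gets Unicode rules.
my $featured = do { use feature 'unicode_strings'; uc $byte };

# Upgrading changes only the internal storage, yet uc() now behaves the
# Unicode way even outside the scope of the feature.
my $upgraded = $byte;
utf8::upgrade($upgraded);
my $upgraded_uc = uc $upgraded;

printf "no feature:       %02X\n", ord $plain;        # E9, unchanged
printf "unicode_strings:  %02X\n", ord $featured;     # C9
printf "utf8::upgrade'd:  %02X\n", ord $upgraded_uc;  # C9
```

So at least for case mapping, the feature and an explicit utf8::upgrade give the same visible result; the difference is whether the string itself or the calling scope carries the information.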

But why are there so many workarounds throughout the code to support byte
strings as utf8 strings, when we could just implicitly upgrade the source
string and have everything work?
If we did so, other modules would not need to know about the unicode_strings
feature; they would just receive a unicode string and work with it.

With the current solution, as I understand it, we have a situation where
almost all old code (esp. XS) becomes inconsistent with current native perl
behavior ("\262" == UTF8+"\302\262").
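
The equality in parentheses can be checked from pure Perl. This is only a sketch: bytes::length() is used here as a stand-in for what XS code would see if it read the buffer with SvPV without checking SvUTF8.

```perl
use strict;
use warnings;
use bytes ();    # only for bytes::length(); the pragma itself is not enabled

my $native   = "\262";
my $upgraded = "\262";
utf8::upgrade($upgraded);    # same character, now stored as "\302\262"

printf "equal at the Perl level: %s\n", $native eq $upgraded ? 'yes' : 'no';
printf "native:   UTF8 flag=%d, %d byte(s) in the buffer\n",
    utf8::is_utf8($native) ? 1 : 0, bytes::length($native);
printf "upgraded: UTF8 flag=%d, %d byte(s) in the buffer\n",
    utf8::is_utf8($upgraded) ? 1 : 0, bytes::length($upgraded);
```

Perl says the two strings are equal, but code that looks only at the raw buffer gets one byte in the first case and two in the second.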

It would also be good to hear the opinion of Karl Williamson, since he
didn't reply to my previous message about "the unicode bug".

And if this is not a possible solution (for some reason), then it's worth
writing a short guide about the interaction between bytes, locale, utf8,
unicode_strings, utf8::* and especially how XS works with all this stuff.

Perl is changing very fast. I'm very happy with that, but I think we need
better docs.

-- 
Best wishes,
Vladimir V. Perepelitsa aka Mons Anderson
<inthrax@gmail.com>, <mons@cpan.org>
http://github.com/Mons
