develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 17:12
On Sat, Mar 31, 2007 at 12:19:16AM +0000, Tels <> wrote:
> Anyway, I wasn't aware that any non-utf8 data in Perl is *always* 
> ISO-8859-1, I thought that, when not specified, this depended on some other 
> stuff. Guess I need to reread the tutorials. :)

Heh, because it's not true :)

> However, this also poses the question: How does Perl know that your data is 
> in KOI8-R?

It doesn't. Perl ideally only interprets character indices as unicode
codepoints (I am ignoring use locale and similar issues here). So when you
want to match your koi8-r data against a regex, you need to decode it first.
Perl itself doesn't know the encoding; only the decode step treats your
data as KOI8-R, and afterwards it is unicode.
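A minimal sketch of that decode-then-match step, assuming the core Encode
module is available. The sample octets (the KOI8-R bytes for a three-letter
Russian word) and the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: three KOI8-R octets (Cyrillic letters em, i, er).
my $octets = "\xCD\xC9\xD2";

# Undecoded, perl sees the characters U+00CD, U+00C9, U+00D2,
# which are Latin letters, so a Cyrillic property match fails.
my $raw_matches = $octets =~ /^\p{Cyrillic}+$/ ? 1 : 0;

# After decoding, the string holds U+043C, U+0438, U+0440.
my $text = decode('KOI8-R', $octets);
my $text_matches = $text =~ /^\p{Cyrillic}+$/ ? 1 : 0;

print "raw: $raw_matches, decoded: $text_matches\n";   # raw: 0, decoded: 1
```

Nothing in $octets tells perl which encoding it is in; the only place
"KOI8-R" appears is the explicit decode call.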

Unless you force perl to apply unicode interpretations to your characters,
they are completely encoding-free.

> One of the limitations of the "there can be only two encodings" of Perl 
> seems to be that strings are permanently upgraded:

That's the root of the problem. There aren't two encodings. There is only one:
characters concatenated to form strings.

Internally, Perl currently has two forms for that, just as perl can store
real integers and doubles in a scalar.

But on the Perl level, "5", "5.0", 5 and utf8-encoded 5 are all the same.

> 	if ($iso_8859_1 eq $utf8) { ... }
> Please correct me if I am wrong, but I do think it is not possible to 
> keep both variables in their current encoding and only temporarily upgrade 
> them to utf8 (for the common encoding that contains both of them)?

It is, but likely not very efficient, as in most such cases you actually
want utf-x internally. Except for optimisation purposes (where I see
downgrade and upgrade as well-warranted), you do not have to care, as perl
handles that automatically.
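As a rough illustration of the comparison from the quoted snippet, here is
a sketch using the core utf8:: functions. The variable names follow the
quote; the sample string is an assumption:

```perl
use strict;
use warnings;

my $iso_8859_1 = "caf\xE9";    # stored as plain octets internally
my $utf8       = "caf\xE9";
utf8::upgrade($utf8);          # same characters, UTF-X internally

my $equal = $iso_8859_1 eq $utf8 ? 1 : 0;

# Internal flags after the comparison: neither operand was converted.
my $flag_a = utf8::is_utf8($iso_8859_1) ? 1 : 0;
my $flag_b = utf8::is_utf8($utf8)       ? 1 : 0;

print "equal: $equal, flags: $flag_a/$flag_b\n";   # equal: 1, flags: 0/1
```

The comparison is made on characters, and both scalars keep whatever
internal form they had before it.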

> After reading this discussion here, a lot of problems also seem to stem from 
> the fact that the upgrade to utf8 is permanent, silently and 
> done "behind-the-scenes". Just like 1 + 2.0 will result in 3.0 and not 3 
> and we all know how much confusion this creates :) (heh, I fell for it 
> today, even tho I should have known better :)

No, there is no problem in most cases, as the upgrade does not change the
scalar in any way (except, again, for speed). Or at least should.

Perl achieves that goal by transparently re-encoding its internal format
as required. Re-encoding in that way does not change the semantics of the
string, except:

- when you hit a bug in perl
- when you use unpack "C".

So in a bug-free perl without unpack, everything just works and you never
need to care about whether perl stores the data as UCS-4, UTF-X or octets
in memory.

That's the "sane" model introduced with 5.6 and mostly achieved with 5.8.8.

The problems are the remaining bugs AND unpack, the latter of which breaks
existing programs that assume unpack "C" has byte semantics when, in
fact, it returns the internal encoding that perl normally hides from you
and tells you to ignore.
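To make "the internal encoding that perl normally hides" concrete, here is
a small sketch. Because what unpack "C" returns for an upgraded string
depends on the perl release, this example peeks at the internals with
bytes::length from the core bytes module instead; that substitution is an
assumption of the sketch, not something taken from this thread:

```perl
use strict;
use warnings;
use bytes ();    # load the bytes:: functions without enabling the pragma

my $octet_form = "\xE9";       # one character, stored as one octet
my $utfx_form  = $octet_form;
utf8::upgrade($utfx_form);     # same character, stored as two octets

# Character-level view: identical.
my $chars_a = length($octet_form);        # 1
my $chars_b = length($utfx_form);         # 1

# Internal view: differs, and is what leaks when code looks at raw storage.
my $bytes_a = bytes::length($octet_form); # 1
my $bytes_b = bytes::length($utfx_form);  # 2

print "chars: $chars_a/$chars_b, internal bytes: $bytes_a/$bytes_b\n";
```

Any code that depends on the second pair of numbers sees a difference
between two strings that are equal at the Perl level.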

If those remaining problems were fixed (that includes SvPV), the only
difference between utf-x encoding and octet-encoding within perl would be
speed, but not semantics.

That's the beauty.

Juerd's goal of having the UTF-X flag exposed and making you think about
when perl upgrades and downgrades (and making you avoid the upgrades) is
horrible, as it forces a lot of administration on the programmer, much of
which perl already claims to do; at the moment you only have to know
your UTF-X flag in a few cases.

> > The same type of string can be used for binary data, because in the
> > unicode encoding "latin1", all 256 codepoints map to the same byte
> > values.

latin1 is not a unicode encoding in the first place.

Also, I find it much more natural to represent bytes as characters 0..255 in
perl, as opposed to Juerd's definition of characters 0..255 with the internal
UTF-X flag cleared.
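A short sketch of the "bytes are just characters 0..255" view, assuming
the core Encode module; the euro sign is an arbitrary sample character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encoding U+20AC as UTF-8 yields three bytes, each a character in 0..255.
my $octets = encode('UTF-8', "\x{20AC}");
my @values = map { ord } split //, $octets;    # (226, 130, 172)

# Flipping the internal representation leaves those values untouched.
my $copy = $octets;
utf8::upgrade($copy);
my @after = map { ord } split //, $copy;

print "before: @values\nafter:  @after\n";
```

Under this view the byte string is defined by its character values alone,
so it stays the same byte string whether or not the flag is set.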

I just don't see why the programmer has to learn about that internal flag
at all. If they have to, then perl could become much much faster by forcing
them to do that all the time, instead of only in unpack or XS cases.

> great minds sink alike or so) And since unlike in Perl, upgradings are 
> never done permanently, you can keep your BINARY string and compare it to 
> UTF-8 whatever, and it never gets "corrupted".

In the 5.5 model, nothing ever gets "corrupted" either. That's the beauty of
it. Because scalars with the UTF-X flag set behave the same way as scalars
not having it set, everything is compatible with each other.

It's only in the cases where it _does_ make a difference that this is a
problem and stuff in fact gets corrupted.

> I am not sure how one could achieve that in Perl. Making the SV read-only?

By fixing the remaining bugs and making the UTF-X flag truly internal, so
you do not have to worry about modules corrupting your stuff.

That's what perl does for you in the vast majority of cases already, and it
should simply do that all the time, so programmers have their typeless perl
that they love again.

> > > In short, it becomes a mess.
> >
> > Yes, with strong typing, especially with string subtypes for arbitrary
> > encodings, it would be cleaner. But it would also not look like Perl
> > 5.

I beg to differ. Strong typing makes programming hard. Until Perl6 came and
destroyed it, the typeless nature of Perl was a feature, not a problem.

Why should perl suddenly introduce types for strings when a single abstract
string type works just as wonderfully as the single abstract scalar type
works in perl already?

Having strongly typed integers/doubles/utf-8-strings etc. is a step
backwards from perl towards Java.

Programmers using Perl do not want to worry about strict typing. They can
use C++ or Java anytime for that.

> Over the years, I come to the insight that I want to build reliable and
> fast programs. (easy to maintain, reliable, fast, pick two :-)
> So maybe we really need "use strict 'encodings';" :-)

What for? So that your program crashes at runtime instead of degrading to a
slower but correct path when it happens to hit binary data? Surely you do
not want this, or do you?

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
