develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 30, 2007 17:05
Marc Lehmann wrote 2007-03-31 1:33 (+0200):
> The difference between us, and that's what it boils down to, is that you give
> the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.

No, that's a unidirectional thing.

I've said it on p5p at least a dozen times, but I'll say it again:

If the UTF8 flag is set, you can be sure that you have a text string.
If the UTF8 flag is not set, it can be either a byte string or a text
string.

If you have a text string, the UTF8 flag may or may not be set. If you have
a byte string, the UTF8 flag is not set (or it was set because you
treated the byte string as a text string).
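
The asymmetry above can be sketched in a few lines of Perl. This is only an
illustration: utf8::upgrade and utf8::is_utf8 are core functions, but the sub
name same_text_two_storages is made up for this sketch, and the flag is
inspected here purely to show that it says nothing about the text itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the same text string
# twice, once as Perl happened to store it and once after utf8::upgrade.
sub same_text_two_storages {
    my $plain    = "caf\xe9";   # 4 characters, all below 256
    my $upgraded = $plain;
    utf8::upgrade($upgraded);   # changes the internal storage only
    return ($plain, $upgraded);
}

my ($plain, $upgraded) = same_text_two_storages();

# Same characters, same length: the strings are equal at the Perl level.
print "equal\n" if $plain eq $upgraded;
printf "lengths: %d %d\n", length($plain), length($upgraded);

# Illustration only: the flag differs even though the text does not.
printf "flag: %s / %s\n",
    (utf8::is_utf8($plain)    ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");
```

Both values are text strings with identical contents; only one of them
carries the flag, which is why the flag cannot be read as "this is text".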

> The problem with your approach is that you have to expose the UTF-X flag
> to users. Which comes with a lot of problems.

Again: you're kidding, right?

I'm constantly very explicitly and verbosely telling people to NOT look
at the flag, NOT set it manually, etcetera.

Heck, I've even explained that I think you should try to (pretend to) be
ignorant about the internals, in response to your message even!

I do not understand how you are able to misinterpret this message even
after this many posts in this thread alone. Have you ever read
perlunitut, even?

> Initially I thought you, too, wanted a unicode model where the UTF-X bit is
> not exposed to the perl level. But in fact the opposite is true: you
> force knowledge of the UTF-X bit on users, even though it should be
> transparent.
> ...
> the problem is you want them to track the UTF-X flag in addition to that.
> ...
> Then why do you want to force people to know about how
> 128..255 is encoded internally then? 

That's not what I said, nor what I meant. In fact, quite the opposite.

If you're just spending this evening just to get on my nerves, then

> > Oh, but they do. Please read perlunitut, which tries to redefine the
> > universe into four important definitions (and succeeds).
> I do not have that manpage.

> Because "internal format" strings can store binary data just as well,
> and often does.

Yes, and when you use such a byte string as a text string, its bytes are
considered to be codepoints, just like in latin1.
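
A minimal sketch of that rule, assuming the core Encode module is available.
The helper name byte_as_text is hypothetical, and joining the byte with
U+263A is just one convenient way of using it as text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: uses a one-byte string as text by joining it with a
# character above 255, then encodes the result as UTF-8.
sub byte_as_text {
    my $byte  = "\xe9";                    # one byte, value 0xE9
    my $text  = $byte . "\x{263a}";        # used as text, next to U+263A
    my $first = ord(substr($text, 0, 1));  # code point of the first character
    my $utf8  = encode("UTF-8", $text);    # characters to bytes
    return ($first, length($text), $utf8);
}

my ($first, $chars, $utf8) = byte_as_text();
printf "U+%04X, %d characters, %d bytes\n", $first, $chars, length($utf8);
# prints "U+00E9, 2 characters, 5 bytes"
```

The byte 0xE9 ends up as the character U+00E9, exactly as it would if the
byte string had been decoded as latin1 first.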

> I am talking purely about the perl level strings. If perlunitut confused
> the issue by talking about internal encoding it completely failed its
> mission, imho.

I strongly suggest that you READ the document before whining about its
supposed failure.

> The problem is that some parts of perl make a difference between the
> very same string, depending on how it is encoded internally, _even if
> the encoding is the same on the Perl level_.

Those are bugs. Report them, and they might get fixed.
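
One commonly cited difference of this kind is regex matching of characters
in the 128..255 range. The sketch below is an assumption-laden illustration,
not something taken from this thread: the helper names are hypothetical, and
the /u modifier only exists in Perl 5.14 and later, well after this message.

```perl
use strict;
use warnings;

# Hypothetical helper: matches with Perl's default regex semantics.
sub matches_word_char {
    my ($str) = @_;
    return $str =~ /\w/ ? 1 : 0;
}

# Hypothetical helper: same match, but forcing Unicode rules (Perl 5.14+).
sub matches_word_char_u {
    my ($str) = @_;
    return $str =~ /\w/u ? 1 : 0;
}

my $plain    = "\xe9";       # U+00E9, stored without the flag
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same string, different internal storage

print "same string\n" if $plain eq $upgraded;
printf "default: %d %d\n",
    matches_word_char($plain), matches_word_char($upgraded);
printf "with /u: %d %d\n",
    matches_word_char_u($plain), matches_word_char_u($upgraded);
```

With default semantics the two equal strings give different answers; with
/u they agree.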

> > utf8::encode is a text operation. It will assume that whatever you give
> > it, is a text string. Its characters are considered Unicode codepoints.
> Where does it say so?

Well, you have already denied that "encoding is going from characters to
bytes" is a real world fact, so I guess there's little point in pointing
out the places where exactly the same thing is explained.
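
To make "going from characters to bytes" concrete, here is a small sketch.
utf8::encode is the real core function; the wrapper encoded_copy is a
hypothetical name used only to keep the originals untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: applies utf8::encode to a copy and returns the result.
sub encoded_copy {
    my ($str) = @_;
    utf8::encode($str);   # in place: characters become UTF-8 bytes
    return $str;
}

my $text  = "\x{20ac}";            # one character, EURO SIGN
my $bytes = encoded_copy($text);   # three bytes: E2 82 AC
my $twice = encoded_copy($bytes);  # the bytes are taken as U+00E2 U+0082 U+00AC

printf "%d %d %d\n", length($text), length($bytes), length($twice);
# prints "1 3 6"
```

The second call shows the text-operation assumption at work: handed a byte
string, utf8::encode still treats every byte as a character and encodes it
again.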

> > you need to know some internals.
> Wrong. I need know no internals

A certain Marc Lehmann once said:

"I would love if that were the case, but the powers to be decided that
every perl programmer has to know those internals, and needs to be able
to deal with them."

> > That makes no sense, because UTF-8 is a means of representing
> > characters. Byte strings consist of bytes, not characters.
> Not in C, which is what the documentation constantly refers to, mind
> you.

And that is bad, I agree. Perl programmers should not be expected to
speak C in order to understand Perl documentation. This is a big
problem in Perl's documentation, but who's going to fix it?

Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I do not trust voting computers.
See <>.
