perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 17:43
On Sat, Mar 31, 2007 at 02:04:53AM +0200, Juerd Waalboer wrote:
> I've said it on p5p at least a dozen times, but I'll say it again:
> If the UTF8 flag is set, you can be sure that you have a text string.

Repeating wrong statements does not make them true.

> If you have a text string, the UTF8 flag may or may not be set. If you have
> a byte string, the UTF8 flag is not set (or it was set because you
> treated the byte string as a text string).

No, please look at my example of JSON.

> > The problem with your approach is that you have to expose the UTF-X flag
> > to users. Which comes with a lot of problems.
> Again: you're kidding, right?
> I'm constantly very explicitly and verbosely telling people to NOT look
> at the flag, NOT set it manually, etcetera.

So why do you propose that people have to make sure that they never put a
binary string with the UTF-X flag set into unpack?

How are users supposed to do that, unless they know about the flag in the
first place?

No, I am not kidding. You are part of the crowd who wants to expose the
UTF-X flag to the perl level, despite your claims that you do not want to.

> Heck, I've even explained that I think you should try to (pretend to) be
> ignorant about the internals, in response to your message even!

Right, and then you want perl functions to die depending on the setting of
that flag, even though you also claim Perl users should not need to know
about it.

So when users get that error message, you tell them that they did something
wrong with a flag that they should not care about?

No, I am certainly not kidding.

> I do not understand how you are able to misinterpret this message even
> after this many posts in this thread alone. Have you ever read
> perlunitut, even?

As I said, I have no such manpage, and even if I had, it has nothing to do
with this. I am not misinterpreting your message at all.

You want perl functions to behave differently depending on whether that flag
is set or not. I want perl functions to behave the same, regardless of the flag.

You expose the UTF-X flag that way. I don't.
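
To make the distinction concrete, here is a minimal sketch (an illustration only, not code from this thread; the variable names are made up). It builds two strings that are identical at the Perl level and differ only in their internal representation, using the core utf8::upgrade and utf8::is_utf8 functions:

```perl
use strict;
use warnings;

# Two copies of the same one-element string (value 0xE9).
my $plain    = "\xe9";
my $upgraded = "\xe9";

# Force the second copy into the internal UTF-X representation.
# This changes the storage, not the string as seen by Perl code.
utf8::upgrade($upgraded);

printf "flag on plain:    %d\n", utf8::is_utf8($plain)    ? 1 : 0;
printf "flag on upgraded: %d\n", utf8::is_utf8($upgraded) ? 1 : 0;
printf "equal:            %d\n", $plain eq $upgraded      ? 1 : 0;
printf "lengths:          %d %d\n", length($plain), length($upgraded);
```

The two strings compare equal and have the same length; only the flag differs. A function that croaks on one but not the other makes that flag visible to the caller.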

You *are* contradicting yourself, but that has nothing to do with whether I
have read that document or not. That is your problem alone.

Either you do expose the UTF-X flag by making perl functions behave
differently, or you don't.

No amount of claiming that you do not want to expose it can fix that: you do,
whether you want to or not, if you change Perl semantics to make a difference.

> That's not what I said, nor what I meant. In fact, quite the opposite.

So then unpack should not croak when it sees the UTF-X flag?

> If you're just spending this evening just to get on my nerves, then
> congratulations!

No, I am trying to make you understand the typeless nature of Perl, and
that your proposals expose the UTF-X flag, no matter what you *want*.

You could just understand that for a change, then maybe you wouldn't need to
accuse me of just trying to get on your nerves.

I do understand that you said you do not want to expose that flag. But as
long as the changes you propose do that, it is being exposed.

I am sorry that I can't say it any clearer.

> > Because "internal format" strings can store binary data just as well,
> > and often do.
> Yes, and when you use such a byte string as a text string, its bytes are
> considered to be codepoints, just like in latin1.

Yeah, sure. Mind you: no mention of UTF-X.

> > I am talking purely about the perl level strings. If perlunitut confused
> > the issue by talking about internal encoding it completely failed its
> > mission, imho.
> I strongly suggest that you READ the document before whining about its
> supposed failure.

Well, I trust that you don't misquote its contents. Did you?

> > The problem is that some parts of perl make a difference between the
> > very same string, depending on how it is encoded internally, _even if
> > the encoding is the same on the Perl level_.
> Those are bugs. Report them, and they might get fixed.

I did. That's the whole point of this thread. I reported them a number of
times. How could you miss that?

> > > utf8::encode is a text operation. It will assume that whatever you give
> > > it, is a text string. Its characters are considered Unicode codepoints.
> > Where does it say so?
> Well, you have already denied that "encoding is going from characters to
> bytes" is a real world fact, so I guess there's little point in pointing
> out the places where exactly the same thing is explained.

If it is wrong, it's wrong, no matter how often you try to explain
it. People do store octets in UTF-8. Perl even extends UTF-8 to UTF-X
to make interesting uses possible. So yes, if that's broken, then Perl

Choose one of the two: your claims are wrong, or Perl is wrong. Either way
suits me, although I personally think the current model makes much more sense
than your user-has-to-care-about-the-UTF-X-flag-explicitly model.
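
As a small sketch of the utf8::encode behaviour under discussion (an illustration only, not code from this thread; the variable names are made up): utf8::encode treats each element of its argument as a code point, whatever the internal representation, so a string of octet values encodes the same way with or without the flag.

```perl
use strict;
use warnings;

# The same one-element string in both internal representations.
my $downgraded = "\xe9";
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);

# utf8::encode works in place and replaces each element by the
# UTF-8 octets for that element's code point value.
utf8::encode($downgraded);
utf8::encode($upgraded);

printf "same result: %d\n", $downgraded eq $upgraded ? 1 : 0;
printf "octets:      %s\n", join " ", map { sprintf "%02x", ord } split //, $downgraded;
printf "flag after:  %d\n", utf8::is_utf8($upgraded) ? 1 : 0;
```

Both calls produce the same two octets, and the result no longer carries the flag.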

> > > you need to know some internals.
> > Wrong. I need know no internals
> A certain Marc Lehmann once said:
> "I would love if that were the case, but the powers that be decided that
> every perl programmer has to know those internals, and needs to be able
> to deal with them."

Yes. Any problems with that?

As you like to quote with misleading context, let me add that the context
was unpack and perl modules using it or XS, not utf8::encode.

You are committing a classic logical fallacy: just because some parts of Perl
do not force you to know internals, this does not mean that none of Perl
forces you to.

> > > That makes no sense, because UTF-8 is a means of representing
> > > characters. Byte strings consist of bytes, not characters.
> > Not in C, which is what the documentation constantly refers to, mind
> > you.
> And that is bad, I agree. Perl programmers should not be expected to
> speak C in order to understand Perl documentation. This is a big
> problem in Perl's documentation, but who's going to fix it?

I do not suffer from it. I just want sane behaviour in Perl, which doesn't
force me to think about whether my UTF-X flag could be set and my program
could die because of that, but where I get the correct and expected

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
