develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 23:41
On Sat, Mar 31, 2007 at 03:03:21AM +0200, Juerd Waalboer <> wrote:
> JSON is pretty big to just quickly examine. I have nothing set up for
> testing it.

Not my problem. Your coding style cannot handle it, though, so in your own
interest you should try to examine it some day.

> > > I'm constantly very explicitly and verbosely telling people to NOT look
> > > at the flag, NOT set it manually, etcetera.
> > So why do you propose that people have to make sure that they never put a
> > binary string with the UTF-X flag set into unpack?
> Not unpack in general, but unpack "C".
> Because "C" is explicitly catered for byte data, which strings with the
> UTF8 flag aren't.

Well, you are not talking about Perl here.

> It won't always catch mistakes, because indeed lack of
> the flag says nothing, but it can help catch some of them.

Having the flag means nothing, either.

> Perl already has a similar warning in many places, for example when you
> print such a "wide character" on a filehandle that has no encoding or
> utf8 layer. Some modules, like MIME::Base64, provide the same
> functionality.

It is similar, but it works completely differently: it only warns if you pass
something into a function/filehandle that knows that it is expecting binary
data.

Unlike with unpack, the UTF-X flag has nothing to do with the warning: the
warning tells you that the data you pass in is not binary data because it
contains at least one character >255. That's completely fine. But when I do
pass in a string consisting only of octets (at the Perl level), then it gets
passed into the function as binary, as one would expect.

And that, again, has nothing to do with the UTF-X flag. Data passed into
such a function gets properly downgraded (that process is what actually
generates the warning, btw).
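
The downgrade-on-output behaviour described above can be sketched as follows.
This is only an illustration under stated assumptions: the helper name
bytes_written and the in-memory filehandle are hypothetical choices made for
the example, not anything from this thread, and ":raw" is used only to make
sure the handle has no encoding layer.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Hypothetical helper: print $string to an in-memory filehandle without
# any encoding layer and return the bytes that were written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:raw', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $octets   = "caf\xe9";     # four characters, all <= 255
my $upgraded = $octets;
utf8::upgrade($upgraded);     # same characters, UTF-X encoded internally

# Both strings are written as the same four octets, without a warning.
my $out_plain         = bytes_written($octets);
my $out_upgraded      = bytes_written($upgraded);
my $warned_for_octets = scalar @warnings;

# A character > 255 cannot be downgraded, so print warns here.
my $out_wide        = bytes_written("\x{263a}");
my $warned_for_wide = scalar(@warnings) - $warned_for_octets;
```

The point of the sketch is that the warning follows the content of the
string (a character >255), not its internal encoding.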

> > How are users supposed to do that, unless they know about he flag in the
> > first place?
> By keeping byte strings and text string separate. Please either accept
> this, or stop asking me questions that will lead to this answer.

I am asking how users do that; I am not asking what you think they
should do. I am asking specifically _how_ your idea should be put into
practice. I gave you an example where the only currently known way to do that
is by knowing about and manipulating the internal UTF-X flag.

And since you have not given an answer to that question, it stays a valid
question.

The problem is that your coding style cannot resolve this situation, as
the module in question (JSON::XS) does not know whether the given piece of
data is binary or text. Only the user knows, but by then it is already

> > Right, and then you want perl functions to die depending on the setting of
> > that flag, even though you also claim Perl users should not need to know
> > about it.
> The warning would not be a new feature, but an existing feature applied
> in more places. "die" is probably too harsh indeed.

No part of Perl acts like that. See above: the parts that generate that
warning all downgrade properly, ensuring that Perl's promises about string
handling are kept.

> When they get the error message, they can read the following in
> perldiag:
>        Wide character in %s
>            (W utf8) Perl met a wide character (>255) when it wasn’t expecting one.  This warning is by default on for I/O
>            (like print).  The easiest way to quiet this warning is simply to add the ":utf8" layer to the output, e.g.
>            "binmode STDOUT, ':utf8'".  Another way to turn off the warning is to add "no warnings 'utf8';" but that is
>            often closer to cheating.  In general, you are supposed to explicitly mark the filehandle with an encoding,
>            see open and "binmode" in perlfunc.
> Changing the order of these sentences is on my to-do list.

You are completely confused. I am talking about octet strings (or byte
strings in your parlance). Such a string _never_ triggers that warning,
regardless of how it is encoded internally, because octet strings never
contain wide characters.

That's how the abstraction should work.

Your change, warning whenever the UTF-X bit is set, would break that
abstraction, because users would suddenly get that warning for strings that
do not contain wide characters *at all*.

That I can only call very misleading to users.

> Note how this clear explanation doesn't mention the UTF8 flag!

Exactly, and you didn't understand the mechanics of that warning: it doesn't
do what you claim it does (warn if the UTF-X flag is set). Instead it does
the right thing and warns when there *is* a wide character in the string,
regardless of how the string was encoded.

Do you finally understand? Please!

> > You want perl functions to behave different depending on wether that flag is
> > set or not. I want perl functions to behave the same, regardless of the fact.
> I want Perl to warn about certain mistakes when it can.

No, you want Perl to warn even when no mistakes happened because you
equate UTF-X flag with "contains no (binary) octets/bytes".

But that's not how Perl works. That's where you misunderstand how the UTF-X
flag works. Perl warns on real problems (and probably should die), not
because the UTF-X flag happens to be set, which would be misleading.

Do you finally understand how Perl works?

> > > That's not what I said, nor what I meant. In fact, quite the opposite.
> > So then unpack should not croak when it sees the UTF-X flag?
> No, it should warn instead. From now on, I no longer think it should die. It
> should warn, and people who want it to die can do so with "use warnings FATAL".

Of course it should not warn. That *exposes* the UTF-X flag to the
user. And the warning you quote would simply be wrong, because users would
get that warning even when no wide character is in the string at all.

> I don't usually read bug reports, and never claimed to have done so.
> But in this special case, I will make an exception, and read the Unicode
> related bug reports that you have submitted.

Maybe you learn what the UTF-X flag does, and why it shouldn't be exposed
in the way you think it should be or is currently exposed.

The UTF-X flag is *no* indication of a wide character whatsoever. In Perl.
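
A minimal sketch of that distinction, for illustration only: the helper
has_wide_char is a hypothetical name, and utf8::is_utf8 is called here
purely to show the internal state, not as something user code should
depend on.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds a character above 255,
# judged purely by content, never by the internal flag.
sub has_wide_char {
    my ($string) = @_;
    return $string =~ /[^\x00-\xff]/ ? 1 : 0;
}

my $ascii = "plain ascii";
utf8::upgrade($ascii);        # sets the internal flag; characters unchanged

my $smiley = "\x{263a}";      # one character > 255

# Both strings carry the flag, but only one contains a wide character.
my $ascii_flag  = utf8::is_utf8($ascii)  ? 1 : 0;
my $smiley_flag = utf8::is_utf8($smiley) ? 1 : 0;

printf "flag=%d wide=%d\n", $ascii_flag,  has_wide_char($ascii);
printf "flag=%d wide=%d\n", $smiley_flag, has_wide_char($smiley);
```

The flag is set in both cases, so it cannot be used to tell the two strings
apart; only the content check does that.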

I think it's obvious by now that you do not know very much about
Unicode handling vs. the UTF-X flag in Perl. At least your knowledge is
mostly wrong, it seems.

And that's sad, because it could be very simple, and for the most part already
is very simple: often-used modules will simply be improved to use SvPVbyte
explicitly, even if there is no default typemap support for it. And modules
requiring binary data will eventually be fixed to use "U" instead of "C" for
decoding single octets. And the rest of Perl works relatively fine, and the
remaining issues will be fixed, too.
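
SvPVbyte is an XS-level macro, but the same idea can be sketched at the Perl
level with utf8::downgrade. This is only an approximation under stated
assumptions: the function name octets_or_croak and its error message are
hypothetical, chosen for this example.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical function that needs octets. It works on a copy, so the
# caller's string is left untouched.
sub octets_or_croak {
    my ($string) = @_;
    # With a true second argument, utf8::downgrade returns false instead
    # of dying when the string holds a character above 255.
    utf8::downgrade($string, 1)
        or croak "Wide character in octets_or_croak";
    return $string;
}

my $upgraded = "caf\xe9";
utf8::upgrade($upgraded);     # octets only, but UTF-X encoded internally

# Succeeds: every character fits into one octet.
my $bytes = octets_or_croak($upgraded);

# Fails: the string really does contain a wide character.
my $ok    = eval { octets_or_croak("\x{263a}"); 1 };
my $error = $@;
```

A function written this way accepts any octet string regardless of its
internal encoding and complains only when a real wide character is present.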

I just think it would be much better for Perl if those changes were not
required and things would just continue to work by providing backwards
compatibility.

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
