develooper Front page | perl.perl5.porters | Postings from March 2010

[perl #73376] Re: Character (or byte?) escapes under utf8 pragma

From: Karl Williamson via RT
Date: March 31, 2010 02:59
Subject: [perl #73376] Re: Character (or byte?) escapes under utf8 pragma
Message ID: rt-3.6.HEAD-6227-1269960231-981.73376-15-0@perl.org

On Sun Mar 07 09:30:39 2010, public@khwilliamson.com wrote:
> 
>   A. Pagaltzis (via RT) wrote:
> > # New Ticket Created by  "A. Pagaltzis" 
> > # Please include the string:  [perl #73376]
> > # in the subject line of all future correspondence about this issue. 
> > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=73376 >
> > 
> > 
> > Hi Michael,
> > 
> > [ perlbug readers, you will find the nut of the issue in the
> >   section marked BUG ]
> > 
> > * Michael Ludwig <michael.ludwig@xing.com> [2010-03-03 14:05]:
> >> For convenience, I have test script source code in UTF-8. The
> >> test also deals with non-breaking spaces, which I prefer to
> >> keep as character references since they are not visible and
> >> might be mistaken by the casual onlooker for ordinary spaces.
> >> So I write them as "\xa0". Or "\x{a0}", or "\x{00a0}".
> >>
> >> Now I find that they seem to be byte references, not character
> >> references.
> > 
> > Perl does not distinguish between bytes and characters. It does
> > distinguish between scalars that use a packed byte buffer for
> > storage vs strings that use variable-width integer sequence for
> > storage, but this is an implementation detail and does not mean
> > anything in terms of semantics. Strings are simply strings in
> > Perl. You cannot tell what kind of data they contain just by
> > looking at them and the UTF8 flag doesn’t tell you either.
> > 
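A minimal sketch of the point above, using only the built-in utf8::upgrade and utf8::is_utf8 functions (the variable names are illustrative, not from the ticket). It stores the same one-character string in both internal formats and compares them:

```perl
use strict;
use warnings;

# The same one-character string, held in two internal representations.
my $packed = "\xa0";        # stored as a packed byte buffer
my $wide   = "\xa0";
utf8::upgrade($wide);       # now stored in the variable-width format

# The internal flag differs between the two scalars...
print utf8::is_utf8($packed) ? "flagged" : "unflagged", "\n";
print utf8::is_utf8($wide)   ? "flagged" : "unflagged", "\n";

# ...but string semantics do not: same length, and the strings compare equal.
print length($packed), " ", length($wide), "\n";
print $packed eq $wide ? "equal" : "different", "\n";
```

The two scalars should compare equal and both have length 1, which is why the flag alone says nothing about whether a string holds bytes or characters.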
> >> Consider the following test script:
> >>
> >> use strict;
> >> use warnings;
> >> use utf8; # source code in UTF-8 ("Zurück")
> >> use open OUT => ':encoding(UTF-8)', ':std';
> >>
> >> my $str1 = "<<\xa0Zurück\n";      # byte -> bad
> >> my $str2 = "<<\x{a0}Zurück\n";    # should be character, but isn't
> >> my $str3 = "<<\x{00a0}Zurück\n";  # ditto
> >> my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works
> >>
> >> print $str1, $str2, $str3, $str4;
> >>
> >> $str1 ne $str2 and die "won't die";
> >> $str1 ne $str3 and die "won't die";
> >> $str1 ne $str4 and die 'die now, somewhat counter-intuitively';
> > 
> >     "\x{00a0}" does not map to utf8 at t.pl line 11.
> >     <<\xA0Zurück
> >     "\x{00a0}" does not map to utf8 at t.pl line 11.
> >     <<\xA0Zurück
> >     "\x{00a0}" does not map to utf8 at t.pl line 11.
> >     <<\xA0Zurück
> >     << Zurück
> >     die now, somewhat counter-intuitively at t.pl line 15.
> > 
> > This is definitely a bug.
> > 
> >> The correct version of the string uses implicit upgrading of
> >> the byte escape "\xa0" to a Unicode character. I've read
> >> upgrading should rather be avoided, but here it does the job.
> > 
> > No, upgrading is perfectly fine. Mixing byte and character data
> > is what should be avoided, because then Perl will assume it’s all
> > characters, which will result in mangling of one of the two kinds
> > of data. Usually the byte data is encoded text, in which case the
> > problem becomes apparent as double-encoded text. But it’s really
> > a problem both ways.
> > 
> >> Am I mistaken in my expectation that while "\xa0" should be
> >> a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
> >> perlretut(1) seems to support this assumption:
> >>
> >>  Unicode characters in the range of 128-255 use two hexadecimal
> >>  digits with braces: \x{ab}. Note that this is different than
> >>  \xab, which is just a hexadecimal byte with no Unicode
> >>  significance.
> >>
> >> http://perl.active-venture.com/pod/perlretut-morecharacter.html
> >>
> >> But maybe this only refers to these escapes inside regular expressions.
> > 
> > The documentation appears to be wrong. Unfortunately a lot of the
> > documentation of Perl itself is wrong or confused about Perl’s
> > string model.
> 
> That passage is not in perlretut, at least not in 5.8 or later.  I hope
> that 5.11 has cleaned up most of the other documentation errors around
> these Latin-1 issues.
> 
> > 
> >> Or maybe the utf8 pragma breaks things here? Don't think so,
> >> though. If I comment it out, I have to recode my script to
> >> Latin1 in order for the strings to be valid.
> > 
> > Yes. This appears to be a utf8 pragma bug or a bug in the parser
> > that shows up in interaction with the utf8 pragma.
> > 
> >     ====================== BUG ======================
> > 
> > What happens is that the presence of the ü under the utf8 pragma
> > triggers using the variable-width integer sequence format for the
> > string, but the 0xA0 byte from the \x escape gets written into
> > that buffer verbatim, as if it were a packed byte array string.
> > This is wrong and completely broken.
> > 
> >     ====================== BUG ======================
> > 
> >> Note that the reason I use the utf8 pragma is so I can write
> >> "Zurück" in my source code and automatically have Perl informed
> >> that these are characters, not bytes - which is a great
> >> convenience.
> >>
> >> Yeah, it would also work in Latin1, and our editors handle
> >> various encodings just fine - but we have a good UTF-8
> >> development environment and there might be characters not
> >> representable in Latin1 that I'd like to add to the script
> >> source.
> > 
> > Writing source in UTF-8 is a perfectly sane practice. No need to
> > justify it.
> > 
> >> What's your advice for handling this situation more elegantly?
> > 
> > Use the \U escape to indicate that you always mean a Unicode code
> > point. Due to other quirks in how \U is implemented, it ends up
> > not triggering the bug that \x would.
> > 
> > Regards,
> 
> I don't understand what you mean here by the \U escape.  Starting in 
> Perl 5.11.5, \N{U+...} can be used instead of \x{...} to force Unicode 
> semantics.
> 

This bug was fixed in 5.10.1.  I can't reproduce it there or in
5.12 RC1, but I can in 5.8.9.  The documentation has also been cleaned up.
I'm resolving the ticket.
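
For anyone who wants to check their own perl, here is a rough sketch based on the test script quoted above. It assumes the file is saved as UTF-8 and that perl is 5.12 or later, so that \N{U+...} works without loading charnames; the @strings and $mismatches names are mine, not from the original script.

```perl
use strict;
use warnings;
use utf8;   # source code in UTF-8 ("Zurück"), as in the original report

# The four spellings from the report, plus the \N{U+...} form.
my @strings = (
    "<<\xa0Zurück\n",
    "<<\x{a0}Zurück\n",
    "<<\x{00a0}Zurück\n",
    "<<\xa0" . "Zurück\n",      # the concatenation that worked in the report
    "<<\N{U+00A0}Zurück\n",
);

# Count how many spellings differ from the concatenated form.
my $mismatches = grep { $_ ne $strings[3] } @strings;
printf "%d of %d strings differ from the concatenated form\n",
    $mismatches, scalar @strings;

# Show the code point that ended up at the position of the escape.
printf "U+%04X\n", ord substr($_, 2, 1) for @strings;
```

On a perl with the fix I would expect no mismatches and U+00A0 at index 2 in every string; on an affected perl such as 5.8.9 the first three should differ from the fourth.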
-- 
--Karl Williamson


