develooper Front page | perl.perl5.porters | Postings from April 2007

[BUG+TEST] Re: perl, the data, and the tf8 flag

Juerd Waalboer
April 4, 2007 17:05
Dr.Ruud wrote 2007-04-04 21:30 (+0200):
> >     A   A byte-wide string with arbitrary binary data, will
> >         be space padded.
> Is that ASCII-space (0x20) or can it be locale or EBCDIC-space (0x40)
> too?

    memset(cur, datumtype == 'A' ? ' ' : '\0', len);

I don't know how in C ' ' is interpreted on an EBCDIC platform, and if
any translation from ASCII to EBCDIC happens before compiling, etcetera.
So I can't answer this question.
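
For illustration, the padding step can be sketched in plain C. This is only a stand-in, not the pp_pack.c code: pad_field and its parameters are hypothetical names. Like the memset line above, it uses the character constant ' ', so the pad byte is whatever value the compiler gives that constant (0x20 on an ASCII platform).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the perl source:
 * copy src into a fixed-width field and pad the remainder.
 * 'A' pads with ' ', any other type (e.g. 'a') pads with '\0'. */
void pad_field(char *out, size_t width,
               const char *src, size_t srclen, char datumtype)
{
    size_t n;

    assert(out != NULL && src != NULL);

    /* Truncate the data if it is longer than the field. */
    n = srclen < width ? srclen : width;
    memcpy(out, src, n);

    /* Same expression as the memset quoted above. */
    memset(out + n, datumtype == 'A' ? ' ' : '\0', width - n);
}
```

Calling pad_field(buf, 4, "ab", 2, 'A') fills buf with "ab" followed by two pad bytes. Comparing the result against ' ' rather than a hard-coded 0x20 keeps the check independent of the platform's character set.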

But while I was reading the source, I found this interesting part in
5.9.5:

    /* 'A' strips both nulls and spaces */
    const char *ptr;
    if (utf8 && (symptr->flags & FLAG_WAS_UTF8)) {
            [...]
                !is_utf8_space((U8 *) ptr)) break;
            [...]
    } else {
            [...]
            if (*ptr != 0 && !isSPACE(*ptr)) break;
            [...]

In 5.8.8, only the latter (isSPACE) is used. This means that the bug
the regex engine has is now copied to pack. Hurrah! :)

In short: the UTF8 flag is again used to decide between ASCII and
Unicode semantics, even though non-UTF8-flagged text strings are latin1,
which is Unicode too.
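
To make the discrepancy concrete, here is a rough C sketch of the two stripping strategies. It is a simplification under stated assumptions, not the pp_pack.c code: strip_bytes, strip_utf8 and is_ascii_space are hypothetical names, is_ascii_space merely stands in for isSPACE, and the UTF-8 branch only knows about U+00A0 where the real is_utf8_space knows more characters.

```c
#include <stddef.h>

/* Stand-in for isSPACE(): ASCII whitespace only (an assumption). */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Byte semantics: strip trailing NULs and ASCII whitespace.
 * Returns the new length. */
size_t strip_bytes(const unsigned char *s, size_t len)
{
    while (len > 0 && (s[len - 1] == 0 || is_ascii_space(s[len - 1])))
        len--;
    return len;
}

/* UTF-8 semantics: walk back one character at a time and also strip
 * U+00A0, which is encoded as the two bytes C2 A0. */
size_t strip_utf8(const unsigned char *s, size_t len)
{
    while (len > 0) {
        size_t start = len - 1;
        size_t clen;
        int strippable;

        /* Back up over continuation bytes (10xxxxxx) to the lead byte. */
        while (start > 0 && (s[start] & 0xC0) == 0x80)
            start--;
        clen = len - start;

        strippable =
            (clen == 1 && (s[start] == 0 || is_ascii_space(s[start]))) ||
            (clen == 2 && s[start] == 0xC2 && s[start + 1] == 0xA0);
        if (!strippable)
            break;
        len = start;
    }
    return len;
}
```

With the latin1 bytes "abc\xa0" plus two spaces, strip_bytes keeps four bytes, so the \xa0 survives. With the same text encoded as UTF-8 ("abc\xc2\xa0" plus two spaces), strip_utf8 keeps only the three bytes of "abc". Same characters, different result, depending only on the internal encoding.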

If unpack wants to treat byte data encoded as utf8 the same way it
treats unencoded byte data, then upgrading a string that contains a
non-breaking space must not make any difference. In Unicode, U+00A0 is
whitespace, but earlier perls have not considered \xa0 whitespace in
unpack.

    use v5.9.5;
    use strict;
    use warnings;
    use Test::More tests => 1;

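    # Same characters in both strings; only the internal encoding differs.
    my $nbsp1 = "abc\xa0  ";
    my $nbsp2 = $nbsp1;
    utf8::upgrade($nbsp2);  # sets the UTF8 flag without changing the text

    my $unpacked1 = unpack("A*", $nbsp1);
    my $unpacked2 = unpack("A*", $nbsp2);

    is($unpacked1, $unpacked2, "unpack A* does not depend on the UTF8 flag");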

I maintain that supporting UTF8 flagged strings with unpack is a
waste of effort. But if it is done, then it must be done correctly and
compatibly, or the hurting continues and the effort will have been in vain.

Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I do not trust voting computers.
See <>.
