develooper Front page | perl.perl5.porters | Postings from August 2013

Re: [perl #117355] [lu]cfirst don't respect 'use bytes'

Aristotle Pagaltzis
August 15, 2013 22:46
* Victor Efimov via RT <> [2013-08-12 23:20]:
> file not found No such file or directory at line 29.

That is really the last remnant (I think) of The Unicode Bug. The
problem here is that `open` and all the other file-related functions
blithely ignore the UTF8 flag, which is utterly broken.

Your use of bytes is a workaround. And however useful it may be while
the bug persists, a workaround is all it is. It *isn’t* a legitimately
good use case for bytes.

* Victor Efimov via RT <> [2013-08-12 22:40]:
> Another possible use of bytes are:
> 1) run-time, production-enabled assertions (
> ).
> It's similar to debugging, except performance matters.
> 2) Unit tests (sometimes performance matters).

I have no idea what the concept of assertions or that of unit tests has
to do with the internal representation of strings, or how what you wrote
afterwards is related to those things.

> Below example contains a bug (from Perl point view this can be treated
> as not-a-bug, but from programmer point of view it's a bug).
> (bug marked with "# THIS LINE CONTAINS A BUG")

OK, to cut a long story short, the line is

    my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

in which $ascii_u has the UTF8 flag set even though it contains only
characters < 128, and $bin contains characters between 128 and 255 yet
*doesn’t* have the UTF8 flag set.

Your complaint is that the concatenation blindly produces a string with
the UTF8 flag set, which requires the contents of $bin to be upgraded to
produce $bin_u, which takes up extra space, despite the string containing
only characters < 256.

This is not a bug, though it certainly is suboptimal.
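
For anyone following along, here is a minimal sketch of the effect being
described. The string contents are made up for illustration (three
characters in the 128..255 range plus one ASCII character); they are not
taken from the original test script.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

# Hypothetical data: three characters in 128..255, UTF8 flag off.
my $bin = "\xC0\xC1\xC2";

# ASCII-only string with the UTF8 flag forced on.
my $ascii_u = "A";
utf8::upgrade($ascii_u);

# Same content as $ascii_u, but with the flag off.
my $ascii_a = "A";

my $bin_u = $bin . $ascii_u;    # one flagged operand: result is upgraded
my $bin_a = $bin . $ascii_a;    # both operands unflagged: no upgrade

for my $pair ([bin_u => $bin_u], [bin_a => $bin_a]) {
    my ($name, $str) = @$pair;
    printf "%s: chars=%d internal_bytes=%d utf8_flag=%d\n",
        $name, length($str), bytes::length($str),
        utf8::is_utf8($str) ? 1 : 0;
}

# The two results compare equal; only the internal representation differs.
print "equal as strings: ", ($bin_u eq $bin_a ? "yes" : "no"), "\n";
```

With these values both strings are 4 characters long, but the upgraded one
occupies 7 bytes internally against 4 for the other, since each character
in 128..255 takes two bytes once upgraded.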

In theory Perl string operations could try to produce downgraded strings
whenever possible, but that requires scanning strings in many cases where
it currently doesn’t happen. Things would almost certainly actually get
slower.

> It does not affect anything, even program output, except
> performance/memory usage.
> bin_u is 7 bytes length, and bin_a is 4 bytes length.
> if 7 vs 4 bytes looks unimportant, consider 700 vs 400 MiB of binary files.
> And this bug can be caught (runtime or in unit tests) if line "#die if
> is_wide_string($bin_u);" uncommented.
> The only possible way to catch this is a use of bytes::length (or
> similar function which count bytes), because final output is same with
> or without bug.


The only possible way to catch this is encoding::warnings (with FATAL),
because if you try to catch it manually, you will miss places where you
would need to put checks.

Also, if you *already know* (some of) the places in your program where
this can happen, then the workaround is not to try to “catch” it after
it happened, but – again! – to utf8::downgrade your strings before you
concatenate them.

    utf8::downgrade($bin, 1);
    utf8::downgrade($ascii_u, 1);
    my $bin_u = $bin.$ascii_u; # THIS LINE NO LONGER CONTAINS A BUG

So here too, I do not see bytes being useful in any real way. In fact
in this case it doesn’t even provide a useful piece of a workaround.
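
If the places where this matters are known, the downgrade can be wrapped
up once. The following is only a sketch: `concat_bytes` is a hypothetical
helper name, and dying on failure is a design choice of this sketch (the
snippet above passes a true FAIL_OK and ignores the result).

```perl
use strict;
use warnings;
use bytes ();   # only for bytes::length in the report below

# Hypothetical helper: concatenate strings that are meant to be binary,
# downgrading each one first and refusing anything that cannot be.
sub concat_bytes {
    my @parts = @_;    # copies, so the caller's strings are left alone
    for my $part (@parts) {
        # With a true FAIL_OK argument utf8::downgrade returns false
        # instead of dying when a character above 255 is present.
        utf8::downgrade($part, 1)
            or die "cannot downgrade: string contains a character > 255\n";
    }
    return join '', @parts;
}

my $bin     = "\xC0\xC1\xC2";   # made-up binary data, UTF8 flag off
my $ascii_u = "A";
utf8::upgrade($ascii_u);        # ASCII content, UTF8 flag on

my $joined = concat_bytes($bin, $ascii_u);
printf "chars=%d internal_bytes=%d utf8_flag=%d\n",
    length($joined), bytes::length($joined),
    utf8::is_utf8($joined) ? 1 : 0;
```

Because the helper downgrades copies, it does not change the flag on the
caller's own variables; whether that is what you want depends on the
program.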

* Victor Efimov via RT <> [2013-08-12 22:55]:
> Code below prints that two strings are same. Tries to open file with
> name defined by one string, and then to reopen file with name defined
> by second string. Second attempt fail.

This is exactly the same bug as in your first comment on this issue, i.e.
just The Unicode Bug in `open` and friends.
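
To illustrate without touching the file system: the sketch below builds
two names that compare equal but differ in their internal bytes, which is
what `open` ends up looking at. The file name is hypothetical, and the
explicit encoding step assumes a file system that expects UTF-8 names;
that assumption is mine and will not hold everywhere.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics
use Encode ();

my $name_a = "caf\xE9.txt";     # hypothetical file name, UTF8 flag off
my $name_u = $name_a;
utf8::upgrade($name_u);         # same characters, UTF8 flag on

# As strings the two names are identical ...
print "eq: ", ($name_a eq $name_u ? "same" : "different"), "\n";

# ... but their internal buffers are not the same length.
printf "internal bytes: %d vs %d\n",
    bytes::length($name_a), bytes::length($name_u);

# Assumption: the file system wants UTF-8. Encoding explicitly gives the
# same byte string regardless of how the name was stored internally.
my $bytes_a = Encode::encode('UTF-8', $name_a);
my $bytes_u = Encode::encode('UTF-8', $name_u);
print "encoded names match: ", ($bytes_a eq $bytes_u ? "yes" : "no"), "\n";
```

Passing `$bytes_a` or `$bytes_u` to `open` then names the same file either
way, which is the property the raw strings lack.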

Aristotle Pagaltzis // <>
