Re: [perl #131685] Rename utf8::is_utf8() (and other functions)

July 4, 2017 09:23
On 4 July 2017 at 11:03,  <> wrote:
> On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:
>> On 4 July 2017 at 09:19,  <> wrote:
>> > On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
>> >> But it does deprecate the old names, which is an issue, I can't
>> >> imagine us removing these functions.
>> >
>> > Warning can be removed from patch. It is just question how you decide.
>> > Also functions stay there, but we can instruct people via documentation
>> > to use new functions for a new code... Again it is question if you call
>> > it deprecation or aliasing. In any case functions are not going to be
>> > deleted, so in final case it does not matter for old code.
>> >
>> > And for old code can be defined this function easily:
>> >
>> >   *new_name = *old_name;
>> >
>> > Reason for this patch series is:
>> > * document those utf8:: functions
>> > * allow developers to call those functions via non-cryptic names
>> I don't mind adding new aliases for these functions, I object to your
>> proposal to put them in Internals however; I think that they should go
>> in 'scalar', which we decided at the last PerlQA is the designated
>> place for functions that operate on scalars.
> I proposed Internals, because that flag is internal for perl and
> invisible for pure perl code. But if more people are happy with scalar
> namespace, I'm fine with it.
>> scalar::is_unicode_string()
>> scalar::is_binary_string()
> But this is wrong! SVf_UTF8 does not tell whether a scalar string is Unicode
> or binary. It just tells the type of internal storage.

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations. When the
flag is set, those rules are Unicode. When the flag is not set, the
default rules are derived from ASCII.

You can see the difference in the following:



The latter matches because \N{U+DF} produces the Unicode code point
DF, and the former does not match because \x{DF} produces the ASCII
octet DF instead. The former is an ASCII string, and the latter is a
Unicode string.
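
A minimal sketch of that difference follows. It is an illustration written
for this purpose, not necessarily the commands from the original message,
and the helper name matches_ss is made up. It assumes the default
(non-unicode_strings) semantics. Per the explanation above, the \x{DF}
string should fail to match /ss/i while the \N{U+DF} string and the
upgraded copy should match, even though all three compare equal with eq.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later: either would apply Unicode rules to both strings and hide the
# difference being shown.

my $octets  = "\x{DF}";     # code point 0xDF, SVf_UTF8 off
my $unicode = "\N{U+DF}";   # same code point, SVf_UTF8 on

my $upgraded = "\x{DF}";
utf8::upgrade($upgraded);   # same code point; only the flag and storage change

# Hypothetical helper: does the string match "ss" case-insensitively?
sub matches_ss { return $_[0] =~ /ss/i ? "match" : "no match" }

for my $case ([octets => $octets], [unicode => $unicode], [upgraded => $upgraded]) {
    my ($name, $str) = @$case;
    printf "%-8s flag=%-3s eq_octets=%-3s /ss/i: %s\n",
        $name,
        (utf8::is_utf8($str) ? "on"  : "off"),
        ($str eq $octets     ? "yes" : "no"),
        matches_ss($str);
}
```

The upgraded copy is the interesting case: nothing about its content
changed, only the flag, yet the case-insensitive match behaves differently.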

> Name is_binary_string is misleading in same way as current name is_utf8.

Erf, maybe. We need a term for "not-unicode", and "binary" is as good
as any. I don't mind other proposals.

> If you say that binary string is one with codes only in range 0x00-0xFF
> then you can have that binary string also with SVf_UTF8 flag and your
> function name "is_binary_string" would return false for your binary
> string. Such name would lead to another problems.

The SVf_UTF8 flag being off means the string should be treated as
ASCII when doing case-insensitive operations, and as binary for other
purposes, and that the data is encoded as a series of discrete octets.
It is not uncommon for people on this list to use the terms unicode
and binary for this reason.

>> I don't like the wide-storage thing (although I admit I think it is
>> better than "is_utf8"); a latin1 string in utf8 does not use
>> wide-storage,
> Of course it can. Unicode code points 0x80 .. 0xFF (which are the Latin1
> extension of ASCII) take two bytes each when encoded in UTF-8 and
> therefore are wide in UTF-8 too.

I spoke imprecisely, I should have said ASCII, not latin-1.
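
The byte counts in question can be checked with a short sketch. The helper
name utf8_byte_length is made up for illustration; utf8::encode is the core
function that converts a string's characters to UTF-8 octets in place.

```perl
use strict;
use warnings;

# Hypothetical helper: number of octets a single code point occupies
# once encoded as UTF-8.
sub utf8_byte_length {
    my ($code_point) = @_;
    my $str = chr($code_point);
    utf8::encode($str);      # characters -> UTF-8 octets, in place
    return length $str;      # now counts octets, not characters
}

# ASCII stays at one octet; 0x80 .. 0xFF need two.
for my $cp (0x41, 0x7F, 0x80, 0xFF) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_byte_length($cp);
}
```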


perl -Mre=debug -e "/just|another|perl|hacker/"
