develooper Front page | perl.perl5.porters | Postings from February 2018

Re: [perl.git] branch blead updated. v5.27.8-353-g4b262b1674

Karl Williamson
February 25, 2018 20:45
On 02/25/2018 08:57 AM, H.Merijn Brand wrote:
> On Sun, 25 Feb 2018 09:29:02 -0600, "Craig A. Berry" wrote:
>> On Feb 25, 2018, at 7:12 AM, H.Merijn Brand wrote:
>>> On Sat, 24 Feb 2018 08:59:57 -0600, "Craig A. Berry" wrote:
>>>> On 2/20/18 4:55 AM, H.Merijn Brand wrote:
>>>>> In perl.git, the branch blead has been updated
>>>>> <>
>>>>> - Log -----------------------------------------------------------------
>>>>> commit 4b262b167490219129b4e3f8e3f6403a31ed95d2
>>>>> Author: H.Merijn Brand <>
>>>>> Date:   Tue Feb 20 11:55:17 2018 +0100
>>>>>      Move note about defective locale on HP-UX to README.hpux
>>> @Karl, I just *moved* the remarks from hints to README, just slightly
>>> altered the text to better fit a README. Did I interpret the original
>>> comment incorrectly?
>> You reversed its meaning by changing "this locale doesn't think" to
>> "this locale thinks."
> Fixed and pushed
>>>>> -----------------------------------------------------------------------
>>>>> Summary of changes:
>>>>>   README.hpux   | 9 +++++++++
>>>>>   hints/ | 5 -----
>>>>>   2 files changed, 9 insertions(+), 5 deletions(-)
>>>>> diff --git a/README.hpux b/README.hpux
>>>>> index e1857e08dc..3bd4be3e3d 100644
>>>>> --- a/README.hpux
>>>>> +++ b/README.hpux
>>>>> @@ -563,6 +563,15 @@ questions about 64-bit numbers when
>>>>> Configure asks you, you may get a configuration that cannot be
>>>>> compiled, or that does not function as expected.
>>>>> +=head2 Locales on HP-UX
>>>>> +
>>>>> +HP-UX installs the locale C<univ.utf8> on all systems. Up to and
>>>>> +including HP-UX 11.23, this local is defective in that it thinks that
>>>>> +the characters C<< $ + < = > ^ ` | >> and C<~> are punctuation, which
>>>>> +they are not according to the Unicode standards.
>>>> I think you said this backwards and reversed "doesn't think" in
>>>> the hints to "thinks".  These characters are members of the punct
>>>> class according to POSIX:
>>>> <>
>>>> It's confusing because the author(s) of the standard decided to
>>>> use the word "punctuation" to describe things that are definitely
>>>> not punctuation, apparently not knowing that characters such as
>>>> mathematical symbols and currency symbols are not punctuation.  As
>>>> a result, ispunct() is useless for identifying punctuation and
>>>> really just identifies printable characters that are not letters,
>>>> numbers, or space.
>>>> I guess the only notable thing for Perl is that it complies with
>>>> the current standard even where the C library uses an older and
>>>> correct but non-standard definition.
>>> I can make it direction-neutral as in
>>> HP-UX installs the locale C<univ.utf8> on all systems. Up to and
>>> including HP-UX 11.23, this locale is defective in that it disagrees
>>> with Unicode on the characters C<< $ + < = > ^ ` | >> and C<~> being
>>> punctuation or not, which they are not according to the Unicode
>>> standards.
>> These *are* punctuation according to POSIX (and apparently Unicode).
>>> @Karl, is en_US.utf8 similarly affected?
>>> If so, should those tests be skipped on all HP-UX <= 11.23 ?
>> I believe Karl plans to remove the warning as it causes trouble on
>> AIX and VMS as well.  The smoke-me/khw-locale branch no longer has
>> those troubles.  To me it would make sense to document somewhere that
>> Perl uses its own standards-compliant character classes even when the
>> local C library doesn't comply with the current standard,
> 👍 +1
>> but giving users a run-time warning that they can't do anything about
>> seems harsh.

I already was in the process of updating perldiag, but I can't push 
until we have a perldelta suitable for 5.27.10.

Craig's emails convinced me to do some more digging, and I found things
that I had forgotten.

These locales are not defective, but I believe the message is warranted 
because perl won't follow the behavior dictated by the locale.

The warning can be squelched by turning off the locale warnings category.

I think the text in README.hpux should be entirely deleted, as the 
perldiag changes should cover that; it isn't just an hpux issue.

A little background:  The only feasible way for UTF-8 locales to be 
implemented was for Perl to just use the code it already had for 
handling UTF-8, and to not actually look at the locale definitions, but 
to assume they were valid UTF-8 once it determined they were intended to 
be UTF-8.

Fast forward several years, and I stumbled across these examples where
the assumption is not valid, which came as a surprise to me.  Perl is
going to continue, at least in the short run, to use its mechanisms to
implement UTF-8 locales, even if that means it doesn't follow the actual
locale definition precisely.  Hence the warning.

Here's the text I'm planning to put into perldiag, which should help
clarify the situation:

=item Locale '%s' contains (at least) the following characters which
have unexpected meanings: %s  The Perl program will use the expected
meanings

(W locale) You are using the named UTF-8 locale.  UTF-8 locales are
expected to have very particular behavior, which most do.  This message
arises when perl found some departures from the expectations, and is
notifying you that the expected behavior overrides these differences.
In some cases the differences are caused by the locale definition being
defective, but the most common causes of this warning are when there are
ambiguities and conflicts in following the Standard, and the locale has
chosen an approach that differs from Perl's.

One of these arises because, contrary to the claims, Unicode is not
completely locale-insensitive.  Turkish and some related languages have
two types of C<"I"> characters.  One is dotted in both upper- and
lowercase, and the other is dotless in both cases.  Unicode allows a
locale to use either the Turkish rules, or the rules used in all other
instances, where there is only one type of C<"I">, which is dotless in
the uppercase, and dotted in the lower.  The perl core does not (yet)
handle the Turkish case, and this message warns you of that.  Instead,
the L<Unicode::Casing> module allows you to mostly implement the Turkish
casing rules.

The other common cause is for the characters

  $ + < = > ^ ` | ~

These are problematic.  The C standard says that these should be
considered punctuation in the C locale (and the POSIX standard defers to
the C standard), and Unicode is generally considered a superset of the C
locale.  But Unicode has added an extra category, "Symbol", and
classifies these particular characters as being symbols.  Most UTF-8
locales have them treated as punctuation, so that L<ispunct(3)> returns
non-zero for them.  But a few locales have it return 0.  Perl takes the
first approach, not using C<ispunct()> at all (see L<Note [5] in
perlrecharclass|perlrecharclass/[5]>), and this message is raised to
notify you that you are getting Perl's approach, not the locale's.


For 5.29, I'll look into how feasible it would be in Turkish locales to 
implement the differences, as Unicode::Casing fails to handle the /i 
pattern matching case properly.

I think the other instance is going to have to remain as-is.  It's 
worked this way for a long time; and the new message merely informs the 
user of the already-existing discrepancy.  Unicode tried to change some 
of this a few years ago (I don't remember the details), and backed off 
due to the claimed breakage it would cause.  I note that later versions 
of HP-UX changed the locale definition to what Perl was already doing,
and I presume they didn't make the change accidentally, but realized it 
was the least worst alternative.

>>>>> +This appears to be fixed on HP-UX 11.31.
>>>>> +
>>>>>   =head2 Oracle on HP-UX
>>>>>   Using perl to connect to Oracle databases through DBI and DBD::Oracle
>>>>> diff --git a/hints/ b/hints/
>>>>> index 3eef0388a7..91a4d7d388 100644
>>>>> --- a/hints/
>>>>> +++ b/hints/
>>>>> @@ -1,10 +1,5 @@
>>>>>   #!/usr/bin/sh
>>>>> -# The locale 'univ.utf8' is defective on some of these systems, as it doesn't
>>>>> -# think that
>>>>> -#   $ + < = > ^ ` | ~
>>>>> -# are punctuation.  This is fixed in 11.31
>>>>> -
>>>>>   # Determine the architecture type of this system.
