develooper Front page | perl.perl5.porters | Postings from October 2018

[perl #133588] Symbol for 'micro' is erroneously uppercased to Greek 'MU'.

From: Dan Book via RT
Date: October 15, 2018 16:21
Subject: [perl #133588] Symbol for 'micro' is erroneously uppercased to Greek 'MU'.
Message ID: rt-4.0.24-18391-1539620475-603.133588-15-0@perl.org

Since the RT interface uselessly hides the original message, it is reproduced below.

> The uc() function converts the UTF-8 symbol for 'micro' into an upper
> case Greek 'mu', which is incorrect.  The 'micro' symbol has no upper
> case equivalent and should remain unchanged by uc().
>
> perl -e 'use feature unicode_strings ; binmode STDOUT, ":encoding(UTF-8)" ; my $txt = "\xce\xbc\xc2\xb5" ; print utf8::decode($txt), "\n" ; print $txt. "=>", uc($txt), "\n"'
>
> A latin-1 'micro' symbol is also converted by uc() to the UTF-8
> upper-case Greek 'mu' which can result in a string with mixed
> encoding.  Not pretty.
>
> perl -e 'use feature unicode_strings ; binmode STDOUT, ":encoding(UTF-8)" ; my $txt = shift ; print uc($txt), "\n"' Telecomunicações
> TELECOMUNICAçÃΜES
>
> Behaviour is the same in Perl 5.20 and Perl 5.24 in both cases.

Your message is a bit imprecise, which may stem from some confusion between UTF-8 and Unicode. These are not UTF-8 or Latin-1 symbols; they are Unicode characters. UTF-8 and Latin-1 are character encodings that represent Unicode code points as bytes (Latin-1 covers only U+0000 through U+00FF).

To operate correctly on Unicode characters, you must first decode them from whatever encoding they are in (usually UTF-8 on modern systems). The first issue you report can be better reproduced without involving character encodings:

$ perl -MEncode -E'say sprintf "%vX", uc("\N{U+03BC}")'
39C

$ perl -MEncode -E'say sprintf "%vX", uc("\N{U+00B5}")'
39C

This is correct according to the current Unicode Character Database; both U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU list U+039C GREEK CAPITAL LETTER MU as their uppercase mapping.
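
The mapping can also be read straight from the Unicode Character Database that ships with Perl. This is a minimal sketch, assuming the core Unicode::UCD module; the helper name upper_of is invented here for illustration and is not part of any module.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# upper_of is a hypothetical helper for this example. It returns the
# simple uppercase mapping recorded for a code point as a hex string,
# or an empty string if the code point is unknown or has no mapping.
sub upper_of {
    my ($codepoint) = @_;
    my $info = charinfo($codepoint) or return '';
    return $info->{upper};
}

# Print the character name and its uppercase mapping for both code points.
for my $cp (0x00B5, 0x03BC) {
    printf "U+%04X %s -> U+%s\n", $cp, charinfo($cp)->{name}, upper_of($cp);
}
```

Looking the mapping up by code point avoids any dependence on how the string holding the character happens to be stored internally.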

The second issue you report comes from the encoding misunderstanding mentioned above. The shell passes your input text as UTF-8 bytes, so it must first be decoded. Without decoding, the two bytes of 'õ' (0xC3 0xB5) are treated as the separate characters U+00C3 and U+00B5, and uc() maps U+00B5 to U+039C, which is the 'Μ' in your output. Your binmode on STDOUT then encodes the result back to UTF-8 for output to the terminal. If you handle only one of these directions, the result will be garbled, as you saw.

$ perl -MEncode -E'say encode "UTF-8", uc(decode "UTF-8", shift)' Telecomunicações
TELECOMUNICAÇÕES
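
To make the difference visible, the following sketch uppercases the same input once as raw bytes and once after decoding, then prints the code points of the last four characters of each result. It assumes the argument arrives as UTF-8 bytes; the bytes are hard-coded here so the script does not depend on the shell.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(decode encode);

# The UTF-8 bytes of "Telecomunicações", as a shell would pass them in @ARGV.
my $bytes = "Telecomunica\xC3\xA7\xC3\xB5es";

# Without decoding, every byte counts as one character, so byte 0xB5 is
# taken for U+00B5 MICRO SIGN and uppercased to U+039C.
my $wrong = uc $bytes;

# Decoding first turns the byte pairs into the characters U+00E7 and U+00F5.
my $right = uc decode('UTF-8', $bytes);

# %vX prints each character's code point in hex, joined by dots.
printf "undecoded: %vX\n", substr($wrong, -4);
printf "decoded:   %vX\n", substr($right, -4);

# Encode explicitly on the way out, since STDOUT has no encoding layer here.
print encode('UTF-8', $right), "\n";
```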

This can also be handled automatically for one-liners using the -C command line switch, as documented at https://perldoc.pl/perlrun#-C-%5Bnumber/list%5D .

$ perl -CSAD -E'say uc(shift)' Telecomunicações
TELECOMUNICAÇÕES
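
In a script file, a similar effect can be had without the switch. This is only a sketch of a rough equivalent, not an exact one: -C applies the lax :utf8 layer, while this version uses the stricter :encoding(UTF-8). The helper name decode_args is made up for this example.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # roughly what -CS and -CD do
use Encode qw(decode);

# decode_args is a hypothetical helper: it decodes each argument from
# UTF-8 bytes into characters, roughly what -CA does for @ARGV.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# STDOUT already has an encoding layer, so uppercased characters can be
# printed directly.
print uc($_), "\n" for decode_args(@ARGV);
```

Saved as, say, upper.pl (a file name chosen for illustration), it would be run as: perl upper.pl Telecomunicações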

See https://stackoverflow.com/a/6163129/5848200 for more information.

-Dan

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=133588
