[perl #96592] 4 pod casemapping errors: s/(\w+)/\u\L$1/g is always wrong

From: James E Keenan via RT
Date: September 25, 2012 20:10
Subject: [perl #96592] 4 pod casemapping errors: s/(\w+)/\u\L$1/g is always wrong
Message ID: rt-3.6.HEAD-11172-1348629010-1850.96592-15-0@perl.org

On Mon Aug 08 16:17:27 2011, tom christiansen wrote:
> These are all in error:
> 
>     perldata.pod:        s/(\w+)/\u\L$1/g;   # "titlecase" words

The first instance could simply be deleted, since the feature it
documents has nothing to do with \u\L.

>     perlfaq4.pod:	$string =~ s/([\w']+)/\u\L$1/g;
>     perlop.pod:    substr($str, -30) =~ s/\b(\p{Alpha}+)\b/\u\L$1/g;
>     perlretut.pod:string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
> 
> They don't work because you cannot guarantee a correct titlecase
> mapping if you first send it through lowercase.  There are no
> roundtrip guarantees with Unicode casemapping.
> 
> Here are two places where you get an error doing it the way
> the pods erroneously suggest, but there are others:
> 
>     orig  => İ is 0130
>        lc => i̇ is 0069.0307
>     tc    => İ is 0130
>     tc lc => İ is 0049.0307     (wrong answer)
> 
>     orig  => ẞ is 1E9E
>        lc => ß is 00DF
>     tc    => ẞ is 1E9E
>     tc lc => Ss is 0053.0073    (wrong answer)
> 
> The correct approach requires something more like
> 
>     s/\b(\w)(\w*)\b/\u$1\L$2/g;  # "titlecase" "words"
> 
> Because casemapA(string) is never guaranteed to be the
> same as casemapA(casemapB(string)).
> 
> --tom
> 
> #!/usr/bin/env perl
> 
> use utf8;
> 
> use v5.14;
> use strict;
> use warnings;
> use open qw(:std :encoding(UTF-8));
> use charnames qw(:full);
> 
> my @chars = (
>     "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}",
>     "\N{GREEK CAPITAL THETA SYMBOL}",
>     "\N{LATIN CAPITAL LETTER SHARP S}",
>     "\N{OHM SIGN}",
>     "\N{KELVIN SIGN}",
>     "\N{ANGSTROM SIGN}",
> 
>     "\N{LATIN SMALL LETTER SHARP S}",
>     "\N{GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI}",
>     "\N{LATIN SMALL LIGATURE FF}",
>     "\N{LATIN SMALL LIGATURE FFI}",
>     "\N{LATIN SMALL LIGATURE LONG S T}",
>     "\N{LATIN SMALL LIGATURE ST}",
> );
> 
> sub report($$;$) {
>     my ($what, $str, $ok) = @_;
>     my $mask = "%-5s => %-3s is %v04X\n";
>     if (@_ == 3) {
>         $mask =~ s/\n/\t%s\n/;
>     }
>     printf $mask, $what, ($str) x 2, $ok;
> }
> 
> for my $char (@chars) {
>     my $lc         =         lc $char;
>     my $tc_good    = ucfirst    $char;
>     my $tc_bad_lc  = ucfirst lc $char;
>     my $tc_bad_uc  = ucfirst uc $char;
> 
>     report "orig " => $char;
>     report "   lc" => $lc;
>     report "tc   " => $tc_good,   "real";
>     report "tc lc" => $tc_bad_lc,
>         ($tc_good eq $tc_bad_lc) ? "RIGHT" : "WRONG";
>     report "tc uc" => $tc_bad_uc,
>         ($tc_good eq $tc_bad_uc) ? "RIGHT" : "WRONG";
>     print "\n";
> }
> 
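
For anyone drafting the patch, here is a minimal sketch that puts the
idiom the pods currently show next to the substitution Tom proposes.
The sub names titlecase_naive and titlecase_words are hypothetical,
chosen only for this illustration; neither appears in the pods. It
assumes a perl of 5.14 or later, as Tom's script does, and prints only
hex code points so that no output encoding layer is needed.

```perl
use v5.14;
use strict;
use warnings;

# Hypothetical helper: the idiom currently shown in the pods.
# The whole word is lowercased first, then its first character is
# titlecased, so the original first character is never seen by \u.
sub titlecase_naive {
    my ($s) = @_;
    $s =~ s/(\w+)/\u\L$1/g;
    return $s;
}

# Hypothetical helper: the substitution suggested in the ticket.
# The first character is titlecased from its original form and only
# the remainder of the word is lowercased.
sub titlecase_words {
    my ($s) = @_;
    $s =~ s/\b(\w)(\w*)\b/\u$1\L$2/g;
    return $s;
}

# U+1E9E LATIN CAPITAL LETTER SHARP S and a word starting with
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, plus a plain ASCII word.
for my $word ("\x{1E9E}", "\x{0130}STANBUL", "hELLO") {
    printf "orig %vX  naive %vX  split %vX\n",
        $word, titlecase_naive($word), titlecase_words($word);
}
```

For the two non-ASCII inputs the naive form should reproduce the
"tc lc" results quoted above, while the split form leaves the original
first character intact.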

Can anyone provide a documentation patch?

Thank you very much.
Jim Keenan

---
via perlbug:  queue: perl5 status: new
https://rt.perl.org:443/rt3/Ticket/Display.html?id=96592


