develooper Front page | perl.perl5.porters | Postings from March 2007

[PATCH] Re: [PATCH] lib/Pod/Html.pm plus a funky UT8-8 regex bug

From: Jarkko Hietaniemi
Date: March 21, 2007 05:01
Subject: [PATCH] Re: [PATCH] lib/Pod/Html.pm plus a funky UT8-8 regex bug
Message ID: 46011E8A.1080009@iki.fi

>> when matching with $text as 'Mat<!>':
> 
> I dont see how to replicate these results to investigate further.

Yeah, that's the wonderful thing about locales: each vendor has their
own definitions...

> Could you help me out by giving me the debug output from:
> 
>   'Mat<!>'=~/[[:punct:]\s]+/
> 
> under both cases please?
> 
> and or
> 
> $str='Mat<!>';
> $str=~s/[[:punct:]\s]+//g;
> 
> under use locale and not as well?

Without locale no difference, with locale a difference:

@@ -6,8 +6,8 @@
 stclass ANYOF{loc}[\s[:punct:]+utf8::IsPunct +utf8::IsSpacePerl] plus minlen 1
 Matching REx "[[:punct:]\s]+" against "Mat<!>"
 Matching stclass ANYOF{loc}[\s[:punct:]+utf8::IsPunct +utf8::IsSpacePerl] against "Mat<!>" (6 chars)
-   3 <Mat> <<!>>             |  1:PLUS(14)
-                                  ANYOF{loc}[\s[:punct:]+utf8::IsPunct +utf8::IsSpacePerl] can match 3 times out of 2147483647...
-   6 <Mat<!>> <>             | 14:  END(0)
+   4 <Mat<> <!>>             |  1:PLUS(14)
+                                  ANYOF{loc}[\s[:punct:]+utf8::IsPunct +utf8::IsSpacePerl] can match 1 times out of 2147483647...
+   5 <Mat<!> <>>             | 14:  END(0)
 Match successful!
 Freeing REx: "[[:punct:]\s]+
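
For anyone wanting to poke at this on their own platform, here is a minimal sketch of the substitution from the quoted request, run once with the default rules and once under 'use locale'. The sub names strip_plain and strip_locale are made up for illustration; what the locale variant returns depends entirely on the vendor's locale definitions and on the LC_ALL / LC_CTYPE setting in the environment, so no output is claimed for it.

```perl
use strict;
use warnings;

# Strip punctuation and whitespace using Perl's default (non-locale) rules.
sub strip_plain {
    my ($str) = @_;
    $str =~ s/[[:punct:]\s]+//g;
    return $str;
}

# Same substitution, but compiled in the lexical scope of "use locale",
# so [[:punct:]] and \s may consult the current locale's definitions.
sub strip_locale {
    my ($str) = @_;
    use locale;
    $str =~ s/[[:punct:]\s]+//g;
    return $str;
}

my $text = 'Mat<!>';
printf "no locale:  '%s'\n", strip_plain($text);
printf "use locale: '%s'\n", strip_locale($text);
```

Run it as e.g. `env LC_ALL=fi_FI.UTF-8 perl script.pl` (script.pl being whatever you saved it as) and compare the two lines.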

But the 'use locale' actually rang a bell for me -- that's what
Pod::Html was using and I hadn't noticed, so I repeated my [[:punct:]]
test:

env LC_ALL=fi_FI.UTF-8 ./perl -Ilib -e \
'for(0..127){$c=chr($_);print $c if $c =~ /[[:punct:]]/};print "\n"'|hex
00000010 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 3a !"#$%&'()*+,-./:
00000020 3b 3c 3d 3e 3f 40 5b 5c 5d 5e 5f 60 7b 7c 7d 7e ;<=>?@[\]^_`{|}~
00000021 0a

env LC_ALL=fi_FI.UTF-8 ./perl -Ilib -Mlocale -e \
'for(0..127){$c=chr($_);print $c if $c =~ /[[:punct:]]/};print "\n"'|hex
00000010 21 22 23 25 26 27 28 29 2a 2c 2d 2e 2f 3a 3b 3f !"#%&'()*,-./:;?
00000018 40 5b 5c 5d 5f 7b 7d 0a                         @[\]_{}.

So with a UTF-8 locale and -Mlocale, the characters [$+<=>^`|~] are
*not* [[:punct:]] -- at least on Tru64 (I tested a couple of other
UTF-8 locales in addition to Finnish).
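
The set of dropped characters can be read off mechanically from the two dumps. The following is only a sketch: the byte lists are copied from the hex output above (trailing newline left out), and the sub name missing_under_locale is invented for the example.

```perl
use strict;
use warnings;

# Bytes printed by the run without -Mlocale (from the first dump).
my @without_locale = map { hex($_) } qw(
    21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 3a
    3b 3c 3d 3e 3f 40 5b 5c 5d 5e 5f 60 7b 7c 7d 7e
);

# Bytes printed by the run with -Mlocale (from the second dump).
my @with_locale = map { hex($_) } qw(
    21 22 23 25 26 27 28 29 2a 2c 2d 2e 2f 3a 3b 3f
    40 5b 5c 5d 5f 7b 7d
);

# Characters that are [[:punct:]] without locale but not with it.
sub missing_under_locale {
    my %seen = map { $_ => 1 } @with_locale;
    return join '', map { chr($_) } grep { !$seen{$_} } @without_locale;
}

print missing_under_locale(), "\n";   # prints $+<=>^`|~
```

That is 9 of the 32 ASCII punctuation characters, including both '<' and '>' used by the test string.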

So nothing is wrong in the regex engine. Either the test needs adjusting
to use something other than '<>' (but then we are playing the game of
"guess which characters are [[:punct:]] in which locale"), or
fragment_id_readable() needs fortifying, as I did in the patch.

But please find attached an even more fortified version of the patch,
where I explicitly use [^A-Za-z0-9_] instead of relying on \W.
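
To illustrate the idea only (this is not the attached patch and not the real fragment_id_readable()), here is a sketch with a hypothetical helper, ascii_word_only, that spells the character class out. An explicit class like this is matched by code point, so it should not depend on what the locale considers a word or punctuation character.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the patched fragment_id_readable():
# collapse every run of characters outside [A-Za-z0-9_] into a single
# underscore, then trim underscores from both ends.
sub ascii_word_only {
    my ($text) = @_;
    $text =~ s/[^A-Za-z0-9_]+/_/g;   # explicit class instead of \W
    $text =~ s/^_+|_+$//g;           # trim leading/trailing "_"
    return $text;
}

print ascii_word_only('Mat<!>'), "\n";    # prints Mat
print ascii_word_only('foo bar'), "\n";   # prints foo_bar
```

The underscore replacement and the trimming are arbitrary choices for the example; the point is just that '<', '!' and '>' are handled the same way whatever the locale says about them.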

> I cant see any reason that this doesnt work as expected. And when i
> try it here it does work as expected. :-()
> 
> Cheers,
> Yves
> 
> 

