With Randy's tip and my discovery of the Unicode::Normalize module,
I've gotten things worked out.
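use Unicode::Normalize qw(compose);
use Encode qw(decode_utf8);
...
# Decode the raw UTF-8 bytes of the filename parameter into Perl characters.
my $f = decode_utf8(param('file'));
# ... write out the file itself with name in decomposed utf-8
# Recombine base letters with their combining marks
# (e.g. U+0065 U+0301 becomes U+00E9).
$f = compose($f);
# ... now do something with filename in composed utf-8
# etc.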
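
For anyone finding this in the archives, here is a minimal, self-contained
sketch of the same decode-then-compose step. The byte string is an assumed
stand-in for whatever param('file') returns (it is not real data from my
script), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(compose NFC);

# Assumed sample input: UTF-8 bytes for "e" followed by U+0301
# (combining acute accent), i.e. a decomposed e-acute, plus ".txt".
my $bytes = "e\xCC\x81.txt";

# Turn the UTF-8 bytes into Perl characters; the name is still decomposed.
my $decomposed = decode_utf8($bytes);

# Recombine the base letter and the combining mark into U+00E9.
my $composed = compose($decomposed);

printf "%d characters before, %d after\n",
    length($decomposed), length($composed);    # 6 characters before, 5 after
printf "first character: U+%04X\n", ord($composed);    # first character: U+00E9
```

compose() assumes its input is already decomposed and canonically ordered,
which holds for the HFS+ names discussed here. NFC() from the same module
decomposes and reorders first, so it is probably the safer choice when the
input form is not guaranteed.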
Thanks to everyone who helped out. I'm not sure what to do with my day
now.
Andrew
On Apr 7, 2005, at 1:57 PM, Randy Boring wrote:
>> I've noticed that the non-ASCII characters are getting split into their
>> base code points. For example, U+00E9, Latin small letter E with acute,
>> becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf). Is there a way
>> to easily recombine the code points to get the original value? It's
>> strange to me that Encode::decode_utf8 doesn't do this. I thought
>> diacritical marks were always combined with their preceding letter, if
>> possible.
>>
>> Andrew
>
> You've run into the particular format of HFS+ filenames. It's not just
> any UTF-8 encoding: nearly all of the Unicode characters that are
> decomposable are stored decomposed, and must be so!
>
> In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's
> referred to as kUnicodeCanonicalDecompVariant.
> In NSString.h there are methods such as
> decomposedStringWithCanonicalMapping (plus precomposed- and
> -CompatibilityMapping variants). How you get to them from Perl, tho....
> maybe CamelBones?
>
> A description of this text encoding (and the reason for it) is found at
> http://developer.apple.com/technotes/tn/tn1150.html
>
> see especially
> http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
> and
> http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
>
>
> Hope that helps a little,
>
> -Randy
>