Re: possible [PATCH] for EBCDIC: IS_UTF8_CHAR()

From: Jarkko Hietaniemi
Date: November 30, 2005 04:37
Subject: Re: possible [PATCH] for EBCDIC: IS_UTF8_CHAR()
Message ID: 438D9D0E.9090304@gmail.com

SADAHIRO Tomoyuki wrote:
>> So I decided to look into the matter, and attached is a patch that tries
>> to implement the IS_UTF8_CHAR() speedup also for (UTF-)EBCDIC platforms.
>>
>> There's a BIG caveat here, though: I only did the patch "on paper", by
>> glaring at the UTF-EBCDIC Unicode Technical Report, since I no longer have
>> access to an EBCDIC platform.
> 
> I'm afraid your patch doesn't work.
> UTF-EBCDIC requires transformation of octet sequences of I8-Sequence
> (or UTF-8-Mod) using a reversible one-to-one mapping.

Ok, thanks, I thought I had to be assuming something too simple...
So forget my patch.

> #define IS_UTF8_CHAR_1(p)	\
>         (NATIVE_TO_UTF((p)[0]) <= 0x9F)

Hey, at least I got something right!

> #define IS_UTF8_CHAR_2(p)	\
>   (NATIVE_TO_UTF((p)[0]) >= 0xC5 && NATIVE_TO_UTF((p)[0]) <= 0xDF && \
>    NATIVE_TO_UTF((p)[1]) >= 0xA0 && NATIVE_TO_UTF((p)[1]) <= 0xBF)
> 
> Cf.
> (1) a regex matching a character in I8-Sequence (only shortest forms)
> 
> my $qr =
> qr/^(?:
>     [\x00-\x9F]                     # 00000000-0000009F
>   | [\xC5-\xDF][\xA0-\xBF]          # 000000A0-000003FF
>   | [\xE1-\xEF][\xA0-\xBF]{2}       # 00000400-00003FFF
>   | \xF0[\xB0-\xBF][\xA0-\xBF]{2}   # 00004000-00007FFF
>   | [\xF1-\xF7][\xA0-\xBF]{3}       # 00008000-0003FFFF
>   | \xF8[\xA8-\xBF][\xA0-\xBF]{3}   # 00040000-000FFFFF
>   | [\xF9-\xFB][\xA0-\xBF]{4}       # 00100000-003FFFFF
>   | \xFC[\xA4-\xBF][\xA0-\xBF]{4}   # 00400000-01FFFFFF
>   | \xFD[\xA0-\xBF][\xA0-\xBF]{4}   # 02000000-03FFFFFF
>   | \xFE[\xA2-\xBF][\xA0-\xBF]{5}   # 04000000-3FFFFFFF
>   | \xFF[\xA0-\xBF]{6}              # 40000000-7FFFFFFF
> )\z/x;
> 
> (2) a regex matching a character in UTF-EBCDIC for CP1047
>     (only shortest forms)

Given that the resulting state machines (regexes) are so complicated...

(1) they should be generated by machine, not by hand

(2) even if they are generated, will making an EBCDIC-compatible
    IS_UTF8_CHAR() macro speed things up enough to be worth it?
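As a rough illustration of point (1), the I8-Sequence half of such a
table could be derived from the bit layout instead of being typed in.
The following is only a sketch: the lead-byte markers and payload bit
counts are inferred from the ranges in the quoted regex (1) and should
be checked against the UTF-EBCDIC Technical Report, and the names
i8_max(), i8_encode() and i8_print_table() are hypothetical, not anything
in the perl source. It works purely on I8 bytes; a real generator would
still have to push each byte through the I8-to-native mapping for the
code page (for example CP1047), which is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Lead-byte marker and payload bit count, indexed by sequence length.
 * ASSUMPTION: inferred from the ranges in regex (1); verify against UTR #16. */
static const unsigned char i8_lead_mark[8] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE};
static const int i8_lead_bits[8] = {0, 0, 5, 4, 3, 2, 1, 1};

/* Largest code point that fits in an n-byte I8 sequence. */
uint32_t i8_max(int n)
{
    assert(n >= 1 && n <= 7);
    if (n == 1)
        return 0x9F;
    /* each continuation byte (0xA0..0xBF) carries 5 payload bits */
    return ((uint32_t)1 << (i8_lead_bits[n] + 5 * (n - 1))) - 1;
}

/* Encode cp in shortest form into out[]; returns the byte count,
 * or 0 if cp is above 0x7FFFFFFF. */
int i8_encode(uint32_t cp, unsigned char out[7])
{
    int n = 1, i;

    while (n <= 7 && cp > i8_max(n))
        n++;
    if (n > 7)
        return 0;
    if (n == 1) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    for (i = n - 1; i >= 1; i--) {
        out[i] = (unsigned char)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
    out[0] = (unsigned char)(i8_lead_mark[n] | cp);
    return n;
}

/* Print, for each length, the encodings of the first and last code point. */
void i8_print_table(FILE *fh)
{
    uint32_t lo = 0;
    int n, i, len;
    unsigned char buf[7];

    for (n = 1; n <= 7; n++) {
        uint32_t hi = i8_max(n);

        len = i8_encode(lo, buf);
        for (i = 0; i < len; i++)
            fprintf(fh, "%02X ", buf[i]);
        fprintf(fh, ".. ");
        len = i8_encode(hi, buf);
        for (i = 0; i < len; i++)
            fprintf(fh, "%02X ", buf[i]);
        fprintf(fh, "  # %08lX-%08lX\n", (unsigned long)lo, (unsigned long)hi);
        lo = hi + 1;
    }
}
```

Calling i8_print_table(stdout) from a small main() prints one line per
sequence length, which is coarser than the rows of regex (1): the rows
with a restricted first continuation byte (the \xF0, \xF8, \xFC and \xFE
cases) show up only as the first encoding of their length.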

> my $qr =
> qr/^(?:
>    [\x00-\x40\x4B-\x50\x5A-\x61\x6B-\x6F\x79-\x7F] #\
>  | [\x81-\x89\x91-\x99\xA1-\xA9\xAD\xBD]           # } 00000000-0000009F
>  | [\xC0-\xC9\xD0-\xD9\xE0\xE2-\xE9\xF0-\xF9\xFF]  #/
>  | [\x80\x8A-\x90\x9A-\xA0\xAA-\xAC\xAE-\xB6]        # 000000A0-000003FF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]
>  | [\xB8-\xBC\xBE-\xBF\xCA-\xCF\xDA-\xDB]            # 00000400-00003FFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{2}
>  | \xDC[\x57-\x59\x62-\x6A\x70-\x73]                 # 00004000-00007FFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{2}
>  | [\xDD-\xDF\xE1\xEA-\xEC]                          # 00008000-0003FFFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{3}
>  | \xED[\x49-\x4A\x51-\x59\x62-\x6A\x70-\x73]        # 00040000-000FFFFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{3}
>  | [\xEE-\xEF\xFA]                                   # 00100000-003FFFFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{4}
>  | \xFB[\x45-\x4A\x51-\x59\x62-\x6A\x70-\x73]        # 00400000-01FFFFFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{4}
>  | \xFC[\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]        # 02000000-03FFFFFF 
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{4}
>  | \xFD[\x43-\x4A\x51-\x59\x62-\x6A\x70-\x73]        # 04000000-3FFFFFFF
>    [\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{5}
>  | \xFE[\x41-\x4A\x51-\x59\x62-\x6A\x70-\x73]{6}     # 40000000-7FFFFFFF
> )\z/x;
> 
> Regards,
> SADAHIRO Tomoyuki
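
For comparison, regex (1) quoted above can be written as a table-driven
C check, which is roughly the shape a full set of IS_UTF8_CHAR_n()
macros would take once NATIVE_TO_UTF() has been applied to each byte.
This is a sketch under assumptions: is_i8_char() and struct i8_row are
hypothetical names, the rows are copied from regex (1) and nothing else,
and the input is taken to be I8 bytes already (on an EBCDIC platform
each native byte would first go through NATIVE_TO_UTF()). Unlike the
regex, which is anchored with ^ and \z, the function only looks at the
start of the buffer and reports how many bytes the character occupies.

```c
#include <assert.h>
#include <stddef.h>

/* One row of regex (1): lead byte range, minimum value of the first
 * continuation byte, and the number of continuation bytes. */
struct i8_row {
    unsigned char lead_lo, lead_hi, first_lo;
    size_t ncont;
};

static const struct i8_row i8_rows[] = {
    {0xC5, 0xDF, 0xA0, 1},   /* 000000A0-000003FF */
    {0xE1, 0xEF, 0xA0, 2},   /* 00000400-00003FFF */
    {0xF0, 0xF0, 0xB0, 3},   /* 00004000-00007FFF */
    {0xF1, 0xF7, 0xA0, 3},   /* 00008000-0003FFFF */
    {0xF8, 0xF8, 0xA8, 4},   /* 00040000-000FFFFF */
    {0xF9, 0xFB, 0xA0, 4},   /* 00100000-003FFFFF */
    {0xFC, 0xFC, 0xA4, 5},   /* 00400000-01FFFFFF */
    {0xFD, 0xFD, 0xA0, 5},   /* 02000000-03FFFFFF */
    {0xFE, 0xFE, 0xA2, 6},   /* 04000000-3FFFFFFF */
    {0xFF, 0xFF, 0xA0, 6},   /* 40000000-7FFFFFFF */
};

/* Returns the length in bytes of the shortest-form I8 character at the
 * start of p (at most avail bytes are examined), or 0 if there is none. */
size_t is_i8_char(const unsigned char *p, size_t avail)
{
    size_t r, i;

    assert(p != NULL);
    if (avail == 0)
        return 0;
    if (p[0] <= 0x9F)                 /* 00000000-0000009F */
        return 1;
    for (r = 0; r < sizeof i8_rows / sizeof i8_rows[0]; r++) {
        const struct i8_row *row = &i8_rows[r];

        if (p[0] < row->lead_lo || p[0] > row->lead_hi)
            continue;
        if (avail < 1 + row->ncont)   /* truncated sequence */
            return 0;
        if (p[1] < row->first_lo || p[1] > 0xBF)
            return 0;
        for (i = 2; i <= row->ncont; i++)
            if (p[i] < 0xA0 || p[i] > 0xBF)
                return 0;
        return 1 + row->ncont;
    }
    return 0;                         /* 0xA0-0xC4 and 0xE0 never lead */
}
```

A table and loop like this is easy to check against the regex, but it
is slower than unrolled macros, so it does not settle question (2).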