On May 29, Jeff 'japhy' Pinyan said:

>On May 27, Jeff 'japhy' Pinyan said:
>
>>I've been trying to add Unicode's regex charclass-magic to Perl (again),
>>and I think I've run into the same problem as before.
>
>I've hit a snag.  If a charclass has intersection or subtraction in it,
>and locale is on or \p{...} classes are used, the charclass must be
>represented ENTIRELY as those "+utf8::XXX" strings.  Here's why.
>
>If locale is on, then a charclass like [[\w&&[\d]][aeiou]] will have to be
>represented as "+utf8::IsAlnum &utf::IsDigit +utf8::Is_aeiou_" (or
>something like that), because (since locale is on) \w doesn't modify the
>charclass's bitmap array, but just turns on the ANYOF_ALNUM flag.  Since
>precedence is an issue, we can't just check flags.
>
>This will mean we'll be suffering some inefficiency (but that should be
>expected with Unicode right now, right?).  It also means I need to come up
>with on-the-fly Unicode classes that match a specific set of characters I
>decide on at that moment.  What's the easiest way to do that?  I need to
>know this to get intersection and subtraction working.

I found another problem with utf8_heavy.pl's operation.  It separates the
+utf8::xxx lines from the 0123\t0456 lines when evaluating whether a
character matches a UTF description.  Example:

sub InJaphy {
  << "END";
+utf8::IsAlnum
-utf8::IsDigit
0031\t0033
END
}

should match any alphanum that ISN'T a digit, except that 1, 2, and 3 are
ok to match.  It should match "abc123def" out of "abc123def456ghi", but it
doesn't, because the 0031\t0033 line is placed in the swash first, and
THEN the +utf8 and -utf8 lines are parsed.

I don't think this is defined well anywhere in the docs (that I could
find).  I can fix this (and I think it should be fixed).  Does anyone
agree with me?

-- 
Jeff "japhy" Pinyan      japhy@pobox.com      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
<stu> what does y/// stand for?
<tenderpuss> why, yansliterate of course.
[ I'm looking for programming work.  If you like my work, let me know. ]
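
For anyone who wants to see the ordering issue in isolation, here is a small model of it in plain Perl. It is only a sketch, not code from utf8_heavy.pl: the helpers apply_line, build_set and first_run are hypothetical names, the two named classes are ASCII-only stand-ins for utf8::IsAlnum and utf8::IsDigit, and the "ranges first" mode simply imitates the behaviour described in the message by moving the hex-range lines ahead of the +/- lines.

```perl
use strict;
use warnings;

# ASCII-only stand-ins for the real properties (an assumption for this model).
my %named = (
    'utf8::IsAlnum' => sub { chr($_[0]) =~ /\A[A-Za-z0-9]\z/ },
    'utf8::IsDigit' => sub { chr($_[0]) =~ /\A[0-9]\z/ },
);

# Apply one definition line to the set (a hash of code point => 1).
sub apply_line {
    my ($set, $line) = @_;
    if ($line =~ /^([-+])(\S+)$/) {
        my ($op, $name) = ($1, $2);
        my $test = $named{$name} or die "unknown class: $name";
        for my $cp (0 .. 127) {
            next unless $test->($cp);
            if ($op eq '+') { $set->{$cp} = 1 } else { delete $set->{$cp} }
        }
    }
    elsif ($line =~ /^([0-9A-Fa-f]+)(?:\t([0-9A-Fa-f]+))?$/) {
        my ($lo, $hi) = (hex $1, hex(defined $2 ? $2 : $1));
        $set->{$_} = 1 for $lo .. $hi;
    }
}

# Build the set; if $ranges_first is true, hex-range lines are applied
# before any +/- lines, whatever order they were written in.
sub build_set {
    my ($ranges_first, @lines) = @_;
    @lines = ((grep { !/^[-+]/ } @lines), (grep { /^[-+]/ } @lines))
        if $ranges_first;
    my %set;
    apply_line(\%set, $_) for @lines;
    return \%set;
}

# Return the first maximal run of characters in $str that are in the set.
sub first_run {
    my ($set, $str) = @_;
    my $run = '';
    for my $ch (split //, $str) {
        if ($set->{ ord $ch }) { $run .= $ch }
        elsif (length $run)    { last }
    }
    return $run;
}

my @def = ('+utf8::IsAlnum', '-utf8::IsDigit', "0031\t0033");
my $in_order     = first_run(build_set(0, @def), 'abc123def456ghi');
my $ranges_first = first_run(build_set(1, @def), 'abc123def456ghi');

print "in order:     $in_order\n";      # abc123def
print "ranges first: $ranges_first\n";  # abc
```

In this model, applying the lines in the order written keeps 1, 2 and 3 in the set, while applying the range line first lets the later -utf8::IsDigit line remove them again.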