perl.perl5.porters

another attempt at adding unicode regex support to perl

From: Jeff 'japhy' Pinyan
Date: May 27, 2003 19:28
Subject: another attempt at adding unicode regex support to perl
Message ID: Pine.LNX.4.44.0305272118270.14148-100000@perlmonk.org

I've been trying to add Unicode's regex charclass-magic to Perl (again),
and I think I've run into the same problem as before.

Here's the situation... [\p{IsAlpha}\p{IsAlnum}] yields the info string
"+utf8::IsAlpha +utf8::IsAlnum".  There are handlers for those in
utf8_heavy.pl.  But [\p{IsAlpha}&&\p{IsAlnum}] (hypothetical syntax for
the intersection of the two classes) has no way of being represented in
the swash info string.  So that means I need to come up with more sigils.

Here's what I have in mind:

SIGIL	MEANING
+	do     match X if X is     in this class
!	do     match X if X is NOT in this class
-	do NOT match X if X is     in this class
&	do NOT match X if X is NOT in this class

The last sigil, &, is the only one I've made up.  It would make the
hypothetical class above be represented by "+utf8::IsAlpha &utf8::IsAlnum".

This solves the problem of &&'ing with negated classes.  Let's say
\p{AtoZ} is [a-z] and \p{Vowels} is [aeiou].  [\p{AtoZ}&&\P{Vowels}] is
represented by "+utf8::AtoZ -utf8::Vowels".
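
To make the four sigils concrete, here is a minimal sketch in C of how
an info string could be evaluated for one character.  It assumes the
entries are applied left to right, starting from "no match"; that
ordering is my reading of the table, not something utf8_heavy.pl is
known to do.  info_matches(), in_class() and the two hard-coded classes
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for \p{AtoZ} and \p{Vowels} from the post. */
static bool in_atoz(int c)   { return c >= 'a' && c <= 'z'; }
static bool in_vowels(int c)
{
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

/* Compare a (pointer, length) name against a string literal. */
static bool name_is(const char *name, size_t len, const char *lit)
{
    return len == strlen(lit) && strncmp(name, lit, len) == 0;
}

/* Illustrative lookup; an unknown class is treated as empty. */
static bool in_class(const char *name, size_t len, int c)
{
    if (name_is(name, len, "utf8::AtoZ"))   return in_atoz(c);
    if (name_is(name, len, "utf8::Vowels")) return in_vowels(c);
    return false;
}

/* Apply each "<sigil><class>" entry of a space-separated info string
 * in order, assuming the semantics of the sigil table. */
bool info_matches(const char *info, int c)
{
    bool match = false;
    assert(info != NULL);

    while (*info) {
        while (*info == ' ') info++;        /* skip separators */
        if (!*info) break;

        char sigil = *info++;
        size_t len = strcspn(info, " ");    /* length of class name */
        bool in = in_class(info, len, c);
        info += len;

        switch (sigil) {
        case '+': if (in)  match = true;  break; /* match if in class      */
        case '!': if (!in) match = true;  break; /* match if NOT in class  */
        case '-': if (in)  match = false; break; /* reject if in class     */
        case '&': if (!in) match = false; break; /* reject if NOT in class */
        default:  return false;                  /* unknown sigil          */
        }
    }
    return match;
}
```

Under these assumptions, info_matches("+utf8::AtoZ &utf8::Vowels", 'e')
is true and the same string rejects 'b', which is the intersection
behaviour the & sigil is meant to provide.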

Also, [\p{AtoZ}&&\P{Vowels}] could be written as
[\p{AtoZ}&&[^\p{Vowels}]], which in turn could be written as
[\p{AtoZ}^^\p{Vowels}], where ^^ is a hypothetical "and not" operator.
I hope this makes sense.

There's one last problem... the \w (etc) macros.  They're stored in their
own manner, not using these UTF strings.  I think I need to add logic for
them as well in the reginclass() function, which will require an
additional set of tests like

  if (match && (
    (ANYOF_CLASS_TEST(n, ANYOF_MDIGIT)  && !isDIGIT_LC(c)) ||
    (ANYOF_CLASS_TEST(n, ANYOF_MNDIGIT) && isDIGIT_LC(c))  ||
    ...
  )) match = FALSE;

where ANYOF_MDIGIT and ANYOF_MNDIGIT stand for "must be a digit" and "must
not be a digit".
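
As a standalone illustration of that "must" / "must not" idea, here is
a small sketch in C that does not depend on Perl's headers.  The flag
names MUST_DIGIT and MUST_NOT_DIGIT and the function
apply_must_constraints() are hypothetical; they only model the proposed
ANYOF_MDIGIT / ANYOF_MNDIGIT tests, and isdigit() stands in for
isDIGIT_LC().

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical constraint bits; these are not Perl's real ANYOF flags. */
enum {
    MUST_DIGIT     = 1 << 0,   /* character must be a digit     */
    MUST_NOT_DIGIT = 1 << 1    /* character must not be a digit */
};

/* Given a tentative match result, veto it when the character violates
 * any "must" / "must not" constraint set in flags.  A failed match is
 * never turned back into a success. */
bool apply_must_constraints(bool match, unsigned flags, int c)
{
    assert(c >= 0 && c <= 255);   /* isdigit() needs an unsigned char value */

    if (match && (
        ((flags & MUST_DIGIT)     && !isdigit(c)) ||
        ((flags & MUST_NOT_DIGIT) &&  isdigit(c))
    ))
        match = false;

    return match;
}
```

The constraints can only turn a match into a non-match, which mirrors
how the - and & sigils behave for the \p classes.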


Below is a simple table of equivalencies for the \p stuff.  I think it is
correct.  If anyone sees an error, let me know.

\p{AtoZ}	[a-z]
\P{AtoZ}	[^a-z]
\p{Vowels}	[aeiou]
\P{Vowels}	[^aeiou]


\p{AtoZ}    \p{Vowels}	"+utf8::AtoZ +utf8::Vowels"
			[a-z] [aeiou]
			[a-z]

\p{AtoZ} && \p{Vowels}	"+utf8::AtoZ &utf8::Vowels"
			[a-z] && [aeiou]
			[aeiou]

\p{AtoZ} ^^ \p{Vowels}	"+utf8::AtoZ -utf8::Vowels"
			[a-z] ^^ [aeiou]
			[b-df-hj-np-tv-z]

\p{AtoZ}    \P{Vowels}	"+utf8::AtoZ !utf8::Vowels"
			[a-z] [^aeiou]
			[\000-\377]

\p{AtoZ} && \P{Vowels}	"+utf8::AtoZ -utf8::Vowels"
			[a-z] && [^aeiou]
			[b-df-hj-np-tv-z]

\p{AtoZ} ^^ \P{Vowels}	"+utf8::AtoZ &utf8::Vowels"
			[a-z] ^^ [^aeiou]
			[aeiou]

-- 
Jeff "japhy" Pinyan      japhy@pobox.com      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]


