
regexp/utf8 spec

September 17, 2000 16:13
I'm trying to work out what the intended specification is for the
interactions between regexps and UTF8 [0]. For this purpose I'll
refer to 8-bits-per-character strings as 'binary' and utf8 strings as
'text'. When referring to '$string =~ /$re/', I'll refer to $string
as the 'target' and $re as the 'pattern'.

1. We compile a pattern such that it may be used to match either text
or binary. When compiling we ignore whether we are in a 'use utf8'
scope, but we need to know how the pattern string is stored to know
how to interpret the pattern [1].

2. Since the pattern does not know what type of data it will be asked
to match, each binary/text pair of regexp node types (e.g. ANYOF and
ANYOFUTF8) should be collapsed to a single node type.

3. It is probably a useful optimisation for any node that refers to a
string (e.g. EXACT) to be expanded to hold both the binary and text
forms of that string. In that model, we'd initially fill only the
slot appropriate to the form of the pattern, and lazily fill the
other slot if needed while matching.

4. When matching, the nature of the target determines whether we
perform a binary match or a text match.

5. Magic variables that represent a matched substring will have the
same type as the target.

6. Magic variables that represent offsets into the target (@+ etc)
should be useful for substr() and the like.
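
To make points 3 and 4 concrete, here is a rough sketch in C of an
EXACT-style node with two slots, where the text slot is filled lazily
and the target's type picks which slot is compared. Everything named
here (exact_node, exact_from_binary, exact_fill_text, exact_match_at)
is hypothetical and is not Perl's actual regnode layout or API. It also
assumes that each byte of a binary string is the code point of one
character, so that the text form is simply the UTF-8 encoding of those
code points; the reverse (text-to-binary) fill is left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node: NOT Perl's real regnode layout. */
typedef struct {
    unsigned char *bin;   /* one byte per character, or NULL if unfilled */
    size_t bin_len;
    unsigned char *text;  /* UTF-8 encoded form, or NULL if unfilled */
    size_t text_len;
} exact_node;

/* Compile from a binary pattern: fill only the binary slot. */
static exact_node exact_from_binary(const unsigned char *s, size_t len)
{
    exact_node n = {0};
    n.bin = malloc(len ? len : 1);
    if (n.bin) {
        memcpy(n.bin, s, len);
        n.bin_len = len;
    }
    return n;
}

/* Lazily fill the text slot. Assumption: each binary byte is a code
 * point below 256, so it encodes to at most two UTF-8 bytes. */
static int exact_fill_text(exact_node *n)
{
    if (n->text) return 1;            /* already filled */
    if (!n->bin) return 0;            /* nothing to convert from */
    unsigned char *out = malloc(n->bin_len * 2 + 1);
    if (!out) return 0;
    size_t j = 0;
    for (size_t i = 0; i < n->bin_len; i++) {
        unsigned char c = n->bin[i];
        if (c < 0x80) {
            out[j++] = c;
        } else {
            out[j++] = (unsigned char)(0xC0 | (c >> 6));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    n->text = out;
    n->text_len = j;
    return 1;
}

/* Match at the start of the target; the target's type selects the slot. */
static int exact_match_at(exact_node *n, const unsigned char *target,
                          size_t len, int target_is_text)
{
    const unsigned char *p;
    size_t plen;
    if (target_is_text) {
        if (!exact_fill_text(n)) return 0;
        p = n->text;
        plen = n->text_len;
    } else {
        if (!n->bin) return 0;        /* text-to-binary fill omitted */
        p = n->bin;
        plen = n->bin_len;
    }
    return plen <= len && memcmp(p, target, plen) == 0;
}
```

With this shape a pattern that only ever sees binary targets never
pays for the conversion, and the node type itself does not need a
separate UTF8 variant.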

Does that look right so far? Have I missed any major issues?

[0] I don't know a lot about this stuff, but I use 'UTF8' to encompass
any representation that does not always use 8 bits per character.

[1] Currently Perl_pregcomp is passed char*s pointing to the beginning
and end of the string. I propose that it should receive an SV* instead.
