[Perl/perl5] 6c5474: regexp.h: White-space only

From: Karl Williamson via perl5-changes
Date: October 16, 2020 13:02
Subject: [Perl/perl5] 6c5474: regexp.h: White-space only
Message ID: Perl/perl5/push/refs/heads/blead/c6565d-b18261@github.com
  Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: 6c5474e85cc9caea7746f3e3beeddc049603763c
      https://github.com/Perl/perl5/commit/6c5474e85cc9caea7746f3e3beeddc049603763c
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regexp.h

  Log Message:
  -----------
  regexp.h: White-space only

Indent preprocessor lines for clarity of program structure


  Commit: 69ffc8e3437c15ea4dbf61156c07656d09603ed5
      https://github.com/Perl/perl5/commit/69ffc8e3437c15ea4dbf61156c07656d09603ed5
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regen/unicode_constants.pl
    M unicode_constants.h

  Log Message:
  -----------
  regen/unicode_constants.pl: Add a couple constants

which will be needed in a future commit


  Commit: 3b58492077941aecb1bb81af35f727992854262c
      https://github.com/Perl/perl5/commit/3b58492077941aecb1bb81af35f727992854262c
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c: Clarify comment


  Commit: 5f162c354443de3e4d8e95f01acd019ab5bf32a9
      https://github.com/Perl/perl5/commit/5f162c354443de3e4d8e95f01acd019ab5bf32a9
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M pod/perldebguts.pod
    M regcomp.sym
    M regnodes.h

  Log Message:
  -----------
  regcomp.sym: Update node comments


  Commit: f97d9711c850a2acc3e6bc7156ce1e23f957b460
      https://github.com/Perl/perl5/commit/f97d9711c850a2acc3e6bc7156ce1e23f957b460
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M pod/perldebguts.pod
    M regcomp.sym
    M regnodes.h

  Log Message:
  -----------
  regcomp.sym: Make adjacent opcodes for 2 similar regnodes

These are often tested together.  By making them adjacent we can use
inRANGE.
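
A minimal sketch of why adjacency helps, not the code from this commit.
The opcode names and numbers below are hypothetical (the real ones are
generated into regnodes.h), and demo_in_range only imitates the idea of
a range test such as inRANGE: once the two opcodes are consecutive, two
equality tests collapse into one bounds check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcode numbering, for illustration only. */
enum demo_op {
    DEMO_OP_OTHER_A = 40,
    DEMO_OP_FIRST   = 41,   /* the two similar nodes ...        */
    DEMO_OP_SECOND  = 42,   /* ... now have adjacent opcodes    */
    DEMO_OP_OTHER_B = 43
};

/* True if lo <= c <= hi.  With unsigned arithmetic, (c - lo) wraps to a
 * large value when c < lo, so one comparison covers both bounds. */
static bool demo_in_range(unsigned c, unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    return (c - lo) <= (hi - lo);
}

/* Before: one equality test per opcode. */
static bool demo_is_either_old(unsigned op)
{
    return op == DEMO_OP_FIRST || op == DEMO_OP_SECOND;
}

/* After: a single range test over the adjacent opcodes. */
static bool demo_is_either_new(unsigned op)
{
    return demo_in_range(op, DEMO_OP_FIRST, DEMO_OP_SECOND);
}
```

Both helpers give the same answer for every opcode value; the range
form simply stays a single test if more adjacent opcodes are added.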


  Commit: a234542cd411731277682fb13ab4cf77e841b134
      https://github.com/Perl/perl5/commit/a234542cd411731277682fb13ab4cf77e841b134
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c: Simplify

The previous commit made the opcodes for two regops adjacent, so that we
can refer to them by a single range.  This commit takes advantage of
that change.


  Commit: 938090acbdbd9475a044786f75bbbcf4e64d3b49
      https://github.com/Perl/perl5/commit/938090acbdbd9475a044786f75bbbcf4e64d3b49
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M globvar.sym
    M pod/perldebguts.pod
    M regcomp.sym
    M regen/regcomp.pl
    M regnodes.h

  Log Message:
  -----------
  regnodes.h: Add two convenience bit masks

These categorize the many types of EXACT nodes, so that code can refer
to a particular subset of such nodes without having to list all of them
out.  This simplifies some 'if' statements, and makes updating things
easier.
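
As a rough illustration of the bit-mask idea (the node names, their
numbering, and the two masks below are all made up for this sketch and
are not the ones generated into regnodes.h): each node type gets one
bit, a mask names a subset, and membership becomes a single AND.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical family of "exact" node types, numbered consecutively. */
enum demo_exact {
    DEMO_EXACT_PLAIN = 0,       /* case-sensitive                     */
    DEMO_EXACT_FOLD,            /* case-insensitive                   */
    DEMO_EXACT_FOLD_LOCALE,     /* case-insensitive, locale rules     */
    DEMO_EXACT_UTF8_ONLY,       /* case-sensitive, UTF-8 targets only */
    DEMO_EXACT_FOLD_UTF8_ONLY,  /* case-insensitive, UTF-8 only       */
    DEMO_EXACT_COUNT
};

#define DEMO_BIT(op) (UINT32_C(1) << (op))

/* Subset 1: every node type that folds case. */
static const uint32_t demo_folding_mask =
      DEMO_BIT(DEMO_EXACT_FOLD)
    | DEMO_BIT(DEMO_EXACT_FOLD_LOCALE)
    | DEMO_BIT(DEMO_EXACT_FOLD_UTF8_ONLY);

/* Subset 2: every node type that can only match a UTF-8 target. */
static const uint32_t demo_utf8_only_mask =
      DEMO_BIT(DEMO_EXACT_UTF8_ONLY)
    | DEMO_BIT(DEMO_EXACT_FOLD_UTF8_ONLY);

/* Membership test: one AND instead of a list of == comparisons. */
static bool demo_in_mask(uint32_t mask, unsigned op)
{
    assert(op < DEMO_EXACT_COUNT);
    return (mask & DEMO_BIT(op)) != 0;
}
```

Adding a new node type then means updating the mask definitions in one
place rather than every 'if' that lists the members.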


  Commit: 95f5a9192aec499cfcb88b39a66919cc67ed6c7d
      https://github.com/Perl/perl5/commit/95f5a9192aec499cfcb88b39a66919cc67ed6c7d
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcomp.c
    M regexec.c

  Log Message:
  -----------
  regcomp.c,regexec.c: Simplify

This commit uses the new macros from the previous commit to simplify some
code.


  Commit: ad07094c004f4fa3ec82a1dee333e36963824ad1
      https://github.com/Perl/perl5/commit/ad07094c004f4fa3ec82a1dee333e36963824ad1
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c: Simplify

This was a case statement of every type of EXACTish node.  Instead,
there is a simple way to see if something is EXACTish.


  Commit: 8c112bb9153a54e615913d6c26876fb488703762
      https://github.com/Perl/perl5/commit/8c112bb9153a54e615913d6c26876fb488703762
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass.pl

  Log Message:
  -----------
  regen/regcharclass.pl: Change member to method

This will allow more flexibility in future commits: instead of using a
static format, the code can use one based on the input value.

The only non-whitespace change from this commit is the reordering of a
couple of tests; I'm not sure why that happened.


  Commit: e272994f04d07d5d7e5aecfa9e38a75f253ae5ce
      https://github.com/Perl/perl5/commit/e272994f04d07d5d7e5aecfa9e38a75f253ae5ce
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass.pl

  Log Message:
  -----------
  regen/regcharclass.pl: Move parameter to caller

This commit changes a sub in this file to be passed a new parameter.
This is in preparation for the value to be used in the caller.  No need
to derive it twice.


  Commit: fdc26d940a357441833197cb1b9b1d9a4420638e
      https://github.com/Perl/perl5/commit/fdc26d940a357441833197cb1b9b1d9a4420638e
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass.pl

  Log Message:
  -----------
  regen/regcharclass.pl: Use char instead of hex

This changes the generated macros to use a printable character or
mnemonic instead of a hex value.  This makes the macros easier to read.


  Commit: cf9d46fde42340bce59de919e4518881c97b3a85
      https://github.com/Perl/perl5/commit/cf9d46fde42340bce59de919e4518881c97b3a85
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regcomp.c
    M regen/regcharclass_multi_char_folds.pl

  Log Message:
  -----------
  regcharclass.h: multi-folds: Add some unfoldeds

Prior to this commit, the generated macros for dealing with multi-char
folds in UTF-8 strings only recognized completely folded strings.  This
commit changes that to add the uppercase for characters in the Latin1
range.  Hopefully an example will clarify.

The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i'
followed by U+0307: COMBINING DOT ABOVE.  But since we are doing /i
matching, an 'I' followed by U+307 should also match.  This commit
changes the macros to know this.  Before this, if the fold were entirely
ASCII, the macros would know all the possible combinations.  This commit
extends that to all code points < 256.  (Since there are no folds to the
upper Latin1 range, that really means all code points below 128.  But
making it general means it wouldn't have to be revised if a fold were
ever added to the upper half range.)
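
To make the U+0130 example concrete, here is a hand-written sketch; the
real macros are generated and look different, and the function name and
its accept_uppercase flag are invented for illustration.  U+0307 is
encoded in UTF-8 as the two bytes 0xCC 0x87, so the folded sequence is
'i' 0xCC 0x87, and the change amounts to also accepting 'I' as the
first byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the number of bytes matched (3) if s starts with the
 * multi-char fold of U+0130, else 0.
 *
 * accept_uppercase == false: only the fully folded "i" + U+0307
 *                            (roughly the old behavior)
 * accept_uppercase == true : also the unfolded "I" + U+0307
 *                            (roughly the new behavior) */
static size_t demo_is_i_dot_fold(const unsigned char *s, size_t len,
                                 bool accept_uppercase)
{
    assert(s != NULL);

    if (len < 3) {
        return 0;
    }

    bool first_ok = s[0] == 'i' || (accept_uppercase && s[0] == 'I');

    /* 0xCC 0x87 is the UTF-8 encoding of U+0307 COMBINING DOT ABOVE */
    if (first_ok && s[1] == 0xCC && s[2] == 0x87) {
        return 3;
    }
    return 0;
}
```

With accept_uppercase set, both "i\xCC\x87" and "I\xCC\x87" are
recognized; without it, only the first is.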

The reason to make this change is that it makes some future code less
complicated.  And it adds very little complexity to the generated
macros; less than the code it will save.  I originally thought it would
be more complex than it now turns out to be.  Much of that is because
the infrastructure has advanced since that decision.

I couldn't find any current places that this change will allow to be
simplified.  There could be if the macros were extended to do this on
all code points, not just the low ones.  I tried that, but the generated
macros were at least 400 lines longer than before.  That does add
significant complexity, so I backed that out.


  Commit: 70dc0cf11d00e208b9cf7abd3d31a83e245d2b5c
      https://github.com/Perl/perl5/commit/70dc0cf11d00e208b9cf7abd3d31a83e245d2b5c
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass_multi_char_folds.pl

  Log Message:
  -----------
  regen/regcharclass_multi_char_folds.pl: White space, comment only

Outdent and remove lines from changes in the previous commit.


  Commit: 114fc8b6cf6259d91d5d2c5cf7509f3f5e8cf35b
      https://github.com/Perl/perl5/commit/114fc8b6cf6259d91d5d2c5cf7509f3f5e8cf35b
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass_multi_char_folds.pl

  Log Message:
  -----------
  regen/regcharclass_multi_char_folds.pl: Use case fold

Prior to this commit, only the upper case of Latin1 characters was dealt
with.  But we really want case folding, and there are a few other
characters that fold to Latin1.  This commit acknowledges them.


  Commit: ef06e9363645bc516c39d68fefe501585464b2e2
      https://github.com/Perl/perl5/commit/ef06e9363645bc516c39d68fefe501585464b2e2
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass.pl

  Log Message:
  -----------
  regen/regcharclass.pl: Rmv unused macro


  Commit: 4fad5f9fde90649f8a92ff93e775cf814b118f19
      https://github.com/Perl/perl5/commit/4fad5f9fde90649f8a92ff93e775cf814b118f19
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcharclass.h
    M regen/regcharclass.pl

  Log Message:
  -----------
  regen/regcharclass.pl: White space only

This does some line wrapping, etc


  Commit: 59142b8bd98e53318226c235b25118b63b24c99f
      https://github.com/Perl/perl5/commit/59142b8bd98e53318226c235b25118b63b24c99f
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M charclass_invlists.h
    M lib/unicore/uni_keywords.pl
    M regen/mk_invlists.pl
    M uni_keywords.h

  Log Message:
  -----------
  charclass_invlists.h: Add some inverse folds.

The MICRO SIGN folds to above the Latin1 range, the only character that
does so in Unicode (or is ever likely to).  This requires special handling.
This commit reduces some of the need for that handling by creating the
inversion map for it, which can be used in certain instances in pattern
matching, without having to have a special case.  The actual use of this
will come in a future commit.
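
For readers unfamiliar with the term, the following is a generic sketch
of an inversion map, not the table this commit generates.  The array
names, the three-range layout, and the use of 0 to mean "maps to
itself" are assumptions made for the example; the one hard fact used is
that U+00B5 MICRO SIGN folds to U+03BC GREEK SMALL LETTER MU.  A sorted
array of range starts is paired with one value per range, and a lookup
is a binary search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Range starts, ascending.  Range i covers starts[i] up to, but not
 * including, starts[i + 1]; the last range is open-ended. */
static const uint32_t demo_starts[] = { 0x0000, 0x00B5, 0x00B6 };

/* One value per range.  In this sketch 0 means "maps to itself". */
static const uint32_t demo_values[] = { 0,      0x03BC, 0      };

#define DEMO_RANGES (sizeof demo_starts / sizeof demo_starts[0])

/* Binary search for the index of the range containing cp. */
static size_t demo_find_range(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = DEMO_RANGES;        /* the answer lies in [lo, hi) */

    assert(demo_starts[0] == 0);    /* so every cp is in some range */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (demo_starts[mid] <= cp) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Table-driven lookup: no special case for MICRO SIGN in the code. */
static uint32_t demo_fold(uint32_t cp)
{
    uint32_t v = demo_values[demo_find_range(cp)];
    return v != 0 ? v : cp;
}
```

The point of the table form is that the exceptional character lives in
the data, so callers do not need an 'if' for it.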


  Commit: fa374e04d2e5a2ced966b6becb893db92d1030ec
      https://github.com/Perl/perl5/commit/fa374e04d2e5a2ced966b6becb893db92d1030ec
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c: Rename local variable; change type

I found myself getting confused, as this most likely was named before
UTF-8 came along.  It actually is just a byte, plus an out-of-bounds
value.

While I'm at it, I'm also changing the type from I32 to the Perl
equivalent of the C99 'int_fast16_t', as it doesn't need to be 32 bits,
and we should let the compiler choose what size is the most efficient
that still meets our needs.
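
A small sketch of the "byte plus an out-of-bounds value" pattern using
the C99 type directly.  The sentinel value, the function names, and the
meaning given to the sentinel here (no specific byte is known, so the
quick test cannot rule anything out) are assumptions for illustration,
not taken from regexec.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: lies outside the 0..255 range of a byte. */
#define DEMO_NO_BYTE (-1)

/* int_fast16_t is at least 16 bits wide, which is enough for 0..255
 * plus the sentinel; the compiler picks whatever width is fastest. */
static int_fast16_t demo_first_byte(const unsigned char *s, size_t len)
{
    return len > 0 ? (int_fast16_t) s[0] : DEMO_NO_BYTE;
}

/* Quick pre-check: can a match start at a position holding 'actual'?
 * If no particular byte is required, nothing can be ruled out. */
static bool demo_could_start_with(int_fast16_t wanted, unsigned char actual)
{
    assert(wanted >= DEMO_NO_BYTE && wanted <= 255);
    return wanted == DEMO_NO_BYTE || wanted == actual;
}
```

A plain unsigned char could not hold the sentinel, and a fixed 32-bit
type asks for more width than the value range needs.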


  Commit: f5c1b2d841363d1e077a3d27bc7721ad9c0eaf0d
      https://github.com/Perl/perl5/commit/f5c1b2d841363d1e077a3d27bc7721ad9c0eaf0d
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c: Change variable name in a function

This makes it like a corresponding variable.


  Commit: 4414955b8d69f301cec98246b177ffcc2eb9b061
      https://github.com/Perl/perl5/commit/4414955b8d69f301cec98246b177ffcc2eb9b061
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c: Store expression in a variable

This makes the text look cleaner, and prepares for a future commit,
where we will want to change the variable (which can't be done with the
expression).


  Commit: b1826163632422d276c89895546bd113c8f2cfe6
      https://github.com/Perl/perl5/commit/b1826163632422d276c89895546bd113c8f2cfe6
  Author: Karl Williamson <khw@cpan.org>
  Date:   2020-10-16 (Fri, 16 Oct 2020)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c: Do some extra folding

Generally we have to wait until runtime to do folding for regnodes that
are locale dependent, because we don't know what the locale at runtime
will be, and hence what the folds will be.

But UTF-8 locales all have the same folding behavior, no matter what the
locale is, with the exception of two fold pairs in Turkish.  (Lithuanian
too, but Perl doesn't support that language's special folding rules.)
UTF-8 is the only locale type that Perl supports that can represent code
points above 255.  Therefore we do know at compile time what the
above-255 folds are (again excepting the two in Turkish), and so we can
do the folding then.  But only if both the components are above 255.
There are a few folds that cross the 255/256 boundary, and they must be
deferred.

However, there are two instances where there are three characters that
fold together in which two of them are above 255, and the third isn't.
That the two high ones are equivalent under /i is known at compile time,
and so that equivalence can be stated then.


Compare: https://github.com/Perl/perl5/compare/c6565d4b08aa...b18261636324


