[perl.git] branch smoke-me/khw-anyofr created.v5.31.3-201-g3f671efa44

From: Karl Williamson
Date: September 19, 2019 23:24
Subject: [perl.git] branch smoke-me/khw-anyofr created.v5.31.3-201-g3f671efa44
Message ID: E1iB5mh-0001xZ-3k@git.dc.perl.space

In perl.git, the branch smoke-me/khw-anyofr has been created

<https://perl5.git.perl.org/perl.git/commitdiff/3f671efa445b7c17c8b545f96d2ad6e011eac273?hp=0000000000000000000000000000000000000000>

        at  3f671efa445b7c17c8b545f96d2ad6e011eac273 (commit)

- Log -----------------------------------------------------------------
commit 3f671efa445b7c17c8b545f96d2ad6e011eac273
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:22:19 2019 -0600

    Add ANYOFRb regnode
    
    This is like the ANYOFR regnode added in the previous commit, but all
    code points in the range it matches are known to have the same first
    UTF-8 start byte.  That means it can't match UTF-8 invariant characters,
    like ASCII, because the "start" byte is different for each one, so it
    could only match a range of 1, and the compiler wouldn't generate this
    node for that, instead using an EXACT.
    
    Pattern matching can rule out most code points by looking at the first
    byte of their UTF-8 representation, before having to convert from
    UTF-8.
    
    On ASCII platforms, this simple comparison rules out all but 64 of the
    2-byte UTF-8 characters.  For 3-byte characters, up to 4096 remain, and
    for 4-byte ones, 2**18, so the test is less effective for higher code
    points.
    
    I believe that most UTF-8 patterns that otherwise would compile to
    ANYOFR will instead compile to this, as I can't envision real life
    applications wanting to match large single ranges.  Even the 2048
    surrogates all have the same first byte.
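
As a rough illustration of the start-byte idea described above, the following sketch computes the first UTF-8 byte of a code point on an ASCII platform and checks whether a range shares a single start byte. The helper names (`utf8_start_byte`, `range_has_single_start_byte`, `may_match`) are hypothetical and are not taken from the Perl source; the encoding is applied mechanically, so surrogates are encoded like any other 3-byte code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First byte of the UTF-8 encoding of cp (ASCII platforms only).
 * Hypothetical helper for illustration; not a function from the Perl source. */
static inline uint8_t utf8_start_byte(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return (uint8_t) cp;                  /* invariant: the byte is the character */
    if (cp < 0x800)   return (uint8_t) (0xC0 | (cp >> 6));  /* 2-byte sequence, 64 code points per start byte */
    if (cp < 0x10000) return (uint8_t) (0xE0 | (cp >> 12)); /* 3-byte sequence, 4096 per start byte */
    return (uint8_t) (0xF0 | (cp >> 18));                   /* 4-byte sequence, 2**18 per start byte */
}

/* True if every code point in [lo, hi] shares one start byte.  Start bytes
 * never decrease as code points increase, so comparing the endpoints suffices. */
static inline bool range_has_single_start_byte(uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    return utf8_start_byte(lo) == utf8_start_byte(hi);
}

/* Cheap prefilter: a UTF-8 target character can match only if its first
 * byte equals the node's start byte; no decoding is needed to reject it. */
static inline bool may_match(const uint8_t *s, uint8_t node_start_byte)
{
    return s[0] == node_start_byte;
}
```

Under these assumptions the surrogate range U+D800..U+DFFF passes the check (every start byte is 0xED), while a two-character ASCII range such as [ij] does not, consistent with the commit message.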

commit d23c7575b36dd35bec17947b835aeb878dd8e36b
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:05:06 2019 -0600

    Add ANYOFR regnode
    
    This matches a single range of code points.  It is both faster and
    smaller than other ANYOF-type nodes, requiring, after set-up, a single
    subtraction and conditional branch.
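
The "single subtraction and conditional branch" can be sketched with the usual unsigned-arithmetic range test. This is a minimal illustration, not the code in regexec.c; the function name `in_range` and its parameters are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if lo <= cp <= lo + delta, using one unsigned subtraction and one
 * comparison.  If cp < lo, the subtraction wraps around to a very large
 * value, which then fails the comparison against delta. */
static inline bool in_range(uint32_t cp, uint32_t lo, uint32_t delta)
{
    return (uint32_t) (cp - lo) <= delta;
}
```

For [ij], `lo` would be 'i' and `delta` would be 1, so 'i' and 'j' pass while 'h' and 'k' fail.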
    
    The vast majority of Unicode properties match a single range, though
    most of these are not likely to be used in real world applications.  But
    things like [ij] are a single range, and those are quite commonly
    encountered.  This matches them more efficiently than a bitmap would,
    and doesn't require the space for one either.
    
    The flags field is used to store the minimum matchable start byte for
    UTF-8 strings, and is ignored for non-UTF-8 targets.  This, like ANYOFH
    nodes which have the same mechanism, allows for quick weeding out of
    many possible matches without having to convert the UTF-8 to its
    corresponding code point.
    
    This regnode packs the 32-bit argument with 20 bits for the minimum code
    point the node matches, and 12 bits for the size of the range above that
    minimum.  Ranges that don't fit simply won't compile to this regnode,
    instead going to one of the ANYOFH flavors.  This is sufficient to match
    all of Unicode except for the final (private use) 65K plane.
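
The 20/12 split of the 32-bit argument might look like the sketch below. The commit message does not say which field occupies the low bits, so placing the minimum code point in the low 20 bits is an assumption made for illustration, as are all of the names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BASE_BITS  20u                                  /* bits for the minimum code point */
#define DELTA_BITS 12u                                  /* bits for (maximum - minimum) */
#define BASE_MAX   ((UINT32_C(1) << BASE_BITS) - 1)     /* 0xFFFFF */
#define DELTA_MAX  ((UINT32_C(1) << DELTA_BITS) - 1)    /* 4095 */

/* True if [lo, hi] can be represented in the packed form. */
static inline bool range_fits(uint32_t lo, uint32_t hi)
{
    return lo <= hi && lo <= BASE_MAX && hi - lo <= DELTA_MAX;
}

/* Pack the range: delta in the high 12 bits, minimum in the low 20 bits
 * (assumed layout). */
static inline uint32_t pack_range(uint32_t lo, uint32_t hi)
{
    assert(range_fits(lo, hi));
    return ((hi - lo) << BASE_BITS) | lo;
}

static inline uint32_t packed_base(uint32_t arg)  { return arg & BASE_MAX; }
static inline uint32_t packed_delta(uint32_t arg) { return arg >> BASE_BITS; }
```

With this layout a minimum of 0x100000 or above (the final plane) does not fit in 20 bits, and neither does any range spanning more than 4096 code points, which matches the fallback to ANYOFH described above.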

commit 06d19438047bdc7019b9dfc6f7c85382c9c81961
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:04:03 2019 -0600

    regexec.c: Rmv some unnecessary casts
    
    The called macro does the cast, and this makes it more legible

commit 1ba382328f2b84fb0b6e3dec534e4b94a0914a28
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:03:04 2019 -0600

    l

commit e307470474b0314590c60ed21896d783b001b75a
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 15:47:51 2019 -0600

    regcomp.c: Use variables initialized to macro results
    
    instead of the macros.  This is in preparation for the next commit.

commit c7d40f5b9c1aa5db8324262ebdce428a277b6a0d
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 14:20:59 2019 -0600

    regcomp.c: Add parameter to static function
    
    This further decouples this function from knowing details of the calling
    structure, by passing this detail in.

commit 55623ead428e6527bb23531a451909eea8c4249f
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 13:20:42 2019 -0600

    t/re/anyof.t: Add a test
    
    This makes sure a non-folding above-Latin1 character is tested.

commit 3a9c470fcccd0bfdeb09cca96dcab6103741603e
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 14:38:39 2019 -0600

    regcomp.c: Comments/white-space
    
    Included is outdenting code whose enclosing block was removed in the
    previous commit.

commit 8f4c71a4d30e80ae80c46877c913cd791ffef5f9
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 13:12:51 2019 -0600

    XXX warning tests, Prefer EXACTish regnodes to ANYOFH nodes
    
    ANYOFH nodes (that match code points above 255) are smaller than regular
    ANYOF nodes because they don't have a 256-bit bitmap.  But their
    disadvantage compared with EXACT nodes is that the characters encountered
    must first be converted from UTF-8 to code point.  The difference is
    less clear-cut with /i, because typically, currently, the UTF-8 must also
    be converted to code point in order to be folded.  But the EXACTFish
    node doesn't have an inversion list to do lookup in, and occupies
    less space, because it doesn't have inversion list data attached to it.
    
    Also there is a bug in using ANYOFH under /l, as wide character warnings
    should be emitted if the locale isn't a UTF-8 one.

commit 465d5bf992430128d7977f603e3d215251a2001d
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:45:55 2019 -0600

    t/re/anyof.t: Fix highest range tests
    
    Previously we had infinity minus 1, but infinity should be beyond the
    range, and the highest isn't infinity - 1, but the highest legal code
    point.

commit 744bc2e985b22560a7a742569d2128c31000d21b
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:41:41 2019 -0600

    t/re/anyof.t: Remove duplicate test
    
    This is covered by the single code point tests.

commit 12a6b4d307ce737b4515b0714fd0f7d10368f42f
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:34:23 2019 -0600

    t/re/anyof.t: Remove invalid test
    
    One shouldn't be able to specify an infinite code point.  The tests have
    the conceit that one can specify a range's upper limit as infinity, but
    that is just shorthand for the range being unbounded.

commit 349aa3cbb3fe93e171b5b00a98365235299a7599
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:31:11 2019 -0600

    re/anyof.t: Clarify failing message
    
    When a test fails, an extra test is run to output debugging info; this
    will cause the planned number of tests to be wrong, which will output an
    extra, confusing message.  This adds an explanation that the number is
    expected to be wrong, hence not to worry.

commit d3f35546fc92fa86225a23b02d2636977b709c32
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 12 20:19:07 2019 -0600

    Allow some optimizations of qr/(?[...])/
    
    Prior to this commit, this construct always returned an ANYOF node, even
    if it could be optimized into something else.

commit 72cc33a64b846054aa82071093a9d6c5512c1685
Author: Karl Williamson <khw@cpan.org>
Date:   Thu May 30 20:57:27 2019 -0600

    regcomp.c: Add invlist_lowest()
    
    This function hides the invlist implementation from the calling code,
    and will be called in more than one place in the future.
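
For readers unfamiliar with inversion lists, here is a generic sketch of the data structure and of a "lowest element" accessor that hides the representation from callers. It is not Perl's implementation: the struct, the function names, and the way an empty list is reported are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative inversion list: a sorted array in which each even-indexed
 * element begins a run of code points that are in the set, and each
 * odd-indexed element begins a run that is not. */
typedef struct {
    const uint32_t *elems;
    size_t len;
} invlist_sketch;

/* Membership test: count the boundaries that are <= cp with a binary
 * search; an odd count means cp lies inside an "in the set" run. */
static inline bool sketch_contains(const invlist_sketch *l, uint32_t cp)
{
    size_t lo = 0, hi = l->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (l->elems[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) == 1;
}

/* Lowest code point in the set.  *found is set to false for an empty set,
 * in which case the return value is 0 and carries no meaning. */
static inline uint32_t sketch_lowest(const invlist_sketch *l, bool *found)
{
    *found = l->len > 0;
    return *found ? l->elems[0] : 0;
}
```

Callers that only need the lowest member go through the accessor and never index the array themselves, which is the kind of decoupling the commit message describes.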

commit fe96c9a9be08ca89d660da6c926a3fc899e7f27a
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 12 21:06:45 2019 -0600

    regcomp.c: Make code for qr/(?[...])/ handle restart
    
    There is an existing mechanism for code to realize it needs to restart
    parsing from the beginning, say because it needs to upgrade to UTF-8.
    The code for /(?[...])/ did not participate in this.  Currently I don't
    know of any case where it needs to, though perhaps there is some very
    hard-to-reproduce case in which branch instructions start needing to
    handle more than 16 bits, but I kind of doubt it.  Anyway, the next few
    commits introduce the possibility.

commit 381ccc56fa4f1c2af7d023c697dbc458876bc52c
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jun 26 13:02:35 2019 -0600

    XXX Configure backtrace

commit 67eebd462a83809f2128e75b714b7ab6292d3770
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Sep 15 16:08:13 2019 -0600

    regcomp.sym: Fix comment
    
    The length of an EXACTish node is the same bits as the FLAGS field in
    other nodes; it doesn't "precede the length", as previously claimed.

-----------------------------------------------------------------------

-- 
Perl5 Master Repository


