
[perl.git] branch smoke-me/khw-anyofr created.v5.31.4-29-g61e79c9cca

From:
Karl Williamson
Date:
September 21, 2019 18:58
Subject:
[perl.git] branch smoke-me/khw-anyofr created.v5.31.4-29-g61e79c9cca
Message ID:
E1iBkaI-0002mA-2v@git.dc.perl.space
In perl.git, the branch smoke-me/khw-anyofr has been created

<https://perl5.git.perl.org/perl.git/commitdiff/61e79c9cca09ebbf8f85d6a41bc8b46e8a11c742?hp=0000000000000000000000000000000000000000>

        at  61e79c9cca09ebbf8f85d6a41bc8b46e8a11c742 (commit)

- Log -----------------------------------------------------------------
commit 61e79c9cca09ebbf8f85d6a41bc8b46e8a11c742
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 12:24:59 2019 -0600

    handy.h: Avoid compiler warnings for withinCOUNT()
    
    If a parameter to this function is unsigned, gcc, at least, generates a
    comparison-always-true warning for the asserts on the parameters.
    Silence these by casting to an NV.  Any extra machine instructions will
    be gone from non-DEBUGGING builds.  The value in an NV won't necessarily
    be exact, but all the assertions care about is the sign, which is
    guaranteed by C11 standard 6.3.1.4 item 2.
    
    This technique was the idea of Tomasz Konojacki.
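
A minimal sketch of the cast-before-assert idea described above. The macro name WITHIN_COUNT, the wrapper functions, and the use of double in place of Perl's NV are assumptions made for illustration; this is not the text of handy.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the macro discussed above; not the handy.h
 * source.  True when l <= c <= l + n.
 *
 * Writing assert((l) >= 0) can draw an always-true warning from gcc
 * whenever l is unsigned.  Converting to a floating type first preserves
 * the sign, so the assertion still catches negative arguments while
 * giving the compiler nothing to warn about.  'double' stands in for
 * Perl's NV here.  With NDEBUG defined the asserts expand to nothing. */
#define WITHIN_COUNT(c, l, n)                                   \
    (assert((double) (l) >= 0),                                 \
     assert((double) (n) >= 0),                                 \
     (uintmax_t) ((c) - (l)) <= (uintmax_t) (n))

/* Thin wrappers so the macro can be exercised with both signednesses. */
static int within_count_unsigned(uint32_t c, uint32_t l, uint32_t n)
{
    return WITHIN_COUNT(c, l, n);
}

static int within_count_signed(int c, int l, int n)
{
    return WITHIN_COUNT(c, l, n);
}
```

A value below l makes the subtraction wrap (or go negative and then convert) to a very large unsigned number, so the single comparison rejects it.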

commit e8b53ab65034bbc7f719be77a97244717f824544
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 12:23:49 2019 -0600

    handy.h: Rmv duplicated assert in inRANGE()
    
    This assertion is done in the macro that is called to do the real work.

commit 6742dc8813ec412d8b2828978f5a6413c14d730e
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Sep 20 09:51:13 2019 -0600

    Move Perl_regnext to regexec.c
    
    This function is moved from regcomp.c, which uses it only during
    compilation, to the file that calls it incessantly at match time.
    Experience has shown that compilation can be less efficient without
    affecting overall performance.
    
    Now the compiler has full knowledge of this function in the translation
    unit that performance is critical in, and can hopefully perform better
    optimizations.

commit 4578802d9bb86268c6175b1860cfd731ae17ae1a
Author: Karl Williamson <khw@cpan.org>
Date:   Fri Sep 20 09:45:29 2019 -0600

    regnext: Add some branch predictor hints
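
The commit does not say which branches were annotated, so the following is only a sketch of what branch hints on a "next node" helper can look like. The toy_node type, toy_next(), and the choice of which tests are marked unlikely are all assumptions; the LIKELY/UNLIKELY macros are defined locally on top of the GCC/Clang builtin __builtin_expect.

```c
#include <stddef.h>
#include <stdint.h>

/* Local hint macros; on compilers without the builtin they are no-ops. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(cond)   __builtin_expect(!!(cond), 1)
#  define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#  define LIKELY(cond)   (cond)
#  define UNLIKELY(cond) (cond)
#endif

/* Hypothetical, simplified node: only the relative link to the next one. */
struct toy_node {
    uint16_t next_off;   /* 0 means "end of chain" in this sketch */
};

/* Return the node that follows p, or NULL at the end of the chain.
 * The hints tell the compiler that the common case is a valid node
 * with a non-zero offset, so that path can be laid out as fall-through. */
static const struct toy_node *toy_next(const struct toy_node *p)
{
    if (UNLIKELY(p == NULL))
        return NULL;
    if (UNLIKELY(p->next_off == 0))
        return NULL;
    return p + p->next_off;
}
```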

commit 80dbde9c00ff97acb873e942e20d605b2324bee7
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 22:18:02 2019 -0600

    Change data lookup from a macro to a function

commit 1b6f8d1f05afdb97bbefb9aa8dae855569f45578
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 21:54:03 2019 -0600

    regen/regcomp.pl: Enforce all long nodes being last

commit 48bcf4f0db54b4b1da2819540824895b0352eeed
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 20:34:17 2019 -0600

    regcomp.sym: Move regnodes to end that don't use next_off
    
    Most regnodes use the next_off field in a regnode structure, to link to
    the next one in the chain.  But some require more than the 16 bits it
    contains, so they use a different, 32 bit, field.
    
    Currently, there is a lookup array to distinguish between the types, but
    that becomes unnecessary if all of one sort are grouped before or after
    all of the other.
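
To illustrate why grouping makes the lookup array unnecessary, here is a sketch with a made-up set of opcodes. The enum, the table, and the function names are hypothetical and far smaller than the real regnode list; the point is only that once every long-offset opcode sorts after every short-offset one, a single comparison gives the same answer as the table.

```c
#include <stddef.h>

/* Hypothetical opcodes.  Those that need the 32-bit link field are
 * deliberately placed after all those that use the 16-bit next_off. */
enum toy_op {
    TOY_EXACT,
    TOY_ANYOF,
    TOY_STAR,
    TOY_LONGJMP,     /* first opcode that needs the wider field */
    TOY_BRANCHJ,
    TOY_OP_COUNT,
    TOY_FIRST_LONG = TOY_LONGJMP
};

/* "Before": one table entry per opcode. */
static const unsigned char toy_uses_long[TOY_OP_COUNT] = { 0, 0, 0, 1, 1 };

static int is_long_by_table(enum toy_op op)
{
    return toy_uses_long[op];
}

/* "After": the ordering itself carries the information. */
static int is_long_by_order(enum toy_op op)
{
    return op >= TOY_FIRST_LONG;
}

/* The kind of check a generator script could enforce: once a long
 * opcode has been seen, no short opcode may follow it. */
static int toy_order_is_valid(void)
{
    int seen_long = 0;
    for (int op = 0; op < TOY_OP_COUNT; op++) {
        if (toy_uses_long[op])
            seen_long = 1;
        else if (seen_long)
            return 0;
    }
    return 1;
}
```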

commit e14f7abefec4acb737b8d1c80a652edbb72fb447
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 09:51:52 2019 -0600

    Add ANYOFRb regnode
    
    This is like the ANYOFR regnode added in the previous commit, but all
    code points in the range it matches are known to have the same UTF-8
    start byte.  That means it can't match UTF-8 invariant characters,
    like ASCII, because the "start" byte is different for each one, so it
    could only match a range of 1, and the compiler wouldn't generate this
    node for that; it uses an EXACT instead.
    
    Pattern matching can rule out most code points by looking at the first
    character of their UTF-8 representation, before having to convert from
    UTF-8.
    
    On ASCII platforms, this simple comparison rules out all but 64 of the
    2-byte UTF-8 characters.  For 3-byte characters it's up to 4096, and
    for 4-byte, 2**18, so the test is less effective for higher code
    points.
    
    I believe that most UTF-8 patterns that otherwise would compile to
    ANYOFR will instead compile to this, as I can't envision real life
    applications wanting to match large single ranges.  Even the 2048
    surrogates all have the same first byte.
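
The start-byte filter described above can be sketched as follows. Everything here is illustrative: toy_range_node, the helper names, and the simplified decoder (which assumes well-formed input of at most 4 bytes on an ASCII platform) are assumptions, not the regexec.c code.

```c
#include <stddef.h>
#include <stdint.h>

/* First byte of the UTF-8 encoding of cp (ASCII platforms, cp <= 0x10FFFF). */
static unsigned char utf8_start_byte(uint32_t cp)
{
    if (cp < 0x80)    return (unsigned char) cp;
    if (cp < 0x800)   return (unsigned char) (0xC0 | (cp >> 6));
    if (cp < 0x10000) return (unsigned char) (0xE0 | (cp >> 12));
    return (unsigned char) (0xF0 | (cp >> 18));
}

/* Simplified decoder: assumes s points at a well-formed sequence. */
static uint32_t utf8_decode(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    if (s[0] < 0xE0)
        return ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    if (s[0] < 0xF0)
        return ((uint32_t) (s[0] & 0x0F) << 12)
             | ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return ((uint32_t) (s[0] & 0x07) << 18)
         | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

/* Hypothetical node: a range whose members all share one start byte. */
struct toy_range_node {
    uint32_t      lo;          /* lowest code point matched */
    uint32_t      delta;       /* highest minus lowest */
    unsigned char start_byte;  /* common first UTF-8 byte */
};

/* Reject on the first byte alone; decode only when it agrees. */
static int toy_match_utf8(const struct toy_range_node *n,
                          const unsigned char *s)
{
    if (s[0] != n->start_byte)
        return 0;
    return (utf8_decode(s) - n->lo) <= n->delta;
}
```

For example, U+0400 through U+043F all begin with the byte 0xD0, so a node for that range can dismiss any other character without decoding it.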

commit 32e500677286d4c3674ffa03d62ce7613091b45a
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:03:04 2019 -0600

    Add ANYOFR regnode
    
    This matches a single range of code points.  It is both faster and
    smaller than other ANYOF-type nodes, requiring, after set-up, a single
    subtraction and conditional branch.
    
    The vast majority of Unicode properties match a single range, though
    most of these are not likely to be used in real world applications.  But
    things like [ij] are a single range, and those are quite commonly
    encountered.  This matches them more efficiently than a bitmap would,
    and doesn't require the space for one either.
    
    The flags field is used to store the minimum matchable start byte for
    UTF-8 strings, and is ignored for non-UTF-8 targets.  This, like ANYOFH
    nodes which have the same mechanism, allows for quick weeding out of
    many possible matches without having to convert the UTF-8 to its
    corresponding code point.
    
    This regnode packs the 32 bit argument with 20 bits for the minimum code
    point the node matches, and 12 bits for the maximum range.  Values
    outside those simply won't compile to this regnode, instead going to one
    of the ANYOFH flavors.  This is sufficient to match all of Unicode
    except for the final (private use) 65K plane.
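
A sketch of the 20/12-bit packing and the subtract-and-compare match described in this message. The function names are hypothetical, and which end of the 32-bit word holds which field is an assumption made for the example.

```c
#include <stdint.h>

#define TOY_BASE_BITS  20   /* minimum code point */
#define TOY_DELTA_BITS 12   /* highest minus lowest */

/* Can the range lo..hi be represented in the packed form at all?
 * If not, the compiler would have to fall back to another node type. */
static int toy_range_fits(uint32_t lo, uint32_t hi)
{
    return lo <= hi
        && lo < ((uint32_t) 1 << TOY_BASE_BITS)
        && (hi - lo) < ((uint32_t) 1 << TOY_DELTA_BITS);
}

/* Assumed layout: delta in the high 12 bits, base in the low 20.
 * Call only when toy_range_fits(lo, hi) is true. */
static uint32_t toy_pack(uint32_t lo, uint32_t hi)
{
    return ((hi - lo) << TOY_BASE_BITS) | lo;
}

static uint32_t toy_base(uint32_t packed)
{
    return packed & (((uint32_t) 1 << TOY_BASE_BITS) - 1);
}

static uint32_t toy_delta(uint32_t packed)
{
    return packed >> TOY_BASE_BITS;
}

/* After unpacking, matching is one unsigned subtraction and one
 * comparison: code points below the base wrap to a huge value. */
static int toy_match(uint32_t packed, uint32_t cp)
{
    return (cp - toy_base(packed)) <= toy_delta(packed);
}
```

With a 20-bit base the largest representable minimum is 0xFFFFF, which is why ranges starting in the last plane (0x100000 and up) do not fit.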

commit 68fce4be73292ccb7562ce2323599ea3f9cb329a
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 16:04:03 2019 -0600

    regexec.c: Rmv some unnecessary casts
    
    The called macro does the cast, and this makes it more legible

commit 39afa0b34edeeddbf81d39e1f3cc850c7a680a05
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 15:47:51 2019 -0600

    regcomp.c: Use variables initialized to macro results
    
    instead of the macros.  This is in preparation for the next commit.

commit 6ca09dd248f23ff8080e12d6e0953e32a9a724c5
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 14:20:59 2019 -0600

    regcomp.c: Add parameter to static function
    
    This further decouples this function from knowing details of the calling
    structure, by passing this detail in.

commit a17fff09f7b7bc1dad24240e222872f1ca53e591
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 13:20:42 2019 -0600

    t/re/anyof.t: Add a test
    
    This makes sure a non-folding above-Latin1 character is tested.

commit 793530e5f0a893ee7c30f3f30d52fb518dbde3e3
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 19 14:38:39 2019 -0600

    regcomp.c: Comments/white-space
    
    Included is outdenting code whose enclosing block was removed in the
    previous commit.

commit 0b090ae59d53fd2142528cb4e75103cceced1adc
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 13:12:51 2019 -0600

    XXX warning tests,Prefer EXACTish regnodes to ANYOFH nodes
    
    ANYOFH nodes (that match code points above 255) are smaller than regular
    ANYOF nodes because they don't have a 256-bit bitmap.  But the
    disadvantage of them over EXACT nodes is that the characters encountered
    must first be converted from UTF-8 to code point.  The difference is
    less clearcut with /i, because typically, currently, the UTF-8 must also
    be converted to code point in order to fold them.  But the EXACTFish
    node doesn't have an inversion list to do lookup in, and occupies
    less space, because it doesn't have inversion list data attached to it.
    
    Also there is a bug in using ANYOFH under /l, as wide character warnings
    should be emitted if the locale isn't a UTF-8 one.
    
    The reason this change hasn't been made before (by me anyway) is that
    the old way avoided upgrading the pattern to UTF-8.  But having thought
    about this for a long time, to match this node, the target string must
    be in UTF-8 anyway, and having a UTF8ness mismatch slows down pattern
    matching, as things have to be continually converted, and reconverted
    after backtracking.

commit ce58ff8f5fccf78bd6eb31434b0355879fb4f08c
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:45:55 2019 -0600

    t/re/anyof.t: Fix highest range tests
    
    Previously we had infinity minus 1, but infinity should be beyond the
    range, and the highest isn't infinity - 1, but the highest legal code
    point.

commit f5885118e8fa7774b1c8665ba7f95a77faaac807
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:41:41 2019 -0600

    t/re/anyof.t: Remove duplicate test
    
    These are covered by the single code point tests.

commit 6bce510e309bd39feac4f31d77baaab809bc6a75
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:34:23 2019 -0600

    t/re/anyof.t: Remove invalid test
    
    One shouldn't be able to specify an infinite code point.  The tests have
    the conceit that one can specify a range's upper limit as infinity, but
    that is just shorthand for the range being unbounded.

commit bbbeafbaa84aad6d251b7a7e2e4f088e8d81334a
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 10:00:40 2019 -0600

    t/re/anyof.t: Revise test
    
    to make it correspond more with the test that precedes it

commit 172234ae211558a15cca5267538bc7896ad71588
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 18 12:31:11 2019 -0600

    re/anyof.t: Clarify failing message
    
    When a test fails, an extra test is run to output debugging info; this
    will cause the planned number of tests to be wrong, which will output an
    extra, confusing message.  This adds an explanation that the number is
    expected to be wrong, hence not to worry.

commit b79281936953271fc0bfe1f82ea9b1c9ba67bcf1
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 12 20:19:07 2019 -0600

    Allow some optimizations of qr/(?[...])/
    
    Prior to this commit, this construct always returned an ANYOF node, even
    if it could be optimized into something else.

commit e1806218594bc839621053978e3337451f5b95d9
Author: Karl Williamson <khw@cpan.org>
Date:   Thu May 30 20:57:27 2019 -0600

    regcomp.c: Add invlist_lowest()
    
    This function hides the invlist implementation from the calling code,
    and will be called in more than one place in the future.
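
As a rough illustration of hiding a representation behind an accessor, here is a toy inversion list: a sorted array in which each element starts a run that alternates between "in the set" and "not in the set". The struct, its layout, and the function signature are assumptions for this sketch; Perl's actual inversion lists are stored differently, and the real function's behavior on an empty list is not shown in this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inversion list.  elems[0] starts the first matched range,
 * elems[1] is the first code point after it, elems[2] starts the next
 * matched range, and so on. */
struct toy_invlist {
    const uint32_t *elems;
    size_t          len;
};

/* Store the lowest matched code point in *lowest and return 1,
 * or return 0 if the list matches nothing.  Callers never need to
 * know how the list is laid out. */
static int toy_invlist_lowest(const struct toy_invlist *list,
                              uint32_t *lowest)
{
    if (list->len == 0)
        return 0;
    *lowest = list->elems[0];
    return 1;
}
```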

commit 09ee51f694d263179f61edc3cc5a4bbdda062299
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 12 21:06:45 2019 -0600

    regcomp.c: Code for qr/(?[...]) handle restart
    
    There is an existing mechanism for code to realize it needs to restart
    parsing from the beginning, say because it needs to upgrade to UTF-8.
    The code for /(?[...])/ did not participate in this.  Currently I don't
    know of any case where it needs to, though perhaps there is some very
    hard to reproduce case where branch instructions start needing to
    handle more than 16 bits, but I kind of doubt it.  Anyway, the next few
    commits introduce the possibility.

commit 37dbfde0fd87d4d0f0f6f5372611951b5d6a9217
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Jun 26 13:02:35 2019 -0600

    XXX Configure backtrace

-----------------------------------------------------------------------

-- 
Perl5 Master Repository


