Commit
Minimisation should not combine end states with distinct sets of endids.
This was adapted from the group capture integration branch, but it isn't specific to group capture.

Currently, metadata on end states does not prevent merging. This can lead to false positives when endids are used to detect which of the original regexes matched after several regexes are compiled and then unioned into a single DFA.

This adds a pass before the main minimisation loop that splits the ECs (equivalence classes) of end states with distinct sets of end IDs, which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat, since it previously checked that the end states WERE merged. Unfortunately, the other checks are somewhat weakened: I removed the part checking that `endids[j] == j+1` because that ordering is no longer guaranteed. The patterns aren't anchored, so some examples will match more than one of them, and the endids collected can be nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which checks the minimal case: when the two regexes /^abc$/ and /^ab*c$/ are combined with a distinct endid set on each, matching "ac", "abc", and "abbc" yields one or both endids as appropriate.

Ideally, we'd have a property test that checks that, for any arbitrary set of regex patterns, the mapping from matching inputs to endids is the same whether the patterns are run individually or combined into a single FSM, determinised, and minimised.
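The splitting pass described above can be illustrated with a small standalone sketch. This is not the code from this commit: `struct state`, the bitmask representation of end ID sets (which assumes IDs below 64), and `initial_partition` are all hypothetical names chosen for illustration. The idea is only that end states start out in the same equivalence class when their end ID sets are identical, so a later refinement loop, which only ever splits classes, can never merge states whose end IDs differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state record for illustration; not libfsm's representation. */
struct state {
    bool is_end;
    uint64_t endids; /* bit i set => end ID i is attached (assumes IDs < 64) */
};

/*
 * Assign an initial equivalence class to every state, writing it to ec[i].
 * Class 0 is reserved for non-end states. End states share a class only
 * when their end ID sets are identical. Returns the number of class slots
 * in use, counting the reserved class 0.
 */
size_t
initial_partition(const struct state *states, size_t n, size_t *ec)
{
    size_t count = 1; /* class 0 is reserved for non-end states */
    size_t i, j;

    assert(states != NULL && ec != NULL);

    for (i = 0; i < n; i++) {
        if (!states[i].is_end) {
            ec[i] = 0;
            continue;
        }

        /* Reuse the class of an earlier end state with the same end IDs. */
        for (j = 0; j < i; j++) {
            if (states[j].is_end && states[j].endids == states[i].endids) {
                ec[i] = ec[j];
                break;
            }
        }

        /* No earlier end state matched: open a new class. */
        if (j == i) {
            ec[i] = count++;
        }
    }

    return count;
}
```

The quadratic scan keeps the sketch short. A real implementation would more likely sort or hash the end ID sets, and would not be limited to 64 IDs.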
1 parent fdd18f4, commit 455b4d1
Showing 3 changed files with 412 additions and 11 deletions.