Fix JDK regex support #888

Stephan202 · 2023-11-03T07:31:42Z

Summary of changes:

Fix the test resources introduced by Resolves improper anchoring of patternProperties #783 by moving the regex fields, such that the test framework does not skip them with a "Not a valid test case" message.
Revert the changes introduced by Corrects Java's failure to match an end anchor when using find() #815, as those are simply incorrect.
Extend the test coverage introduced by Corrects Java's failure to match an end anchor when using find() #815 by (a) updating the test regexes to match their intended semantics and (b) include a few negative test cases.
Partially revert the change introduced by Resolves improper anchoring of patternProperties #783: the use of Matcher#find() is correct, but the hasStartAnchor and hasEndAnchor logic introduces more bugs than the issue it aims to solve.
Extend the test coverage introduced by Resolves improper anchoring of patternProperties #783, by introducing regexes that are not covered by the hasStartAnchor/hasEndAnchor logic.
Update the Joni regular expression integration such that it passes more of the test cases.
Disable the "trailing newline" test cases, as these are currently not handled correctly by either regex implementation.

Stephan202

Added some context. Still thinking about the "trailing newline" cases, but didn't find any solution yet.

This PR resolves #872.

Suggested commit message:

Fix JDK regex support

Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
  fields, such that the test framework does not skip them with a "Not a
  valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
  regexes to match their intended semantics and (b) include a few 
  negative test cases.
- Partially revert the change introduced by #783: the use of
  `Matcher#find()` is correct, but the `hasStartAnchor` and 
  `hasEndAnchor` logic introduces more bugs than the issue it aims to
  solve.
- Extend the test coverage introduced by #783, by introducing regexes
  that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
  more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not 
  handled correctly by either regex implementation.

Resolves #872.

Stephan202 · 2023-11-03T07:32:55Z

src/main/java/com/networknt/schema/regex/JDKRegularExpression.java

    }

    @Override
    public boolean matches(String value) {
-        Matcher matcher = this.pattern.matcher(value);
-        return matcher.find() && (!this.hasStartAnchor || 0 == matcher.start()) && (!this.hasEndAnchor || matcher.end() == value.length());
+        return this.pattern.matcher(value).find();


This matches the original suggestion by @tvahrst in #782, and seems to be closest to what the spec intents (modulo the newline case).

Stephan202 · 2023-11-03T07:33:24Z

src/main/java/com/networknt/schema/regex/JoniRegularExpression.java

@@ -21,7 +21,7 @@ class JoniRegularExpression implements RegularExpression {
            .replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]");

        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
-        this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
+        this.pattern = new Regex(bytes, 0, bytes.length, Option.SINGLELINE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);


I did not find the Joni documentation, but this change makes more test cases pass; please carefully review.

Stephan202 · 2023-11-03T07:34:39Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

@@ -8,40 +8,50 @@ class Issue814Test {

    @Test
    void jdkTypePattern() {
-        JDKRegularExpression ex = new JDKRegularExpression("^list|date|time|string|enum|int|double|long|boolean|number$");
+        JDKRegularExpression ex = new JDKRegularExpression("^(list|date|time|string|enum|int|double|long|boolean|number)$");


As mentioned by @kool79 in #872, if a user wishes only to match these words, then the parentheses are required.

Stephan202 · 2023-11-03T07:35:36Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

    }

    @Test
    void jdkOptionsPattern() {
-        JDKRegularExpression ex = new JDKRegularExpression("^\\d*|[a-zA-Z_]+$");
+        JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");


Note that I changed * to +: because ^\d* is vacuously true: it matches the empty string at the start of any input. (Without this the internal9 test case below would also yield true.)

Stephan202 · 2023-11-03T07:36:41Z

src/test/resources/draft2020-12/issue495.json

@@ -1,61 +1,105 @@
 [
    {
        "description": "issue495 using ECMA-262",
-        "regex": "ecma-262",


I moved this fields, because regex is defined in the TestSpec class, not inside TestCase. Without this the tests were skipped...

Stephan202 · 2023-11-03T07:46:17Z

src/test/resources/draft2020-12/issue495.json

-            "pattern": "^[a-z]{1,10}$",
+            "patternProperties": {
+              "^[a-z]{1,10}$": true,
+              "(^1$)": true
+            },


The pattern keyword didn't seem to work, so I switched to patternProperties; let me know if something else was intended. Also added an extra test case to show why inspecting the regex will not work in general.

Stephan202 · 2023-11-03T07:48:19Z

src/test/resources/draft2020-12/issue495.json

            {
                "description": "trailing newline",
+                "regex": "jdk",
                "data": { "aaa\n": 3 },
-                "valid": false
+                "valid": false,
+                "disabled": true,
+                "comment": "Test fails"
+            },


This is the only test case on which this PR regresses. Both regex implementation suffer from the same issue:

jshell> Pattern.compile("^a$").matcher("a\n").find() $1 ==> true jshell> Pattern.compile("(^a$)").matcher("a\n").find() $2 ==> true

Okay, I found why this happens. The documentation states:

The regular expression $ matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.

So in theory we could avoid this by transforming all $ anchors by prepending a negative lookahead, like so:

jshell> Pattern.compile("^a(?!\n|\r\n|\r|\\u0085|\\u2028|\\u2029)$").matcher("a\n").find() $1 ==> false

But that will require careful parsing of the input regex, because the $ could be escaped, or inside a \Q...\E section, or... Quite frankly, that way lies madness; would like to keep it out of this PR.

kool79 · 2023-11-06T11:37:35Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

+        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
+        assertTrue(ex.matches("0external"));
        assertTrue(ex.matches("internal"));
        assertTrue(ex.matches("external"));
        assertTrue(ex.matches("external_gte"));
        assertTrue(ex.matches("force"));
+        assertFalse(ex.matches("internal9"));


@Stephan202 Actually the ^\\d+|[a-zA-Z_]+$ is the same as ^\\d|[a-zA-Z_]$
Explanation:
you should read the ^\\d+|[a-zA-Z_]+$ like: matches to ^\\d+ or to [a-zA-Z_]+$.
In 'human' words: starts with digit or ends with alpha-or-underscore.
Actually it is identical to: matches to ^\\d or to [a-zA-Z_]$ or ^\\d|[a-zA-Z_]$

Right you are!

kool79 · 2023-11-06T12:03:51Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

    }

    @Test
    void joniOptionsPattern() {
-        JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$");
+        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
+        assertTrue(ex.matches("0external"));
        assertTrue(ex.matches("internal"));
        assertTrue(ex.matches("external"));
        assertTrue(ex.matches("external_gte"));


I think that similar examples can be removed, and the next can be added:
(matches left part: starts with digit:)
assertTrue - "5"
assertTrue - "55"
assertTrue - "5%"
(matches right part: ends with alpha-or-undescore:)
assertTrue - "a"
assertTrue - "aa"
assertTrue - "%a"
assertTrue - "%_"
(matches both parts)
assertTrue - "55aa"
assertTrue - "5%%a"

assertFalse: ""
assertFalse: "%"
assertFalse: "a5"

Sure, lemme replace them.

Summary of changes: - Fix the test resources introduced by #783 by moving the `regex` fields, such that the test framework does not skip them with a "Not a valid test case" message. - Revert the changes introduced by #815, as those are simply incorrect. - Extend the test coverage introduced by #815 by (a) updating the test regexes to match their intended semantics and (b) include a few negative test cases. - Partially revert the change introduced by #783: the use of `Matcher#find()` is correct, but the `hasStartAnchor` and `hasEndAnchor` logic introduces more bugs than the issue it aims to solve. - Extend the test coverage introduced by #783, by introducing regexes that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic. - Update the Joni regular expression integration such that it passes more of the test cases. - Disable the "trailing newline" test cases, as these are currently not handled correctly by either regex implementation.

Stephan202

Thanks for taking a look @kool79!

I squashed the following changes into the commit:

diff --git a/src/test/java/com/networknt/schema/regex/Issue814Test.java b/src/test/java/com/networknt/schema/regex/Issue814Test.java
index 3e7d42d..15ab31c 100644
--- a/src/test/java/com/networknt/schema/regex/Issue814Test.java
+++ b/src/test/java/com/networknt/schema/regex/Issue814Test.java
@@ -21,13 +21,19 @@ class Issue814Test {
 
     @Test
     void jdkOptionsPattern() {
-        JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertTrue(ex.matches("internal"));
-        assertFalse(ex.matches("internal9"));
+        JDKRegularExpression ex = new JDKRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
     @Test
@@ -45,13 +51,19 @@ class Issue814Test {
 
     @Test
     void joniOptionsPattern() {
-        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("internal"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertFalse(ex.matches("internal9"));
+        JoniRegularExpression ex = new JoniRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
 }

(I squashed, because looking at the master branch, it appears that the maintainer of this repository uses GitHub's squash-and-rebase feature without cleaning up the commit message auto-generated by GitHub, which creates a rather messy result if there's more than one commit.)

Stephan202 · 2023-11-06T19:26:51Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

+        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
+        assertTrue(ex.matches("0external"));
        assertTrue(ex.matches("internal"));
        assertTrue(ex.matches("external"));
        assertTrue(ex.matches("external_gte"));
        assertTrue(ex.matches("force"));
+        assertFalse(ex.matches("internal9"));


Right you are!

Stephan202 · 2023-11-06T19:33:13Z

src/test/java/com/networknt/schema/regex/Issue814Test.java

    }

    @Test
    void joniOptionsPattern() {
-        JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$");
+        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
+        assertTrue(ex.matches("0external"));
        assertTrue(ex.matches("internal"));
        assertTrue(ex.matches("external"));
        assertTrue(ex.matches("external_gte"));


Sure, lemme replace them.

stevehu · 2023-11-17T02:28:14Z

@Stephan202 @kool79 Thanks a lot for the hard work. Do you think we are ready to merge? @fdutton Do you have any concerns?

Stephan202 · 2023-11-17T06:13:51Z

@stevehu I think that not supporting the trailing newline case is an acceptable compromise given the larger context; I'm not planning other changes at this time 👍

(We should trigger the build workflow, though, to make sure that the tests also pass on GHA.)

codecov-commenter · 2023-11-17T13:09:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e86c817) 78.48% compared to head (636e696) 78.52%.
Report is 3 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #888      +/-   ##
============================================
+ Coverage     78.48%   78.52%   +0.03%     
+ Complexity     1213     1207       -6     
============================================
  Files           120      120              
  Lines          3891     3883       -8     
  Branches        746      740       -6     
============================================
- Hits           3054     3049       -5     
  Misses          546      546              
+ Partials        291      288       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

stevehu · 2023-12-01T21:16:34Z

Do you feel comfortable to merge this PR? Anyone has some concerns? Thanks.

Stephan202 · 2023-12-02T10:26:45Z

@stevehu yes, I feel comfortable doing so 👍 (But don't have permission to ;).)

Stephan202 commented Nov 3, 2023

View reviewed changes

Stephan202 mentioned this pull request Nov 3, 2023

Please revert fix from #815 (pattern validation) #872

Closed

kool79 reviewed Nov 6, 2023

View reviewed changes

Stephan202 commented Nov 6, 2023

View reviewed changes

stevehu requested a review from fdutton November 17, 2023 02:26

stevehu merged commit 0924dfe into networknt:master Dec 2, 2023
4 checks passed

Stephan202 deleted the bugfix/fix-jdk-regex-support branch December 2, 2023 21:40

justin-tay mentioned this pull request Jun 13, 2024

PatternValidator does not correctly validate input with newlines #495

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix JDK regex support #888

Fix JDK regex support #888

Stephan202 commented Nov 3, 2023

Stephan202 left a comment

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

Stephan202 Nov 3, 2023

kool79 Nov 6, 2023

Stephan202 Nov 6, 2023

kool79 Nov 6, 2023

Stephan202 Nov 6, 2023

Stephan202 left a comment

Stephan202 Nov 6, 2023

Stephan202 Nov 6, 2023

stevehu commented Nov 17, 2023

Stephan202 commented Nov 17, 2023

codecov-commenter commented Nov 17, 2023

stevehu commented Dec 1, 2023

Stephan202 commented Dec 2, 2023

Fix JDK regex support #888

Fix JDK regex support #888

Conversation

Stephan202 commented Nov 3, 2023

Stephan202 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Stephan202 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevehu commented Nov 17, 2023

Stephan202 commented Nov 17, 2023

codecov-commenter commented Nov 17, 2023

Codecov Report

stevehu commented Dec 1, 2023

Stephan202 commented Dec 2, 2023