Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JDK regex support #888

Merged
merged 1 commit into from
Dec 2, 2023
Merged

Fix JDK regex support #888

merged 1 commit into from
Dec 2, 2023

Conversation

Stephan202
Copy link
Contributor

Summary of changes:

Copy link
Contributor Author

@Stephan202 Stephan202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some context. Still thinking about the "trailing newline" cases, but didn't find any solution yet.

This PR resolves #872.

Suggested commit message:

Fix JDK regex support

Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
  fields, such that the test framework does not skip them with a "Not a
  valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
  regexes to match their intended semantics and (b) include a few 
  negative test cases.
- Partially revert the change introduced by #783: the use of
  `Matcher#find()` is correct, but the `hasStartAnchor` and 
  `hasEndAnchor` logic introduces more bugs than the issue it aims to
  solve.
- Extend the test coverage introduced by #783, by introducing regexes
  that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
  more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not 
  handled correctly by either regex implementation.

Resolves #872.

}

@Override
public boolean matches(String value) {
Matcher matcher = this.pattern.matcher(value);
return matcher.find() && (!this.hasStartAnchor || 0 == matcher.start()) && (!this.hasEndAnchor || matcher.end() == value.length());
return this.pattern.matcher(value).find();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the original suggestion by @tvahrst in #782, and seems to be closest to what the spec intents (modulo the newline case).

@@ -21,7 +21,7 @@ class JoniRegularExpression implements RegularExpression {
.replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]");

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
this.pattern = new Regex(bytes, 0, bytes.length, Option.SINGLELINE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find the Joni documentation, but this change makes more test cases pass; please carefully review.

@@ -8,40 +8,50 @@ class Issue814Test {

@Test
void jdkTypePattern() {
JDKRegularExpression ex = new JDKRegularExpression("^list|date|time|string|enum|int|double|long|boolean|number$");
JDKRegularExpression ex = new JDKRegularExpression("^(list|date|time|string|enum|int|double|long|boolean|number)$");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned by @kool79 in #872, if a user wishes only to match these words, then the parentheses are required.

}

@Test
void jdkOptionsPattern() {
JDKRegularExpression ex = new JDKRegularExpression("^\\d*|[a-zA-Z_]+$");
JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I changed * to +: because ^\d* is vacuously true: it matches the empty string at the start of any input. (Without this the internal9 test case below would also yield true.)

@@ -1,61 +1,105 @@
[
{
"description": "issue495 using ECMA-262",
"regex": "ecma-262",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this fields, because regex is defined in the TestSpec class, not inside TestCase. Without this the tests were skipped...

Comment on lines -7 to +9
"pattern": "^[a-z]{1,10}$",
"patternProperties": {
"^[a-z]{1,10}$": true,
"(^1$)": true
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern keyword didn't seem to work, so I switched to patternProperties; let me know if something else was intended. Also added an extra test case to show why inspecting the regex will not work in general.

Comment on lines 78 to +85
{
"description": "trailing newline",
"regex": "jdk",
"data": { "aaa\n": 3 },
"valid": false
"valid": false,
"disabled": true,
"comment": "Test fails"
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only test case on which this PR regresses. Both regex implementation suffer from the same issue:

jshell> Pattern.compile("^a$").matcher("a\n").find()
$1 ==> true
jshell> Pattern.compile("(^a$)").matcher("a\n").find()
$2 ==> true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I found why this happens. The documentation states:

The regular expression $ matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.

So in theory we could avoid this by transforming all $ anchors by prepending a negative lookahead, like so:

jshell> Pattern.compile("^a(?!\n|\r\n|\r|\\u0085|\\u2028|\\u2029)$").matcher("a\n").find()
$1 ==> false

But that will require careful parsing of the input regex, because the $ could be escaped, or inside a \Q...\E section, or... Quite frankly, that way lies madness; would like to keep it out of this PR.

Comment on lines 48 to 54
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
assertTrue(ex.matches("force"));
assertFalse(ex.matches("internal9"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stephan202 Actually the ^\\d+|[a-zA-Z_]+$ is the same as ^\\d|[a-zA-Z_]$
Explanation:
you should read the ^\\d+|[a-zA-Z_]+$ like: matches to ^\\d+ or to [a-zA-Z_]+$.
In 'human' words: starts with digit or ends with alpha-or-underscore.
Actually it is identical to: matches to ^\\d or to [a-zA-Z_]$ or ^\\d|[a-zA-Z_]$

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right you are!

}

@Test
void joniOptionsPattern() {
JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$");
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that similar examples can be removed, and the next can be added:
(matches left part: starts with digit:)
assertTrue - "5"
assertTrue - "55"
assertTrue - "5%"
(matches right part: ends with alpha-or-undescore:)
assertTrue - "a"
assertTrue - "aa"
assertTrue - "%a"
assertTrue - "%_"
(matches both parts)
assertTrue - "55aa"
assertTrue - "5%%a"

assertFalse: ""
assertFalse: "%"
assertFalse: "a5"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, lemme replace them.

Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
  fields, such that the test framework does not skip them with a "Not a
  valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
  regexes to match their intended semantics and (b) include a few
  negative test cases.
- Partially revert the change introduced by #783: the use of
  `Matcher#find()` is correct, but the `hasStartAnchor` and
  `hasEndAnchor` logic introduces more bugs than the issue it aims to
  solve.
- Extend the test coverage introduced by #783, by introducing regexes
  that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
  more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not
  handled correctly by either regex implementation.
Copy link
Contributor Author

@Stephan202 Stephan202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look @kool79!

I squashed the following changes into the commit:

diff --git a/src/test/java/com/networknt/schema/regex/Issue814Test.java b/src/test/java/com/networknt/schema/regex/Issue814Test.java
index 3e7d42d..15ab31c 100644
--- a/src/test/java/com/networknt/schema/regex/Issue814Test.java
+++ b/src/test/java/com/networknt/schema/regex/Issue814Test.java
@@ -21,13 +21,19 @@ class Issue814Test {
 
     @Test
     void jdkOptionsPattern() {
-        JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertTrue(ex.matches("internal"));
-        assertFalse(ex.matches("internal9"));
+        JDKRegularExpression ex = new JDKRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
     @Test
@@ -45,13 +51,19 @@ class Issue814Test {
 
     @Test
     void joniOptionsPattern() {
-        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("internal"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertFalse(ex.matches("internal9"));
+        JoniRegularExpression ex = new JoniRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
 }

(I squashed, because looking at the master branch, it appears that the maintainer of this repository uses GitHub's squash-and-rebase feature without cleaning up the commit message auto-generated by GitHub, which creates a rather messy result if there's more than one commit.)

Comment on lines 48 to 54
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
assertTrue(ex.matches("force"));
assertFalse(ex.matches("internal9"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right you are!

}

@Test
void joniOptionsPattern() {
JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$");
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, lemme replace them.

@stevehu stevehu requested a review from fdutton November 17, 2023 02:26
@stevehu
Copy link
Contributor

stevehu commented Nov 17, 2023

@Stephan202 @kool79 Thanks a lot for the hard work. Do you think we are ready to merge? @fdutton Do you have any concerns?

@Stephan202
Copy link
Contributor Author

@stevehu I think that not supporting the trailing newline case is an acceptable compromise given the larger context; I'm not planning other changes at this time 👍

(We should trigger the build workflow, though, to make sure that the tests also pass on GHA.)

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e86c817) 78.48% compared to head (636e696) 78.52%.
Report is 3 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #888      +/-   ##
============================================
+ Coverage     78.48%   78.52%   +0.03%     
+ Complexity     1213     1207       -6     
============================================
  Files           120      120              
  Lines          3891     3883       -8     
  Branches        746      740       -6     
============================================
- Hits           3054     3049       -5     
  Misses          546      546              
+ Partials        291      288       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stevehu
Copy link
Contributor

stevehu commented Dec 1, 2023

Do you feel comfortable to merge this PR? Anyone has some concerns? Thanks.

@Stephan202
Copy link
Contributor Author

@stevehu yes, I feel comfortable doing so 👍 (But don't have permission to ;).)

@stevehu stevehu merged commit 0924dfe into networknt:master Dec 2, 2023
4 checks passed
@Stephan202 Stephan202 deleted the bugfix/fix-jdk-regex-support branch December 2, 2023 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants