Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Augment Regex extensibility point for better perf and span-based matching #59629

Closed
stephentoub opened this issue Sep 27, 2021 · 12 comments · Fixed by #65473
Closed

Augment Regex extensibility point for better perf and span-based matching #59629

stephentoub opened this issue Sep 27, 2021 · 12 comments · Fixed by #65473
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Sep 27, 2021

Background and motivation

There are several related Regex efforts planned for .NET 7:

  1. Adding a regex source generator (merged into runtime in Add initial regex source generator #59186)
  2. Adding a DFA-based engine (merged into runtime in Add RegexOptions.NonBacktracking #60607)

On top of that, there are several long-standing issues to be addressed:

  1. Add support for matching with ReadOnlySpan<char> inputs in addition to string inputs.
  2. Allow regex matching to be zero-alloc. Today, finding the locations of matches requires allocating Match objects (and supporting data structures) for every match (on my 64-bit machine, a successful Match(string) with no subcaptures allocates three objects for a total of ~170b).
  3. Continual push to improve performance of matching. Regex optimization opportunities #1349)

Each of these brings with it some difficulties that, altogether, suggest we augment how Regex supports extensibility. Regex enables compilation of regular expressions via an extensibility point (such compilation includes in-memory with RegexOptions.Compiled, Regex.CompileToAssembly on .NET Framework, and now the RegexGenerator in .NET 7). Regex exposes several protected fields (!), one of which is of type RegexRunnerFactory. When the public APIs on Regex need to perform a match, they do so via a RegexRunner, and if no RegexRunner is available (e.g. we’ve not performed a match yet), that RegexRunnerFactory is used to create one. RegexRunner then has several abstract methods, the most important ones being bool FindFirstChar() and void Go(). Thus, when we compile a regex, we need to produce a RegexRunner-derived type that overrides FindFirstChar/Go, a RegexRunnerFactory-derived type that overrides CreateInstance(), and a Regex-derived type that sets the protected factory field to an instance of this RegexRunnerFactory. This is exactly what the source generator spits out for a given use of [RegexGenerator(…)].

Now, there are a few relevant issues here:

  1. We can’t store a span onto a RegexRunner, and as FindFirstChar/Go are parameterless, we don’t have any way to pass a span into them. Thus, we currently lack a means for having the code generated in overrides of these from operating on ReadOnlySpan<char>. To match an input ReadOnlySpan<char>, we could convert the input ReadOnlySpan<char> into a string, but obviously that defeats the point.
  2. The original split between FindFirstChar and Go was likely predicated on the notion that there was no benefit to combining them because the employed matching algorithm logically splits them in this way: find a place that could match, try to match, find the next place, try to match, etc. But that is not how a DFA-based engine operates, where every character moves to the next state, whether that character is ultimately part of the match or not. This split is thus very artificial for a DFA-based matcher, which essentially has all of the logic in Go, with FindFirstChar effectively being a nop.
  3. With our current engine, it’s effectively zero cost to determine the bounds of a match once we know that there is a match. With the DFA-based engine, we don’t know the starting position of a match when we determine there is a match, only the ending position, and thus we need to do more work to determine the bounds of the match. These costs could be avoided for IsMatch operations, where the starting position isn’t relevant, but whereas the internal loop today that invokes FindFirstChar/Go knows whether this is IsMatch or not, that information isn’t available to overrides of FindFirstChar/Go, which means today it’s unavailable to source generated code.
  4. Even for the existing implementation, the split between FindFirstChar and Go leads to various inefficiencies. For example, every time we find a possible match location, we need to make two virtual calls, one to FindFirstChar and one to Go. Each of those calls needs to load the relevant state from fields. We end up duplicating some matching logic because FindFirstChar wants to validate some things about the match before handing it off to Go in order to avoid those exact costs through the outer bump loop. Etc.

These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see. We thus need just enough new surface area to support this. We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make an IsMatch(ReadOnlySpan<char>) work without recompilation, but it would involving converting the span to a string.

This issue mostly replaces #23602. However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.

API Proposal

We should:

  1. Expose a new virtual RegexRunner.Scan method:
namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected virtual void Scan(ReadOnlySpan<char> text);
    }
}

The base implementation would effectively contain the existing bumpalong loop that invokes both FindFirstChar and Go. It would compare the provided span against runtext and set runtext to span.ToString() only if they differ (as noted below, this base implementation will rarely be used and will be there purely for compat).

  1. Override this Scan method on all our implementations and in the source generator-generated code: the only time you’d get the base implementation would be if you were using an old .NET Framework-generated CompileToAssembly assembly, and the only time it would lead to inefficiencies is if you then used those in combination with the new span-based public APIs. The implementations of FindFirstChar/Go for the interpreter, RegexOptions.Compiled, and source generator would all be combined into this single method. NonBacktracking effectively already has its combined.

  2. Make the existing FindFirstChar/Go/InitTrackCount methods virtual instead of abstract. The first two would throw NotImplementedExceptions; the third would return 0. None of our implementations would need to override these methods anymore, just Scan.

namespace System.Text.RegularExpressions
{
  public class RegexRunner
  {
-    protected abstract bool FindFirstChar();
-    protected abstract void Go();
-    protected abstract void InitTrackCount();
+    protected virtual bool FindFirstChar() => throw new NotImplementedException(); //default implementation
+    protected virtual void Go() => throw new NotImplementedException(); //default implementation
+    protected virtual void InitTrackCount() => return 0; //default implementation
  }
}
  1. Add new public IsMatch methods to Regex (these are the same set of overloads as for string, but with a ReadOnlySpan<char> input):
namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public bool IsMatch(ReadOnlySpan<char> input);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

API Usage

bool matched = Regex.IsMatch(spanInput, "pattern");

Risks

No response

@stephentoub stephentoub added api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Text.RegularExpressions labels Sep 27, 2021
@stephentoub stephentoub added this to the 7.0.0 milestone Sep 27, 2021
@ghost
Copy link

ghost commented Sep 27, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

There are several related Regex efforts planned for .NET 7:

  1. Adding a regex source generator (merged into runtime in Add initial regex source generator #59186)
  2. Adding a DFA-based engine (currently being worked on in runtimelab)

On top of that, there are several long-standing issues to be addressed:

  1. Add support for matching with ReadOnlySpan<char> inputs in addition to string inputs.
  2. Allow regex matching to be zero-alloc. Today, finding the locations of matches requires allocating Match objects (and supporting data structures) for every match (on my 64-bit machine, a successful Match(string) with no subcaptures allocates three objects for a total of ~170b).
  3. Continual push to improve performance of matching. Regex optimization opportunities #1349)

Each of these brings with it some difficulties that, altogether, suggest we augment how Regex supports extensibility. Regex enables compilation of regular expressions via an extensibility point (such compilation includes in-memory with RegexOptions.Compiled, Regex.CompileToAssembly on .NET Framework, and now the RegexGenerator in .NET 7). Regex exposes several protected fields (!), one of which is of type RegexRunnerFactory. When the public APIs on Regex need to perform a match, they do so via a RegexRunner, and if no RegexRunner is available (e.g. we’ve not performed a match yet), that RegexRunnerFactory is used to create one. RegexRunner then has several abstract methods, the most important ones being bool FindFirstChar() and void Go(). Thus, when we compile a regex, we need to produce a RegexRunner-derived type that overrides FindFirstChar/Go, a RegexRunnerFactory-derived type that overrides CreateInstance(), and a Regex-derived type that sets the protected factory field to an instance of this RegexRunnerFactory. This is exactly what the source generator spits out for a given use of [RegexGenerator(…)].

Now, there are a few relevant issues here:

  1. We can’t store a span onto a RegexRunner, and as FindFirstChar/Go are parameterless, we don’t have any way to pass a span into them. Thus, we currently lack a means for having the code generated in overrides of these from operating on ReadOnlySpan<char>. To match an input ReadOnlySpan<char>, we could convert the input ReadOnlySpan<char> into a string, but obviously that defeats the point.
  2. The original split between FindFirstChar and Go was likely predicated on the notion that there was no benefit to combining them because the employed matching algorithm logically splits them in this way: find a place that could match, try to match, find the next place, try to match, etc. But that is not how a DFA-based engine operates, where every character moves to the next state, whether that character is ultimately part of the match or not. This split is thus very artificial for a DFA-based matcher, which essentially has all of the logic in Go, with FindFirstChar effectively being a nop.
  3. With our current engine, it’s effectively zero cost to determine the bounds of a match once we know that there is a match. With the DFA-based engine, we don’t know the starting position of a match when we determine there is a match, only the ending position, and thus we need to do more work to determine the bounds of the match. These costs could be avoided for IsMatch operations, where the starting position isn’t relevant, but whereas the internal loop today that invokes FindFirstChar/Go knows whether this is IsMatch or not, that information isn’t available to overrides of FindFirstChar/Go, which means today it’s unavailable to source generated code.
  4. Even for the existing implementation, the split between FindFirstChar and Go leads to various inefficiencies. For example, every time we find a possible match location, we need to make two virtual calls, one to FindFirstChar and one to Go. Each of those calls needs to load the relevant state from fields. We end up duplicating some matching logic because FindFirstChar wants to validate some things about the match before handing it off to Go in order to avoid those exact costs through the outer bump loop. Etc.

These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see. We thus need just enough new surface area to support this. We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make an IsMatch(ReadOnlySpan<char>) work without recompilation, but it would involving converting the span to a string.

This issue mostly replaces #23602. However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.

API Proposal

This is a work-in-progress. We will need to prototype the APIs to confirm they’re both necessary and sufficient, and augment the proposal until they are.

We should:

  1. Expose a new virtual RegexRunner.Scan method:
namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected virtual void Scan(ReadOnlySpan<char> text);
    }
}

The base implementation would effectively contain the existing bumpalong loop that invokes both FindFirstChar and Go. It would compare the provided span against runtext and set runtext to span.ToString() only if they differ (as noted below, this base implementation will rarely be used and will be there purely for compat). We would also add several new protected fields or properties to go along with the ones already there, in order to expose further information to the Scan method (exposing protected fields makes me feel dirty, but it’s what RegexRunner already does with lots of fields):

namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected bool quick; // or `bool isMatch`
+        protected int timeoutMilliseconds;
+        protected int previousMatchLength;
    }
}

These are all inputs today to the internal Scan loop function and would be needed for a fully-functioning externally-provided implementation with the source generator. Alternatively, we could pass all of the values as additional arguments to the method.

  1. Override this Scan method on all our implementations and in the source generator-generated code: the only time you’d get the base implementation would be if you were using an old .NET Framework-generated CompileToAssembly assembly, and the only time it would lead to inefficiencies is if you then used those in combination with the new span-based public APIs. The implementations of FindFirstChar/Go for the interpreter, RegexOptions.Compiled, and source generator would all be combined into this single method. NonBacktracking effectively already has its combined.

  2. Make the existing FindFirstChar/Go/InitTrackCount methods virtual instead of abstract. The first two would throw NotImplementedExceptions; the third would return 0. None of our implementations would need to override these methods anymore, just Scan.

  3. Add new public IsMatch methods to Regex (these are the same set of overloads as for string, but with a ReadOnlySpan<char> input):

namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public bool IsMatch(ReadOnlySpan<char> input);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}
  1. Add a set of new Enumerate methods to Regex:
namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);

+        public ref struct MatchPositionEnumerator
+        {
+            public MatchPositionEnumerator GetEnumerator();
+            public bool MoveNext();
+            public Range Current { get; } // could also add a new struct for the return type here rather than System.Range
+        }
    }
}

Enumerate would not permit access to the full list of groups and captures, just the index/offset of the top-level capture, but in doing so, these Enumerate methods can become amortized zero-alloc: the enumerator is a ref struct, no objects are yielded, the input is a span, and the matching engine can reuse the internal Match object (and its supporting arrays) just as is done today with IsMatch to make it ammortized zero-alloc. If someone still needs the full details, they can fall back to using strings to begin with and the existing either Match or Matches, or (for some patterns, e.g. ones that don’t have anchors or lookaheads or lookbehinds that might go beyond the matching boundaries) re-run the engine with Match(matchString) just on the string representing the area of the input that matched. (The trouble with adding Regex.Match/Matches overloads for spans is the Match and MatchCollection types can’t store a Span; thus various surface area on these types couldn’t function with spans, like NextMatch… if we were to accept that, we could add span-based methods for those as well, but it would likely be confusing and inconsistent).

API Usage

bool matched = Regex.IsMatch(spanInput, "pattern");
...
foreach (Range r in Regex.Enumerate(spanInput, “pattern”))
{
    Process(spanInput[r]);
} 

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-needs-work, area-System.Text.RegularExpressions

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 27, 2021
@stephentoub
Copy link
Member Author

@jeffhandley, this would be a good issue for someone as a way to ramp up on S.T.RE while also accomplishing several impactful results. It could be split into a handful of PRs, each moving the ball forward, e.g.

PRs:

  1. Update the engines to read base.runtext once at the beginning of FindFirstChar and Go, AsSpan() it into a local, and then all reads of the text should be via that local for the rest of the method. All engines could be done together, or one PR each. No test changes should be needed. At that point, the implementations are all span-based, and everything else is hooking stuff up.

  2. (Optional) Update the implementations to slice that span once upfront based on runtextend (and possibly runtextbeg, though that could complicate some other things around the exact indices returned for matches), and then all subsequent checks can be based on the span length instead of runtextend (and maybe 0 instead of runtextbeg). The primary goal would be cleaning up the implementation a bit and helping perf by removing bounds checks in some cases on the span accesses. This would help highlight benefits of using span internally, but is not strictly necessary for the rest of the work.

  3. Refactor the existing Scan loop into an internal virtual method that looks like the one we'd want to expose as protected.

  4. Override that new virtual on each of the engines, other than in the source generator, which would require the functionality be exposed publicly. This could be one PR per engine or one all-up. This should include changing the existing FindFirstChar/Go overrides to be non-virtual and accept the input span from the Scan loop; the abstract methods can then be temporarily overridden with methods that just throw NotSupportedException.

At this point, the internal implementations should be able to fully support span inputs (which can't be passed in yet), and it should be trivial to prototype the public surface area in order to confirm it works and get it approved. At which point...

  1. Make FindFirstChar/Go virtual.

  2. Expose the new protected surface area and update the source generator to target it, overriding Scan instead of FindFirstChar/Go as was done with the other internal engines.

  3. Add the new public IsMatch methods taking spans.

  4. Add the new public Enumerate methods.

  5. (Optional) With one PR per interpreter, compiled, and source generator, update the implementations to combine FindFirstChar/Go into a single method, rather than being a loop that calls both. The goal here is perf, and in the case of the source generator, easier to understand code flow for someone reading/debugging the generated source.

@stephentoub
Copy link
Member Author

cc: @veanes, @olsaarik

@svick
Copy link
Contributor

svick commented Sep 27, 2021

Small question about the new static APIs: should they add several overloads (to stay consistent with the old API) or would it be better to add just one overload with optional parameters (to make the API simpler and easier to understand)?

@stephentoub
Copy link
Member Author

stephentoub commented Sep 27, 2021

Small question about the new static APIs: should they add several overloads (to stay consistent with the old API) or would it be better to add just one overload with optional parameters (to make the API simpler and easier to understand)?

TimeSpan's can't be optional, and the TimeSpan matchTimeout should be at the end to be consistent with all the other methods; it could be made nullable, but that then introduces multiple values for infinite, which I don't love. Combine that with a desire for consistency across all the APIs, and my preference is to use multiple overloads instead of optional parameters here. But it's something that can be discussed in API review.

@joperezr
Copy link
Member

joperezr commented Dec 9, 2021

protected virtual void Scan(ReadOnlySpan text);

Regarding this API suggestion, I suppose you meant protected internal virtual... instead of just protected right? If not, then it isn't possible to call it from the Regex.Run method.

@stephentoub
Copy link
Member Author

I suppose you meant protected internal virtual

Yes. (I left out the internal part as it's an implementation detail and not part of the public signature.)

@joperezr
Copy link
Member

joperezr commented Dec 9, 2021

Another thing that would need to be exposed in order for this to work (with the source generator) I believe is something in RegexRunner to be used as a temporary match object that the new virtual Scan would be in charge of setting. Basically, the new Span-based Scan method would be in charge of gathering the captures for the passed-in span, and then saving those into this new protected match field which could then be read by RegexRunner.Run() method to see if the Match was successful or not. We could consider using the existing protected RegexRunner.runmatch field for this, but today this one is reused for doing Match.NextMatch() on the Match object returned by the existing Match() methods, so we probably don't want to use that one.

The additional proposed field would be something like:

namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected Match? scanMatchResult;
    }
}

@stephentoub
Copy link
Member Author

@joperezr, can you help me understand why that's needed? You may very well be right, but I'm not seeing it. For example, SymbolicRegexRunner.Go is essentially the new Scan (albeit without the span argument) and doesn't need such a field:


It just uses Capture, and then the caller can see whether a match occurred by checking the state set by Capture.

@joperezr
Copy link
Member

but today this one is reused for doing Match.NextMatch() on the Match object returned

Looks like I got confused a bit and I was wrong about this assumption. That said, I did see that runmatch is not always set back to null after running Scan, for example, when quick is true (meaning when just doing IsMatch() and not actually returning Match to caller):

if (quick)
{
runmatch!.Text = null!; // drop reference
return null;
}

As well as when using Split/Replace:

if (!reuseMatchObject)
{
// We're not reusing match objects, so null out our field reference to the instance.
// It'll be recreated the next time one is needed.
runmatch = null;
}

In those cases, runmatch will be kept alive and reused and simply just reset to the right positions using:

if (runmatch is null)
{
// Use a hashtabled Match object if the capture numbers are sparse
runmatch = runregex!.caps is null ?
new Match(runregex, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart) :
new MatchSparse(runregex, runregex.caps, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart);
}
else
{
runmatch.Reset(runregex!, runtext!, runtextbeg, runtextend, runtextstart);
}

This may mean that it is ok for us to reuse since it will never be handed out to caller, so I will continue the prototype to see if it can be reused or if we would need the additional protected field in order to not break an existing scenario.

@joperezr joperezr removed the api-needs-work API needs work before it is approved, it is NOT ready for implementation label Feb 8, 2022
@joperezr joperezr added the api-ready-for-review API is ready for review, it is NOT ready for implementation label Feb 8, 2022
@joperezr
Copy link
Member

joperezr commented Feb 8, 2022

I have updated the top level proposal and factored out the Enumerate method for now. The proposed APIs have been prototyped in the following draft PR: #62509

@terrajobst terrajobst added the blocking Marks issues that we want to fast track in order to unblock other important work label Feb 14, 2022
@bartonjs
Copy link
Member

Looks good as proposed

namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
-        protected abstract bool FindFirstChar();
-        protected abstract void Go();
-        protected abstract void InitTrackCount();
+        protected virtual bool FindFirstChar() => throw new NotImplementedException(); //default implementation
+        protected virtual void Go() => throw new NotImplementedException(); //default implementation
+        protected virtual void InitTrackCount() => return 0; //default implementation

+        protected virtual void Scan(ReadOnlySpan<char> text);

    }
    public class Regex
    {
+        public bool IsMatch(ReadOnlySpan<char> input);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels Feb 15, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 17, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants