Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Regex.Count #61425

Closed
stephentoub opened this issue Nov 10, 2021 · 9 comments · Fixed by #66026
Closed

[API Proposal]: Regex.Count #61425

stephentoub opened this issue Nov 10, 2021 · 9 comments · Fixed by #66026
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Nov 10, 2021

Background and motivation

With regexes, you often want to know how many matches there are in a given piece of text. There are various ways you can do that today, including:

public static int Count(Regex r, string input)
{
    int count = 0;
    Match m = r.Match(input);
    while (m.Success)
    {
        count++;
        m = m.NextMatch();
    }
    return count;
}

or

public static int Count(Regex r, string input) => r.Matches(input).Count();

but these have a variety of downsides. In addition to being non-obvious, they're more costly than they need to be. Both of them force Match objects and all the state they wrap (captures, etc.) into existence, when for counting that's not necessary, and the latter also forces the MatchCollection into existence and holds on to all of the Match objects until after the operation has completed.

We should add Count methods to Regex for this relatively common case.

API Proposal

namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public int Count(string input);
+        public static int Count(string input, string pattern);
+        public static int Count(string input, string pattern, RegexOptions options);
+        public static int Count(string input, string pattern, RegexOptions options, TimeSpan matchTimeout);

          // And once https://github.com/dotnet/runtime/issues/59629 is approved and we support spans
+        public int Count(ReadOnlySpan<char> input);
+        public static int Count(ReadOnlySpan<char>, string pattern);
+        public static int Count(ReadOnlySpan<char>, string pattern, RegexOptions options);
+        public static int Count(ReadOnlySpan<char>, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

API Usage

int numWords = Regex.Count(text, "\b\w+\b");

Alternative Designs

  • We could choose not to provide the static methods. They're just wrappers around Regex's internal cache of Regex objects.
  • Augment Regex extensibility point for better perf and span-based matching #59629 proposes an Enumerate method as the main workhorse for span-based inputs, and it doesn't materialize Match objects. Enumerate returns a ref struct. We could instead add a Count() method to that ref struct, and a consumer would write r.Enumerate(input).Count(). This would mean we wouldn't add any of the APIs outlined above and instead just add a public int Count(); method to that type.

Risks

Just additional surface area and all the typical concerns that come with that.

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions labels Nov 10, 2021
@stephentoub stephentoub added this to the 7.0.0 milestone Nov 10, 2021
@ghost
Copy link

ghost commented Nov 10, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

With regexes, you often want to know how many matches there are in a given piece of text. There are various ways you can do that today, including:

public static int Count(Regex r, string input)
{
    int count = 0;
    Match m = r.Match(input);
    while (m.Success)
    {
        count++;
        m = m.NextMatch();
    }
    return count;
}

or

public static int Count(Regex r, string input) => r.Matches(input).Count();

but these have a variety of downsides. In addition to being non-obvious, they're more costly than they need to be. Both of them force Match objects and all the state they wrap (captures, etc.) into existence, when for counting that's not necessary, and the latter also forces the MatchCollection into existence and holds on to all of the Match objects until after the operation has completed.

We should add Count methods to Regex for this relatively common case.

API Proposal

namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public int Count(string input);
+        public static int Count(string input, string pattern);
+        public static int Count(string input, string pattern, RegexOptions options);
+        public static int Count(string input, string pattern, RegexOptions options, TimeSpan matchTimeout);

          // And once https://github.com/dotnet/runtime/issues/59629 is approved and we support spans
+        public int Count(ReadOnlySpan<char> input);
+        public static int Count(ReadOnlySpan<char>, string pattern);
+        public static int Count(ReadOnlySpan<char>, string pattern, RegexOptions options);
+        public static int Count(ReadOnlySpan<char>, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

API Usage

int numWords = Regex.Count(text, "\b\w+\b");

Alternative Designs

  • We could choose not to provide the static methods. They're just wrappers around Regex's internal cache of Regex objects.

Risks

Just additional surface area and all the typical concerns that come with that.

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Text.RegularExpressions

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Nov 10, 2021
@MgSam
Copy link

MgSam commented Nov 11, 2021

What is a use case where you want the count of matches and not just Regex.IsMatch()?

@stephentoub
Copy link
Member Author

stephentoub commented Nov 11, 2021

What is a use case where you want the count of matches and not just Regex.IsMatch()?

How many words are in this document?
How many times does a document mention a particular person's name?
How common is it in our codebase to use Foo vs how common is it to use Bar?
...

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2022
@joperezr
Copy link
Member

I'm definitely not against this API, and I for sure see the use cases for it, my only question is: Are there any big advantages of having this API over the current one-line workaround (Regex.Matches("\b\w\b", myInput).Count()) apart from discoverability? For example, one thing that comes to mind is that since we wouldn't need the Matches objects perhaps we can make a more performant API with less allocations as we can just reuse the internal match object?

@joperezr
Copy link
Member

Nevermind, I see you talk about this in the proposal and missed it. I think this proposal is ready to review.

@joperezr joperezr added the api-ready-for-review API is ready for review, it is NOT ready for implementation label Jan 19, 2022
@stephentoub
Copy link
Member Author

Thanks, @joperezr. I'm curious if you have any thoughts on the Alternative Designs section?

@joperezr
Copy link
Member

Regarding the static methods, I think it is fine to have them for consistency with the rest of the Regex API as well as for convenience. I feel like people more often use our static methods as opposed to creating instances, or at least I’ve noticed that they are just as popular.

Regarding using a Count on Enumerate instead, I haven’t given too much thought on that one part of the proposal yet, but at least to me it feels like discoverability of the functionality would hurt if we decide to do that instead, and perhaps more people would be inclined to find different solutions (like calling Count() on Matches result). Making this an API directly on Regex type makes it discoverable and preferable to write over the other alternatives.

That is my 2 cents from your alternative designs, but as always I’m open for discussion.

@stephentoub
Copy link
Member Author

Ok, thanks. Let's stick with the original proposal then.

I feel like people more often use our static methods as opposed to creating instances, or at least I’ve noticed that they are just as popular.

Yeah, they're just a potential pit of performance failure, so I get a little sad each time I see them used in any larger app. But they are nicely convenient.

@terrajobst
Copy link
Member

terrajobst commented Jan 25, 2022

Video

  • Looks good as proposed
  • We should consider adding an analyzer the looks for code that does Regex.Count(...) > 0 and suggests Regex.IsMatch(...)
namespace System.Text.RegularExpressions
{
    public partial class Regex
    {
        public int Count(string input);
        public static int Count(string input, string pattern);
        public static int Count(string input, string pattern, RegexOptions options);
        public static int Count(string input, string pattern, RegexOptions options, TimeSpan matchTimeout);

        // And once https://github.com/dotnet/runtime/issues/59629 is approved and we support spans
        public int Count(ReadOnlySpan<char> input);
        public static int Count(ReadOnlySpan<char> input, string pattern);
        public static int Count(ReadOnlySpan<char> input, string pattern, RegexOptions options);
        public static int Count(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 25, 2022
@stephentoub stephentoub self-assigned this Jan 25, 2022
@stephentoub stephentoub removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 26, 2022
@stephentoub stephentoub removed their assignment Jan 26, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 1, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants