Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function, deprecate regexp_is_match_utf8 and regexp_is_match_utf8_scalar #6376

Open
wants to merge 5 commits into
base: master
from

Conversation

tlm365

@tlm365 tlm365 commented Sep 10, 2024

Which issue does this PR close?

Closes #6370.

Rationale for this change

  1. Natively operate on StringViewArray without having to convert first to StringArray
  2. (Potentially) take advantage of the new string view layout

What changes are included in this PR?

Introduce regexp_is_match and regexp_is_match_scalar, which can replace regexp_is_match_utf8 and regexp_is_match_utf8_scalar and accept StringArray / LargeStringArray / StringViewArray arguments.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 10, 2024
@alamb
Contributor

alamb commented Sep 10, 2024

Thanks @tlm365 ❤️

I am running the benchmarks on this PR now and will report back when they are complete

Contributor

@alamb alamb left a comment

Thank you @tlm365 -- this is looking really nice

I wonder if you might also be willing to add StringView to the benchmarks as well, specifically

fn bench_regexp_is_match_utf8_scalar(arr_a: &StringArray, value_b: &str) {
    regexp_is_match_utf8_scalar(
        criterion::black_box(arr_a),
        criterion::black_box(value_b),
        None,
    )
    .unwrap();
}

So that if this code is changed in the future we can ensure it doesn't regress in performance

);
test_flag_utf8!(
test_utf8_array_regexp_is_match_insensitive_2,
StringViewArray::from(vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrow"]),
Contributor

StringViewArray has special case handling for strings that are more than 12 bytes long (the string data is stored out of band in those cases)

Can you please add tests that have some strings that are longer than 12 bytes?
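
To illustrate why the 12-byte boundary matters for test coverage, here is a minimal, std-only sketch of the string view layout. The `View` enum, `MAX_INLINE_LEN`, and `make_view` are hypothetical names invented for this illustration and are not arrow-rs types; the real implementation packs each view into a 16-byte value and may use several data buffers.

```rust
/// Simplified model of one string view (hypothetical, not arrow-rs's real type).
#[derive(Debug, PartialEq)]
enum View {
    /// Strings of at most 12 bytes are stored entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep only a 4-byte prefix in the view; the full bytes
    /// live in a separate data buffer, located by buffer index and offset.
    OutOfBand { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Longest string (in bytes) that can be stored inline.
const MAX_INLINE_LEN: usize = 12;

/// Builds a view for `s`, appending to `buffer` only when `s` is too long to inline.
/// This sketch assumes a single data buffer, so `buffer_index` is always 0.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= MAX_INLINE_LEN {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::OutOfBand { len, prefix, buffer_index: 0, offset }
    }
}
```

Under this model, a test value such as "arrow" (5 bytes) only exercises the inline branch, while a value such as "arrow string view" (17 bytes) exercises the out-of-band branch, which is the case the review asks to cover.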

Author

@tlm365 tlm365 Sep 10, 2024

Can you please add tests that have some strings that are longer than 12 bytes?

Yes, noted. I will review and update test cases for this scenario.

///
/// See the documentation on [`regexp_is_match_utf8`] for more details.
-pub fn regexp_is_match_utf8_scalar<OffsetSize: OffsetSizeTrait>(
-    array: &GenericStringArray<OffsetSize>,
+pub fn regexp_is_match_utf8_scalar<'a, S>(
Contributor

Unfortunately, I think this is an API change (as is the above).

Contributor

I have an idea of how to update this PR to avoid an API change -- the reason this is important is that a breaking API change would need to wait until the next major release (Dec 2024) per the release schedule: https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule

TLDR is I think if we introduced a new function like the following:

fn regexp_is_match(
    array: &dyn Array,
    regex_array: &dyn Array,
    flags_array: Option<&dyn Array>,
) -> Result<BooleanArray, ArrowError> {
    ..
}

We could then support StringView, StringArray, and LargeStringArray.
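
As a rough illustration of the `&dyn Array` idea, the following std-only sketch shows a single entry point that downcasts to each supported concrete type and then runs one shared kernel. Everything here is an assumption made for illustration: the `Array` trait, `OwnedStrings`, `BorrowedStrings`, `is_match_impl`, and `is_match_dyn` are hypothetical stand-ins, and a substring test replaces real regular-expression matching. It is not the code in this PR.

```rust
use std::any::Any;

/// Stand-in for a dynamically typed array trait (hypothetical, std-only).
trait Array {
    fn as_any(&self) -> &dyn Any;
}

/// Toy array backed by owned strings, standing in for StringArray.
struct OwnedStrings(Vec<String>);

/// Toy array backed by borrowed slices, standing in for StringViewArray.
struct BorrowedStrings(Vec<&'static str>);

impl Array for OwnedStrings {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for BorrowedStrings {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Shared kernel: a substring test stands in for real regex matching.
fn is_match_impl<'a>(values: impl Iterator<Item = &'a str>, needle: &str) -> Vec<bool> {
    values.map(|v| v.contains(needle)).collect()
}

/// Single entry point that downcasts to each supported concrete type.
fn is_match_dyn(array: &dyn Array, needle: &str) -> Result<Vec<bool>, String> {
    let any = array.as_any();
    if let Some(a) = any.downcast_ref::<OwnedStrings>() {
        Ok(is_match_impl(a.0.iter().map(|s| s.as_str()), needle))
    } else if let Some(a) = any.downcast_ref::<BorrowedStrings>() {
        Ok(is_match_impl(a.0.iter().copied(), needle))
    } else {
        Err("unsupported array type".to_string())
    }
}
```

The trade-off in this style is that an unsupported type is reported at run time as an error, whereas a generic signature rejects it at compile time.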

Author

@tlm365 tlm365 Sep 11, 2024

TLDR is I think if we introduced a new function like the following:

@alamb Sounds good 👍 But why do we use &dyn Array for the new regexp_is_match function instead of keeping the current implementation?

Or am I misunderstanding you? I understand that we will provide a new regexp_is_match function, and mark the current regexp_is_match_utf8 function as:

#[deprecated(since="54.0.0", note="please use `regexp_is_match` instead")]
pub fn regexp_is_match_utf8(...) { ... }

Is that right? 🤔

@alamb alamb added the api-change Changes to the arrow API label Sep 10, 2024
@tlm365
Author

tlm365 commented Sep 10, 2024

I wonder if you might also be willing to add StringView to the benchmarks as well, specifically
So that if this code is changed in the future we can ensure it doesn't regress in performance

@alamb Thanks for reviewing. I'm happy to add a benchmark for this one and will update the PR soon.

@tlm365 tlm365 marked this pull request as draft September 11, 2024 02:35
Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
@alamb
Contributor

alamb commented Sep 11, 2024

Here are the benchmark results (i.e., this PR doesn't slow down the existing implementation):

++ critcmp master regex-is-match-utf8
group                                                     master                                 regex-is-match-utf8
-----                                                     ------                                 -------------------
regexp_matches_utf8 scalar ends with                      1.02  1932.3±20.07µs        ? ?/sec    1.00  1898.4±17.40µs        ? ?/sec
regexp_matches_utf8 scalar starts with                    1.00  1932.1±14.07µs        ? ?/sec    1.00  1924.9±26.17µs        ? ?/sec

@tlm365 tlm365 changed the title Implement native support StringViewArray for regexp_is_match_utf8 and regexp_is_match_utf8_scalar function Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function Sep 17, 2024
Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
@tlm365 tlm365 marked this pull request as ready for review September 18, 2024 10:05
@alamb
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Contributor

@alamb alamb left a comment

Thank you very much @tlm365 -- this looks great.

I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:

  1. I fixed clippy (was failing due to using deprecated functions)
  2. I updated the comments / added an example to ease the transition (passing None as the flags argument results in type inference errors without some type help)
  3. Improved the comments in general making it clearer what regexp_is_match does and how it is related to regexp_match
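
The type inference problem mentioned in item 2 can be reproduced in a small std-only sketch. The function names below are hypothetical and a substring test stands in for regex matching; the point is only that when the flags parameter is generic, a bare `None` gives the compiler nothing from which to infer the type, so the caller has to annotate it.

```rust
/// Hypothetical kernel: `flags` is generic over anything viewable as `&str`.
fn is_match_with_flags<F: AsRef<str>>(value: &str, pattern: &str, flags: Option<&F>) -> bool {
    // Only the "i" (case-insensitive) flag is modelled in this sketch.
    let ignore_case = flags.map_or(false, |f| f.as_ref().contains('i'));
    if ignore_case {
        value.to_lowercase().contains(&pattern.to_lowercase())
    } else {
        value.contains(pattern)
    }
}

/// Calls the kernel without any flags.
fn is_match_no_flags(value: &str, pattern: &str) -> bool {
    // Writing `is_match_with_flags(value, pattern, None)` does not compile:
    // the compiler reports that type annotations are needed for `F`.
    // Annotating the binding (or writing `None::<&String>`) resolves it.
    let flags: Option<&String> = None;
    is_match_with_flags(value, pattern, flags)
}
```

Any type that satisfies the bound works in the annotation; `String` is used here only because it is the most familiar choice.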

/// * [`regexp_is_match_scalar`] for matching a single regular expression against an array of strings
/// * [`regexp_match`] for extracting groups from a string array based on a regular expression
///
/// # Example
Contributor

I added an example to help the migration

pub fn regexp_is_match_utf8<OffsetSize: OffsetSizeTrait>(
    array: &GenericStringArray<OffsetSize>,
    regex_array: &GenericStringArray<OffsetSize>,
    flags_array: Option<&GenericStringArray<OffsetSize>>,
) -> Result<BooleanArray, ArrowError> {
    regexp_is_match(array, regex_array, flags_array)
}
Contributor

I switched the implementation to just call the new function to avoid duplication
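
The deprecate-and-delegate pattern described here can be sketched without any arrow types. The names `is_match` and `is_match_utf8` are hypothetical, a substring test stands in for regex matching, and the `since` value is simply the one proposed earlier in this thread.

```rust
/// New entry point: generic over anything that can be viewed as `&str`.
/// A substring test stands in for real regex matching in this sketch.
fn is_match<S: AsRef<str>>(values: &[S], pattern: &str) -> Vec<bool> {
    values.iter().map(|v| v.as_ref().contains(pattern)).collect()
}

/// Old entry point, kept so that existing callers continue to compile.
#[deprecated(since = "54.0.0", note = "please use `is_match` instead")]
fn is_match_utf8(values: &[String], pattern: &str) -> Vec<bool> {
    // Delegate to the new function instead of duplicating the logic.
    is_match(values, pattern)
}
```

Callers of the old function still get the same results but see a deprecation warning that points at the replacement, and there is only one implementation to maintain.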

@alamb alamb changed the title Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function, deprecate regexp_is_match_utf8 and regexp_is_match_utf8_scalar Sep 18, 2024
@alamb
Contributor

alamb commented Sep 19, 2024

@tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR.

Again, I am sorry about the review delays.

@tlm365
Author

tlm365 commented Sep 19, 2024

@tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR.

Again, I am sorry about the review delays.

@alamb Oops, thank you so much for reviewing. Sorry 🙇 I've been a little busy lately. Noted and will come back to review this weekend.

@tlm365
Author

tlm365 commented Sep 21, 2024

Thank you very much @tlm365 -- this looks great.

I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:

  1. I fixed clippy (was failing due to using deprecated functions)
  2. I updated the comments / added an example to ease the transition (passing None as the flags argument results in type inference errors without some type help)
  3. Improved the comments in general making it clearer what regexp_is_match does and how it is related to regexp_match

@alamb it looks very nice 👍 thank you so much for this update! ❤️

Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement regexp_is_match_utf8 and regexp_is_match_utf8_scalar for StringViewArray
2 participants