Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize SQL IN(?, ?, ...) statements to "in(?)" to reduce cardinality of db.statement attribute #10564

Merged
merged 6 commits into from
Mar 12, 2024

Conversation

swar8080
Copy link
Contributor

Went for a simple implementation, but let me know if any of the following would be preferred:

  • A way to turn this new behaviour on/off
  • Decoupling sanitizing and normalizing sql statements into separate classes/abstractions
  • Implementing normalization as part of the jflex parsing to avoid another pass through the query string. I wouldn't have the bandwidth to implement that though

Closes #10442

@swar8080 swar8080 requested a review from a team February 15, 2024 23:27
@swar8080 swar8080 force-pushed the sql-in-statement-normalization branch from bf176d0 to 5723c1a Compare February 16, 2024 12:23
…ity of span metrics using db.statement as an attribute
@swar8080 swar8080 force-pushed the sql-in-statement-normalization branch from 5723c1a to c8dbb13 Compare February 16, 2024 12:38
Copy link
Member

@trask trask left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be done without using a regular expression? (my concern is performance)

@zeitlinger
Copy link
Member

can this be done without using a regular expression? (my concern is performance)

it's precompiled - and it doesn't use pathological patterns

@swar8080 swar8080 changed the title Normalize SQL IN(?, ?, ...) statements to "in(?)" to reduce cardinality of db.statement attribute Draft: Normalize SQL IN(?, ?, ...) statements to "in(?)" to reduce cardinality of db.statement attribute Feb 21, 2024
@swar8080
Copy link
Contributor Author

my bad, this is actually causing a stack overflow for an IN statement with 2000 values

ill see if there's a different approach

@swar8080
Copy link
Contributor Author

I switched to a simpler regular expression that a ReDos checker said is linear complexity. It no longer causes a stack overflow, which I added a test case for

Also, I did some profiles to compare performance on an IN statement that has 10k values. Didn't see a noticeable difference for what it's worth

Without the normalization:
image

With the normalization:
image

@swar8080 swar8080 changed the title Draft: Normalize SQL IN(?, ?, ...) statements to "in(?)" to reduce cardinality of db.statement attribute Normalize SQL IN(?, ?, ...) statements to "in(?)" to reduce cardinality of db.statement attribute Feb 23, 2024
@@ -52,6 +54,9 @@ WHITESPACE = [ \t\r\n]+
// max length of the sanitized statement - SQLs longer than this will be trimmed
static final int LIMIT = 32 * 1024;

private static final Pattern IN_STATEMENT_PATTERN = Pattern.compile("\\sin\\s*\\(\\s*\\?[\\s?,]*?\\)", Pattern.CASE_INSENSITIVE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also matches inputs like in (?,,,???) perhaps using "(\\sin\\s*)\\(\\s*\\?\\s*(,\\s*\\?\\s*)*\\)" and replacing with "$1(?)" would be better. These regular expressions are hard to parse, maybe we should try to document them to make them easier to understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the (,\\s*\\?\\s*)* part of that pattern causes a stack overflow for IN statements with many values

Let me know if matching invalid syntax like in (?,,,???) is ok since it'd hide info helpful for debugging bad queries. I'd guess that info's available in most sql library stack traces though

Also, added some documentation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the (,\s*\?\s*)* part of that pattern causes a stack overflow for IN statements with many values

using a possessive quantifier should fix this, try "(\\sin\\s*)\\(\\s*\\?\\s*(,\\s*\\?\\s*)*+\\)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, that solves the problem

@github-actions github-actions bot added the test native This label can be applied to PRs to trigger them to run native tests label Feb 24, 2024
@laurit laurit removed the test native This label can be applied to PRs to trigger them to run native tests label Feb 26, 2024
@laurit
Copy link
Contributor

laurit commented Feb 26, 2024

closing and reopening to trigger checks

@laurit laurit closed this Feb 26, 2024
@laurit laurit reopened this Feb 26, 2024
@@ -52,6 +54,10 @@ WHITESPACE = [ \t\r\n]+
// max length of the sanitized statement - SQLs longer than this will be trimmed
static final int LIMIT = 32 * 1024;

// Match on "IN(?, ?, ...)"
private static final Pattern IN_STATEMENT_PATTERN = Pattern.compile("(\\sin\\s*)\\(\\s*\\?\\s*(,\\s*\\?\\s*)*+\\)", Pattern.CASE_INSENSITIVE);
private static final String IN_STATEMENT_NORMALIZED = " in(?)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final String IN_STATEMENT_NORMALIZED = " in(?)";
private static final String IN_STATEMENT_NORMALIZED = "$1(?)";

Sanitizer does not change case or remove whitespace from the original query. Lets keep in as it was in the original query. longInStatementDoesntCauseStackOverflow will break after this change as there is a space between in and ( that is currently removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I updated it to preserve case and whitespace, and updated the test cases to check for that. I didn't add a test case for more than one space between IN and ( since strings like IN (?) get sanitized to IN (?)

Also switched to a non-capturing group for matching on the part in-between the brackets as a small optimization

@swar8080 swar8080 closed this Feb 26, 2024
@swar8080 swar8080 reopened this Feb 26, 2024
@swar8080 swar8080 force-pushed the sql-in-statement-normalization branch from f762e93 to c52df11 Compare February 26, 2024 23:52
@swar8080 swar8080 closed this Feb 27, 2024
@swar8080 swar8080 reopened this Feb 27, 2024
@swar8080
Copy link
Contributor Author

swar8080 commented Mar 5, 2024

HI @laurit, let me know if anything else should be done here

Would be great to have this in 2.2.0

@trask trask added this to the v2.2.0 milestone Mar 5, 2024
@trask trask merged commit f6a12b4 into open-telemetry:main Mar 12, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Normalize "IN(?, ?, ...)" substrings in db.statement span attribute to "IN(?)"
4 participants