Refactor function registry for multi-stage engine #13573

Merged
merged 4 commits into from
Jul 17, 2024

Conversation

Jackie-Jiang
Contributor

Here are the main changes:

  • Do not register any function in the catalog. PinotCatalog is just a wrapper over table cache to resolve database name and extract table schema.
  • Register all function signatures into PinotOperatorTable
  • The following functions are registered in the PinotOperatorTable:
    • Selected standard function operators from Calcite SqlStdOperatorTable
    • Pinot custom function operators
    • Aggregation function types
    • Transform function types
    • Scalar functions
  • Customize function lookup to follow the single-stage engine convention: ignore case and underscore within function names
  • Add a framework to support scalar function classes with polymorphism
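As a rough illustration of the "ignore case and underscore" lookup convention listed above, here is a minimal sketch. The class name `FunctionNameCanonicalizer` is hypothetical, and lower-casing (rather than upper-casing) is an assumption; the PR description only says that case and underscores are ignored.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not the actual Pinot FunctionRegistry code. */
final class FunctionNameCanonicalizer {
  private FunctionNameCanonicalizer() {
  }

  /**
   * Maps equivalent spellings of a function name to one lookup key by
   * removing underscores and lower-casing (assumed canonical form).
   */
  static String canonicalize(String name) {
    return name.replace("_", "").toLowerCase(Locale.ROOT);
  }
}
```

With this convention, `DATE_TRUNC`, `dateTrunc`, and `datetrunc` would all resolve to the same key.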

@Jackie-Jiang added the release-notes, bugfix, cleanup, refactor, and multi-stage labels on Jul 9, 2024
@codecov-commenter

codecov-commenter commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 0.70053% with 567 lines in your changes missing coverage. Please review.

Project coverage is 27.72%. Comparing base (59551e4) to head (96040f0).
Report is 779 commits behind head on master.

Files Patch % Lines
...che/pinot/segment/spi/AggregationFunctionType.java 0.00% 132 Missing ⚠️
...apache/pinot/common/function/FunctionRegistry.java 0.00% 97 Missing ⚠️
...e/pinot/common/function/TransformFunctionType.java 0.00% 86 Missing ⚠️
...ache/pinot/calcite/sql/fun/PinotOperatorTable.java 0.00% 68 Missing ⚠️
...rg/apache/pinot/common/function/FunctionUtils.java 0.00% 26 Missing ⚠️
...ot/calcite/rel/rules/PinotEvaluateLiteralRule.java 0.00% 22 Missing ⚠️
...ery/runtime/operator/operands/FunctionOperand.java 0.00% 19 Missing ⚠️
...el/rules/PinotAggregateExchangeNodeInsertRule.java 0.00% 18 Missing ⚠️
.../java/org/apache/pinot/query/QueryEnvironment.java 0.00% 13 Missing ⚠️
...r/transform/function/TransformFunctionFactory.java 0.00% 12 Missing ⚠️
... and 19 more

❗ There is a different number of reports uploaded between BASE (59551e4) and HEAD (96040f0). Click for more details.

HEAD has 15 uploads less than BASE
Flag BASE (59551e4) HEAD (96040f0)
temurin 12 10
java-21 7 6
skip-bytebuffers-true 3 2
skip-bytebuffers-false 7 5
unittests 5 1
unittests1 2 0
java-11 5 4
unittests2 3 1
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #13573       +/-   ##
=============================================
- Coverage     61.75%   27.72%   -34.03%     
+ Complexity      207      198        -9     
=============================================
  Files          2436     2553      +117     
  Lines        133233   140470     +7237     
  Branches      20636    21851     +1215     
=============================================
- Hits          82274    38949    -43325     
- Misses        44911    98533    +53622     
+ Partials       6048     2988     -3060     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 27.72% <0.70%> (-33.99%) ⬇️
java-21 <0.01% <0.00%> (-61.63%) ⬇️
skip-bytebuffers-false 27.72% <0.70%> (-34.02%) ⬇️
skip-bytebuffers-true <0.01% <0.00%> (-27.73%) ⬇️
temurin 27.72% <0.70%> (-34.03%) ⬇️
unittests 27.72% <0.70%> (-34.03%) ⬇️
unittests1 ?
unittests2 27.72% <0.70%> (-0.01%) ⬇️


Collaborator

@yashmayya yashmayya left a comment

Thanks @Jackie-Jiang, this is a really nice improvement with lots of cleanups! I had a few minor comments and questions to better my understanding of these areas.

return call;
public boolean canReduce(AggregateCall call) {
SqlKind kind = call.getAggregation().getKind();
return kind == SqlKind.SUM || kind == SqlKind.AVG;
Collaborator

Why do we need this customized rule? Which of the original Calcite rule's reductions don't work in Pinot?

Contributor Author

Added some javadoc to make it clear. Currently STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, COVAR_POP, and COVAR_SAMP break because of the original rule. Take a look at the changes in StatisticAggregates.json.

Collaborator

Ah okay, makes sense now, thanks!

Contributor

Does this rule apply only in the leaf stage, or also in the intermediate stage? And how do we merge data from different workers when the aggregation is not in the simpler form?

Contributor Author

When this rule is applied, we don't have the leaf-stage concept yet (the leaf stage is determined by PinotAggregateExchangeNodeInsertRule).
We don't really need these rewrites because Pinot can directly handle SUM and AVG with proper null handling (the rule is needed for engines without proper null handling). I didn't remove the rule outright because that is out of the scope of this PR, and null handling support requires some more tweaks.

@Jackie-Jiang force-pushed the function_registry branch 3 times, most recently from c6f2303 to 5447423, on July 11, 2024 02:34
@yashmayya
Collaborator

Looks like this test needs to be updated with the new exception type and message due to the changes in FunctionOperand

Contributor

@gortiz gortiz left a comment

Still need more time to finish the first read and probably have a second one :D

Comment on lines +210 to +236
@Deprecated
@Nullable
public static FunctionInfo getFunctionInfo(String name, int numArguments) {
return lookupFunctionInfo(canonicalize(name), numArguments);
Contributor

I would even recommend creating a class called CanonicalName that wraps a String, and then using that class as the input. We may have a static class that transforms Strings into CanonicalNames.

This will let us:

  • Have type-safe checks, so Java doesn't let us call lookupFunctionInfo with non-canonical names.
  • Cache the CanonicalNames, so we don't need to allocate on every lookup.
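The suggestion above could look roughly like the following sketch. Everything here is hypothetical: the PR does not contain a `CanonicalName` class, the factory name `of` is invented, and the canonical form (underscores removed, lower-cased) is an assumption.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical wrapper type; a lookup method could accept this instead of a raw String. */
final class CanonicalName {
  // Cache keyed by the raw spelling, so repeated lookups of the same spelling do not allocate.
  // Note: this cache is unbounded, which is only acceptable if the set of spellings is small.
  private static final ConcurrentMap<String, CanonicalName> CACHE = new ConcurrentHashMap<>();

  private final String _value;

  private CanonicalName(String value) {
    _value = value;
  }

  /** The only way to obtain an instance, so every CanonicalName holds a canonical string. */
  static CanonicalName of(String rawName) {
    return CACHE.computeIfAbsent(rawName,
        raw -> new CanonicalName(raw.replace("_", "").toLowerCase(Locale.ROOT)));
  }

  String value() {
    return _value;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof CanonicalName && _value.equals(((CanonicalName) o)._value);
  }

  @Override
  public int hashCode() {
    return _value.hashCode();
  }
}
```

Because the constructor is private, a method declared as `lookupFunctionInfo(CanonicalName name, int numArguments)` could not be called with a non-canonical string by mistake.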

Comment on lines +299 to +326
public FunctionInfo getFunctionInfo(int numArguments) {
FunctionInfo functionInfo = _functionInfoMap.get(numArguments);
return functionInfo != null ? functionInfo : _functionInfoMap.get(VAR_ARG_KEY);
Contributor

I think it would be cool to have documented somewhere how functions can be registered. AFAIU that should be something like:

  • Using annotated methods: Simpler and shorter, but less expressive. For example, you cannot support polymorphism.
  • Using annotated classes that implement PinotScalarFunction: More expressive.

From @Jackie-Jiang's comment here, it looks like there is a third way, which consists of registering the function explicitly in PinotOperatorTable. But then the function won't be usable in V1, am I right?
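For readers following along, the per-arity lookup in the snippet under review (exact argument count first, then a var-arg fallback) can be sketched in isolation as below. This is a simplified stand-in, not the PR's code: it stores a `String` instead of a `FunctionInfo`, and the sentinel value `-1` for `VAR_ARG_KEY` is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical stand-in for a per-function holder keyed by argument count. */
final class ArityLookupSketch {
  // Assumed sentinel key under which a var-arg implementation is stored.
  static final int VAR_ARG_KEY = -1;

  private final Map<Integer, String> _implementationByArity = new HashMap<>();

  void register(int numArguments, String implementation) {
    _implementationByArity.put(numArguments, implementation);
  }

  /** Prefers the exact-arity match; falls back to the var-arg entry; returns null if neither exists. */
  String lookup(int numArguments) {
    String exact = _implementationByArity.get(numArguments);
    return exact != null ? exact : _implementationByArity.get(VAR_ARG_KEY);
  }
}
```

A lookup keyed only on argument count cannot distinguish two overloads with the same arity but different argument types, which is where a class-based, polymorphic scalar function would be needed.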

@Jackie-Jiang force-pushed the function_registry branch 6 times, most recently from 36f30a4 to d8e9770, on July 17, 2024 00:19
Comment on lines +68 to +76
if (functionInfo == null) {
  if (FunctionRegistry.contains(canonicalName)) {
    throw new IllegalArgumentException(
        String.format("Unsupported function: %s with argument types: %s", functionName,
            Arrays.toString(argumentTypes)));
  } else {
    throw new IllegalArgumentException(String.format("Unsupported function: %s", functionName));
  }
}
Contributor

Ideally we should also include the variants that were not selected.

We could do that by adding a method in FunctionRegistry that returns all FunctionInfo for a given name and then showing these options here.

I don't think it has to be added in this PR, but it is something that will be useful to debug problems
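A rough sketch of what such an error message could look like follows. It assumes a hypothetical way to obtain the registered signatures for a name as plain strings; no such method exists in FunctionRegistry as of this PR, and the class and method names below are invented for illustration.

```java
import java.util.List;

/** Hypothetical message builder; the list of registered signatures is supplied by the caller. */
final class UnsupportedFunctionMessage {
  private UnsupportedFunctionMessage() {
  }

  /**
   * Builds the error text. When the name is known but no variant matched,
   * the registered variants are appended to help with debugging.
   */
  static String build(String functionName, String argumentTypes, List<String> registeredSignatures) {
    if (registeredSignatures.isEmpty()) {
      // The name is not registered at all.
      return String.format("Unsupported function: %s", functionName);
    }
    return String.format("Unsupported function: %s with argument types: %s. Registered variants: %s",
        functionName, argumentTypes, String.join(", ", registeredSignatures));
  }
}
```

As the reply to this comment notes, matches can also come from type inference, so a flat list of signatures may not describe every accepted call.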

Contributor Author

It's a little bit tricky, though, because the matches could be done via type inference. We can probably add some usage info within each PinotScalarFunction, which can be looked up in the FunctionRegistry. Added a TODO to follow up.

Comment on lines 152 to +153
  "sql": "SELECT PERCENTILE_TDIGEST(float_col, 50), PERCENTILE_TDIGEST(double_col, 5), PERCENTILE_TDIGEST(int_col, 75), PERCENTILE_TDIGEST(long_col, 75) FROM {tbl}",
- "outputs": [[1.75, 1.0, 137, 137]]
+ "outputs": [[1.75, 1.0, 137.75, 137.75]]
Contributor

Here we are changing the return type, right? Couldn't this cause problems in production? Does it only happen with PERCENTILE_TDIGEST, or are there other aggregations whose type can change?

Contributor Author

Only PERCENTILE_TDIGEST, where we registered the wrong return type before. This is actually a bugfix.

Labels: bugfix, cleanup, multi-stage, refactor, release-notes, v1v2
5 participants