Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQ Pushdown support for Group By Plugin aggregators: Sum Of Squares,Corrected Sum of Squares #1580

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

sanjanasandeep
Copy link
Contributor

No description provided.

@sanjanasandeep sanjanasandeep added the build Trigger unit test build label Apr 13, 2022
@sanjanasandeep sanjanasandeep self-assigned this Apr 13, 2022
@@ -124,6 +124,8 @@ public class GroupByAggregator extends RecordReducibleAggregator<AggregateResult
put(GroupByConfig.Function.CONCATDISTINCT, "STRING_AGG(DISTINCT CAST(%s AS STRING) , \", \")");
put(GroupByConfig.Function.LOGICALAND, "COALESCE(LOGICAL_AND(%s), TRUE)");
put(GroupByConfig.Function.LOGICALOR, "COALESCE(LOGICAL_OR(%s), FALSE)");
put(GroupByConfig.Function.SUMOFSQUARES, "SUM(POW(%s, 2))");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the ANSI SQL POWER function. POW is a BigQuery-only method name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you mean SUM(POWER(%s, 2)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Also, the Spark function returns 0 if the input is empty, so we need to wrap this in a conditional block:

CASE WHEN COUNT(%s) > 0 THEN <this expression> ELSE 0 END

See https://github.com/cdapio/hydrator-plugins/pull/1575/files#diff-f285497c578d095fc4c85b1968df3e4dc52220cf71458410b5e1a347081f531fR121 for an example

@@ -124,6 +124,8 @@ public class GroupByAggregator extends RecordReducibleAggregator<AggregateResult
put(GroupByConfig.Function.CONCATDISTINCT, "STRING_AGG(DISTINCT CAST(%s AS STRING) , \", \")");
put(GroupByConfig.Function.LOGICALAND, "COALESCE(LOGICAL_AND(%s), TRUE)");
put(GroupByConfig.Function.LOGICALOR, "COALESCE(LOGICAL_OR(%s), FALSE)");
put(GroupByConfig.Function.SUMOFSQUARES, "SUM(POW(%s, 2))");
put(GroupByConfig.Function.CORRECTEDSUMOFSQUARES, "SUM(POW(%s, 2)) - SUM(%s)/COUNT(%s)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This format requires 3 arguments instead of one. In order to repeat the same element multiple times, see line 121:

String.format("%s %<s %<s", "bla") -> bla bla bla

Similar to the previous method, we need to use POWER (Ansi SQL) instead of POW (BigQuery specific). This way we can move this to the Ansi block. Please rebase with the latest.

The second SUM statement should be elevated to the power of 2.

Finally, this method will have a division by zero if COUNT(exp) = 0. Since the Spark implementation returns 0 if there are no entries, we can use a case statement to solve this:

CASE WHEN COUNT(%s) > 0 THEN <this expression> ELSE 0 END

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Trigger unit test build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants