Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluates COALESCE arguments past first non-NULL value #8910

Open
tv42 opened this issue Jan 19, 2024 · 5 comments
Open

Evaluates COALESCE arguments past first non-NULL value #8910

tv42 opened this issue Jan 19, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@tv42
Copy link
Contributor

tv42 commented Jan 19, 2024

Describe the bug

Datafusion evaluates COALESCE(a, b, c) expressions even past the first non-NULL value, and bubbles up runtime errors from them.

This is wrong because the COALESCE could be explicitly protecting against e.g. the divide-by-zero case, and differs from other sql engines.

To Reproduce

SELECT COALESCE(1, 2/0);
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Divide by zero error

Expected behavior

COALESCE to return the first non-NULL value, not bubble up errors from the rest.

Compare to SQLite:

sqlite> SELECT COALESCE(1, 2/0);
1

Compare to Postgres:

postgres=# SELECT COALESCE(1, 2/0);
 coalesce
----------
        1
(1 row)

Additional context

No response

@tv42 tv42 added the bug Something isn't working label Jan 19, 2024
@tv42
Copy link
Contributor Author

tv42 commented Jan 19, 2024

This might share the root cause with #8909.

@alamb
Copy link
Contributor

alamb commented Jan 21, 2024

This may be fixed by #8928

@jackwener
Copy link
Member

jackwener commented Jan 22, 2024

This may be fixed by #8928

This issue is related with #8928 but is different with it.

Let me have explain this problem: It's cause by expression eval.

When we try to fold expression, we will eval expression. When we eval expression, it may occur Runtime Error.
But short-circuit expression can be short-circuit in Runtime, so I think we should skip some error and then use original (a possible solution) expression when fold short-circuit expression.

#8928 avoid short-circuit expression to extract common expression, but can't avoid fold expression

how do you think about it? @alamb @liukun4515

BTW, Are @haohuaijin interested in this? If yes, when you meet some problem or need review, please feel free to @me.

@haohuaijin
Copy link
Contributor

haohuaijin commented Jan 22, 2024

Thank you @jackwener. I am interested in it. I have two solutions.

First

We can first fold the expression whenever, and if we encounter an error, return the origin expression in the below code, and the error will occur in runtime if it is a real error(not the short-circuit problem).
https://github.com/apache/arrow-datafusion/blob/2b218be67a6c412629530b812836a6cec76efc32/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L282-L287

Second

don't simplify short-circuit expressions' all sub-expressions.


I think this issue is also related to #8927, because the scalar function first evaluates all args and then execution the actual function, so even though we don't fold the expression, the query in this issue will also throw error.
I comment out SimplifyExpressions in the main branch and got the following result.

❯ SELECT COALESCE(1, 2/0);
Arrow error: Divide by zero error
❯ explain SELECT COALESCE(1, 2/0);
+---------------+-------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                |
+---------------+-------------------------------------------------------------------------------------+
| logical_plan  | Projection: coalesce(Int64(1), Int64(2) / Int64(0))                                 |
|               |   EmptyRelation                                                                     |
| physical_plan | ProjectionExec: expr=[coalesce(1, 2 / 0) as coalesce(Int64(1),Int64(2) / Int64(0))] |
|               |   PlaceholderRowExec                                                                |
|               |                                                                                     |
+---------------+-------------------------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds.

@alamb
Copy link
Contributor

alamb commented Jan 22, 2024

We can first fold the expression whenever, and if we encounter an error, return the origin expression in the below code, and the error will occur in runtime if it is a real error(not the short-circuit problem).

This makes sense to me @haohuaijin and @jackwener - I also played around with posgres and this seems consistent with what it does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants