-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement posgres style encode
/decode
#6821
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you thank you @ozgrakkurt
Something is not quite right
I tried playing around with these functions in datafusion-cli and I got some strange results
Like I expect this to return 'foo'
but it did not (I see one of your test uses arrow_cast
maybe?):
❯ select decode(encode('foo', 'hex'), 'hex');
+-----------------------------------------------------+
| decode(encode(Utf8("foo"),Utf8("hex")),Utf8("hex")) |
+-----------------------------------------------------+
| 666f6f |
+-----------------------------------------------------+
1 row in set. Query took 0.003 seconds.
I also don't think the code that takes ColumnarValue::Array
is covered by tests
Thus I think this PR needs more tests, ideally via sqllogictests -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt
I think that would make the behavior of these functions clearer.
Cases needing coverage
- Scalar input (of the string/largestring/binary/largebinary) -- real values, empty
''
string, and NULL - table input for each of the types (maybe a single table with different columns?
- both
hex
andbase64
arguments
Need update to the feature flags
Note that when I initially tried to test it out, I got the following error (I think because the main datafusion feature flags were not updated)
DataFusion CLI v27.0.0
❯ select encode('foo', 'base64');
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: function encode requires compilation with feature flag: encoding_expressions.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
❯
I had to add this to Cargo:
diff --git a/datafusion/core/Cargo.toml b/datafusion/core/Cargo.toml
index 468219470..4e0e2fde5 100644
--- a/datafusion/core/Cargo.toml
+++ b/datafusion/core/Cargo.toml
@@ -38,10 +38,11 @@ path = "src/lib.rs"
avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
compression = ["xz2", "bzip2", "flate2", "zstd", "async-compression"]
crypto_expressions = ["datafusion-physical-expr/crypto_expressions", "datafusion-optimizer/crypto_expressions"]
-default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "compression"]
+default = ["crypto_expressions", "encoding__expressions", "regex_expressions", "unicode_expressions", "compression"]
# Enables support for non-scalar, binary operations on dictionaries
# Note: this results in significant additional codegen
dictionary_expressions = ["datafusion-physical-expr/dictionary_expressions", "datafusion-optimizer/dictionary_expressions"]
+encoding__expressions = ["datafusion-physical-expr/encoding_expressions"]
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
force_hash_collisions = []
pyarrow = ["datafusion-common/pyarrow"]
Once I did that and tested it out it worked great:
❯ create table foo as values ('foo', 'base64');
0 rows in set. Query took 0.002 seconds.
❯ select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo | base64 |
+---------+---------+
1 row in set. Query took 0.002 seconds.
❯ select encode(column1, 'base64') from foo;
+------------------------------------+
| encode(foo.column1,Utf8("base64")) |
+------------------------------------+
| Zm9v |
+------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select encode(column1, column2) from foo;
Internal error: Encode using dynamically decided method is not yet supported. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
❯ select encode(column1, 'hex') from foo;
+---------------------------------+
| encode(foo.column1,Utf8("hex")) |
+---------------------------------+
| 666f6f |
+---------------------------------+
ColumnarValue::Array(_) => Err(DataFusionError::Internal( | ||
"Encode using dynamically decided method is not yet supported".into(), | ||
)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given a user can cause this error I don't think it is internal (aka it isn't a bug in datafusion, which is what internal errors are for).
Also I think it would help to make the error a bit more user friendly
ColumnarValue::Array(_) => Err(DataFusionError::Internal( | |
"Encode using dynamically decided method is not yet supported".into(), | |
)), | |
ColumnarValue::Array(_) => Err(DataFusionError::NotYetImplemented( | |
"Second argument to encode must be a constant: Encode using dynamically decided method is not yet supported".into(), | |
)), |
"Unsupported data type {other:?} for function decode", | ||
))), | ||
}, | ||
ColumnarValue::Array(_) => Err(DataFusionError::Internal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this should be a different error and have a nicer message
This is probably because decode returns binary. It has to return binary since that value isn't guaranteed to be valid utf8 and Working on the error and test issues |
@alamb edit: I forgot the errors, doing it now |
@alamb I think I got everything, can you check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @ozgrakkurt
I also made a PR to add these functions to the user guide: #6856
no problem, thanks as well! |
* feat: add encode, decode functions * add test * add licenses * fix return types * delete files * toml fmt * add logical expr * fix NULL case, add test for NULL and empty * add sqllogic tests * update error msgs, run cargo update in cli dir * update sqllogictest * add more tests
* feat: add encode, decode functions * add test * add licenses * fix return types * delete files * toml fmt * add logical expr * fix NULL case, add test for NULL and empty * add sqllogic tests * update error msgs, run cargo update in cli dir * update sqllogictest * add more tests
closes #3125
Need help with test failure, couldn't understand what it means: