Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement posgres style encode/decode #6821

Merged
merged 12 commits into from
Jul 5, 2023

Conversation

ozgrakkurt
Copy link
Contributor

@ozgrakkurt ozgrakkurt commented Jul 1, 2023

closes #3125

Need help with test failure, couldn't understand what it means:
thread 'sql::expr::test_encoding_expressions' panicked at 'called `Result::unwrap()` on an `Err` value: Context("simplify_expressions", Internal("Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: \"encode(Utf8(\\\"tom\\\"),Utf8(\\\"base64\\\"))\", data_type: Binary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }, new schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: \"encode(Utf8(\\\"tom\\\"),Utf8(\\\"base64\\\"))\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }"))', datafusion/core/tests/sql/mod.rs:556:38

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Jul 1, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you thank you @ozgrakkurt

Something is not quite right

I tried playing around with these functions in datafusion-cli and I got some strange results

Like I expect this to return 'foo' but it did not (I see one of your test uses arrow_cast maybe?):

❯ select decode(encode('foo', 'hex'), 'hex');
+-----------------------------------------------------+
| decode(encode(Utf8("foo"),Utf8("hex")),Utf8("hex")) |
+-----------------------------------------------------+
| 666f6f                                              |
+-----------------------------------------------------+
1 row in set. Query took 0.003 seconds.

I also don't think the code that takes ColumnarValue::Array is covered by tests

Thus I think this PR needs more tests, ideally via sqllogictests -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt

I think that would make the behavior of these functions clearer.

Cases needing coverage

  1. Scalar input (of the string/largestring/binary/largebinary) -- real values, empty '' string, and NULL
  2. table input for each of the types (maybe a single table with different columns?
  3. both hex and base64 arguments

Need update to the feature flags

Note that when I initially tried to test it out, I got the following error (I think because the main datafusion feature flags were not updated)

DataFusion CLI v27.0.0
❯ select encode('foo', 'base64');
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: function encode requires compilation with feature flag: encoding_expressions.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I had to add this to Cargo:

diff --git a/datafusion/core/Cargo.toml b/datafusion/core/Cargo.toml
index 468219470..4e0e2fde5 100644
--- a/datafusion/core/Cargo.toml
+++ b/datafusion/core/Cargo.toml
@@ -38,10 +38,11 @@ path = "src/lib.rs"
 avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
 compression = ["xz2", "bzip2", "flate2", "zstd", "async-compression"]
 crypto_expressions = ["datafusion-physical-expr/crypto_expressions", "datafusion-optimizer/crypto_expressions"]
-default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "compression"]
+default = ["crypto_expressions", "encoding__expressions", "regex_expressions", "unicode_expressions", "compression"]
 # Enables support for non-scalar, binary operations on dictionaries
 # Note: this results in significant additional codegen
 dictionary_expressions = ["datafusion-physical-expr/dictionary_expressions", "datafusion-optimizer/dictionary_expressions"]
+encoding__expressions = ["datafusion-physical-expr/encoding_expressions"]
 # Used for testing ONLY: causes all values to hash to the same value (test for collisions)
 force_hash_collisions = []
 pyarrow = ["datafusion-common/pyarrow"]

Once I did that and tested it out it worked great:

❯ create table foo as values ('foo', 'base64');
0 rows in set. Query took 0.002 seconds.
❯ select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo     | base64  |
+---------+---------+
1 row in set. Query took 0.002 seconds.
❯ select encode(column1, 'base64') from foo;
+------------------------------------+
| encode(foo.column1,Utf8("base64")) |
+------------------------------------+
| Zm9v                               |
+------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select encode(column1, column2) from foo;
Internal error: Encode using dynamically decided method is not yet supported. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
❯ select encode(column1, 'hex') from foo;
+---------------------------------+
| encode(foo.column1,Utf8("hex")) |
+---------------------------------+
| 666f6f                          |
+---------------------------------+

Comment on lines 309 to 311
ColumnarValue::Array(_) => Err(DataFusionError::Internal(
"Encode using dynamically decided method is not yet supported".into(),
)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given a user can cause this error I don't think it is internal (aka it isn't a bug in datafusion, which is what internal errors are for).

Also I think it would help to make the error a bit more user friendly

Suggested change
ColumnarValue::Array(_) => Err(DataFusionError::Internal(
"Encode using dynamically decided method is not yet supported".into(),
)),
ColumnarValue::Array(_) => Err(DataFusionError::NotYetImplemented(
"Second argument to encode must be a constant: Encode using dynamically decided method is not yet supported".into(),
)),

"Unsupported data type {other:?} for function decode",
))),
},
ColumnarValue::Array(_) => Err(DataFusionError::Internal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this should be a different error and have a nicer message

@ozgrakkurt
Copy link
Contributor Author

ozgrakkurt commented Jul 2, 2023

666f6f

This is probably because decode returns binary. It has to return binary since that value isn't guaranteed to be valid utf8 and encode always returns utf8 since the output is guaranteed to be utf8 and it can be converted to binary if wanted.

Working on the error and test issues

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 2, 2023
@ozgrakkurt
Copy link
Contributor Author

ozgrakkurt commented Jul 2, 2023

@alamb
All done except I couldn't find how to force a value to be largebinary or largeutf8 in sqllogictest, can you help me with this?

edit: I forgot the errors, doing it now

@ozgrakkurt
Copy link
Contributor Author

@alamb I think I got everything, can you check?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @ozgrakkurt

I also made a PR to add these functions to the user guide: #6856

@alamb alamb merged commit 4bb54e4 into apache:main Jul 5, 2023
20 checks passed
@ozgrakkurt
Copy link
Contributor Author

Looks great to me -- thank you @ozgrakkurt

I also made a PR to add these functions to the user guide: #6856

no problem, thanks as well!

2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023
* feat: add encode, decode functions

* add test

* add licenses

* fix return types

* delete files

* toml fmt

* add logical expr

* fix NULL case, add test for NULL and empty

* add sqllogic tests

* update error msgs, run cargo update in cli dir

* update sqllogictest

* add more tests
alamb pushed a commit to alamb/datafusion that referenced this pull request Jul 6, 2023
* feat: add encode, decode functions

* add test

* add licenses

* fix return types

* delete files

* toml fmt

* add logical expr

* fix NULL case, add test for NULL and empty

* add sqllogic tests

* update error msgs, run cargo update in cli dir

* update sqllogictest

* add more tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for hex encoding/decoding for Binary in sql queries
2 participants