Preserve case for RowType's field name and JSON content when CAST #21869

Merged (1 commit) on Apr 16, 2024

Member

@hantangwangd hantangwangd commented Feb 6, 2024

Description

As discussed in #21866 and #21602, add a config property legacy_json_cast to control and support the legacy behavior, which does not preserve the case of field names in JSON when casting to a row type.

The differences in behavior with and without the config property are documented in presto-docs/src/main/sphinx/functions/json.rst.

Impact

Add a new config property legacy_json_cast, whose default value is true, to support the legacy behavior. When legacy_json_cast is set to false in the config files of both the coordinator and the workers, the behavior of casting from JSON to ROW changes.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Add configuration property `legacy_json_cast` whose default value is `true`. See [Properties Reference](http://prestodb.io/docs/current/admin/properties.html#legacy-compatible-properties). 

Contributor

@mbasmanova mbasmanova left a comment

@hantangwangd It is a bit hard to understand what the change is without an example. Would you update the documentation to clarify the behavior with and without the new config property?

https://prestodb.io/docs/0.285/functions/json.html

CC: @aditi-pandit @amitkdutta @spershin Folks, we need to check which behavior is implemented in Velox and whether we need to provide both options.

github-actions bot commented Feb 6, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 6c2bf82...8b13091.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/admin/properties.rst
presto-docs/src/main/sphinx/functions/json.rst

@hantangwangd
Member Author

Hi @mbasmanova, I have added some notes and examples to presto-docs/src/main/sphinx/functions/json.rst, trying to describe the behavior with and without the new config property. Please take a look, thanks.

@@ -71,6 +71,28 @@ Cast from JSON
SELECT CAST(JSON '{"k1": [1, 23], "k2": 456}' AS MAP(VARCHAR, JSON)); -- {k1 = JSON '[1,23]', k2 = JSON '456'}
SELECT CAST(JSON '[null]' AS ARRAY(JSON)); -- [JSON 'null']

.. note::

When casting from ``JSON`` to ``ROW``, by default the case of field names
Contributor

@hantangwangd Thank you for adding this doc. This is helpful.

@@ -300,6 +300,7 @@ public final class SystemSessionProperties
public static final String ADD_PARTIAL_NODE_FOR_ROW_NUMBER_WITH_LIMIT = "add_partial_node_for_row_number_with_limit";
public static final String REWRITE_CASE_TO_MAP_ENABLED = "rewrite_case_to_map_enabled";
public static final String FIELD_NAMES_IN_JSON_CAST_ENABLED = "field_names_in_json_cast_enabled";
public static final String LEGACY_JSON_CAST = "legacy_json_cast";
Contributor

CC: @pranjalssh @tdcmeehan @majetideepak @aditi-pandit @spershin

We may need to add logic to check this session property and fail fast if native workers do not support it. Otherwise, we'll get wrong results when processing queries with Prestissimo.

@mbasmanova
Contributor

> Add a new session property legacy_json_cast whose default value is true to be compatible with existing use cases.

I wonder if it is better to not allow changing this on a per-query basis and rather have this as a cluster-wide config.

In general, session properties are not supposed to affect the results of the query.

@spershin
Contributor

spershin commented Feb 6, 2024

@hantangwangd

What is the use case for this feature?
Why does it need to be added?

@hantangwangd
Member Author

hantangwangd commented Feb 7, 2024

> I wonder if it is better to not allow changing this on a per-query basis and rather have this as a cluster-wide config.

Do you mean it would be better to make it a config property that can only be set in config.properties?

@tdcmeehan
Contributor

> Add a new session property legacy_json_cast whose default value is true to be compatible with existing use cases.

> I wonder if it is better to not allow changing this on a per-query basis and rather have this as a cluster-wide config.

> In general, session properties are not supposed to affect the results of the query.

I am in agreement that this should probably be just a config property. I also want to see if we have alignment that this is a bug fix, that the default value should instead be false, and that we should add a release note indicating the new behavior. In general, I think if we keep it enabled, practically speaking no one will disable it, which means the bug doesn't get fixed. If someone has batch pipelines and would prefer not to regress them, they can simply opt to enable the config. Thoughts?

@tdcmeehan
Contributor

@spershin please see attached documentation and #20701

@hantangwangd
Member Author

> What is the use case for this feature? Why does it need to be added?

Hi @spershin. In some cases, users want to preserve the case of their row type's field names. Take the example provided by @yhwang in issue #20701, where we execute the following statement:

SELECT MAP(
    ARRAY['myFirstRow', 'mySecondRow'],
    ARRAY[
        CAST(ROW('row1FieldValue1', 'row1FieldValue2') AS ROW("firstField" VARCHAR, "secondField" VARCHAR)),
        CAST(ROW('row2FieldValue1', 'row2FieldValue2') AS ROW("firstField" VARCHAR, "secondField" VARCHAR))
    ]) AS mapField;

We hope to get the following result:

{mySecondRow={firstField=row2FieldValue1, secondField=row2FieldValue2}, myFirstRow={firstField=row1FieldValue1, secondField=row1FieldValue2}} 

rather than:

 {mySecondRow={firstfield=row2FieldValue1, secondfield=row2FieldValue2}, myFirstRow={firstfield=row1FieldValue1, secondfield=row1FieldValue2}} 

And when it comes to casting from JSON to ROW, the situation may seem more serious to some users, because it looks as if their original JSON data has been changed. Moreover, in some cases when we execute:

select cast(JSON '{"firstField":"row1FieldValue1","firstfield":"row1FieldValue2"}' as row("firstField" varchar, "firstfield" varchar));

We want to get the following result rather than an exception:

{firstField=row1FieldValue1, firstfield=row1FieldValue2} 

So maybe it's better to give users who really intend to preserve the case a choice to achieve this.
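
The difference between the two matching modes can be illustrated with a minimal Java sketch. It is not Presto's implementation: the class `JsonRowCastSketch`, the `Field` type, and the `cast` method are hypothetical names made up for illustration, and JSON is reduced to a flat map of strings. It assumes the rule stated in this PR's commit message (double quoted field names are case sensitive, unquoted names are case insensitive) and models the legacy mode as lowercasing every field name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of field matching for CAST(JSON AS ROW); not Presto's code.
final class JsonRowCastSketch {
    // A declared ROW field: its name and whether it was double quoted.
    static final class Field {
        final String name;
        final boolean quoted;

        Field(String name, boolean quoted) {
            this.name = name;
            this.quoted = quoted;
        }
    }

    // Maps JSON keys onto row fields. With legacy=true every field name is
    // lowercased and matched case-insensitively; otherwise quoted fields
    // keep their case and must match a JSON key exactly.
    static Map<String, String> cast(Map<String, String> json, List<Field> fields, boolean legacy) {
        Map<String, String> row = new LinkedHashMap<>();
        for (Field field : fields) {
            boolean exact = field.quoted && !legacy;
            String outName = exact ? field.name : field.name.toLowerCase(Locale.ROOT);
            if (row.containsKey(outName)) {
                // Two declared fields collapsed to the same output name.
                throw new IllegalArgumentException("Duplicate field: " + outName);
            }
            String value = null; // a field with no matching key stays null
            for (Map.Entry<String, String> entry : json.entrySet()) {
                String key = entry.getKey();
                if (exact ? key.equals(field.name) : key.equalsIgnoreCase(field.name)) {
                    value = entry.getValue();
                    break;
                }
            }
            row.put(outName, value);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> json = new LinkedHashMap<>();
        json.put("firstField", "row1FieldValue1");
        json.put("firstfield", "row1FieldValue2");
        List<Field> fields = List.of(new Field("firstField", true), new Field("firstfield", true));

        // Case-preserving mode keeps the two fields apart.
        System.out.println(cast(json, fields, false));
        // Legacy mode lowercases both names, so the second one collides.
        try {
            cast(json, fields, true);
        } catch (IllegalArgumentException e) {
            System.out.println("legacy: " + e.getMessage());
        }
    }
}
```

In this sketch the collision in legacy mode stands in for the exception mentioned above; the real error type and message in Presto may differ.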

@hantangwangd hantangwangd force-pushed the preserve_case_when_cast branch 2 times, most recently from 677c9e1 to 5bba02d on February 7, 2024 11:33
@mbasmanova
Contributor

@tdcmeehan

> I also want to see if we have alignment that this is a bug fix, that the default value should instead be false, and that we should add a release note indicating the new behavior. In general, I think if we keep it enabled, practically speaking no one will disable it, which means the bug doesn't get fixed.

I agree that it is nice when default values include latest bug fixes and not legacy behavior. At the same time, it is important to avoid regressions to existing users as well.

> If someone has batch pipelines and would prefer not to regress them, they can simply opt to enable the config.

I would argue that no one wants to introduce regressions, and enabling the config is not as easy as it sounds. Folks need to have a super easy way to learn that they need to modify configs when upgrading to a new release. I wonder whether we already have a section at the top of the release notes that lists all the breaking changes, and whether it is possible to easily access a combined list of breaking changes between any two releases.

Perhaps we could introduce a single property that says "Do not break me". When this property is set, the defaults for breaking configs are set to match legacy behavior. This would allow existing users to set a single property once and not worry about a new release breaking them. At the same time, new users will not have this config set and will get the latest and greatest.
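
A minimal sketch of how such a single switch could resolve defaults, in Java. Everything here is an assumption for illustration: the class `CompatibilityDefaultsSketch`, the `resolve` method, and the property name `preserve-legacy-behavior` do not exist in Presto. Only `legacy-json-cast` comes from this PR.

```java
import java.util.Map;

// Hypothetical sketch: neither this class nor the master switch exists in Presto.
final class CompatibilityDefaultsSketch {
    // Assumed name for the single "do not break me" switch.
    static final String PRESERVE_LEGACY = "preserve-legacy-behavior";

    // Resolves a boolean config whose default changed between releases.
    // An explicit setting always wins; otherwise the master switch picks
    // between the legacy default and the current default.
    static boolean resolve(Map<String, String> config, String property,
                           boolean legacyDefault, boolean currentDefault) {
        String explicit = config.get(property);
        if (explicit != null) {
            return Boolean.parseBoolean(explicit);
        }
        // parseBoolean(null) is false, so an unset switch means "current defaults".
        boolean preserveLegacy = Boolean.parseBoolean(config.get(PRESERVE_LEGACY));
        return preserveLegacy ? legacyDefault : currentDefault;
    }

    public static void main(String[] args) {
        // Existing deployment: the single switch keeps the legacy default.
        System.out.println(resolve(Map.of(PRESERVE_LEGACY, "true"), "legacy-json-cast", true, false));
        // New deployment: nothing is set, so the current default applies.
        System.out.println(resolve(Map.of(), "legacy-json-cast", true, false));
    }
}
```

Letting an explicit per-property setting override the switch is a design choice of this sketch, so that a user who opted into legacy defaults can still adopt individual fixes one at a time.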

> I also want to see if we have alignment that this is a bug fix

This is something I'm not sure about and would like to get more clarity. In JSON, keys are case sensitive, right? But in SQL, column names are not? Or is it that column names are not case sensitive, but struct fields are? Or is it that both column names and struct fields are case insensitive unless they are in double quotes?

How do we ensure that field names in RowType preserve their case throughout the query execution? Do we rely on the planner to sort things out and refer to struct fields by index in the query plan and not by name?

Do we need any support from execution (Prestissimo) for this functionality?

Contributor

@steveburnett steveburnett left a comment

Thank you for the documentation! I made a few suggestions, some nits, for improving the readability. If you have questions about my suggestions, let me know and we'll figure it out together. Thanks!

@hantangwangd
Member Author

@steveburnett Thanks for your suggestions. They have been fixed, please take a look when convenient.

@hantangwangd
Member Author

Hi @tdcmeehan @mbasmanova, following your guidance, legacy_json_cast is now a config property only and can no longer be set through a session property. The default value has been changed to false, and the documentation and test cases have been updated accordingly. Please take a look, thanks!

@hantangwangd
Member Author

Hi @rschlussel @tdcmeehan, I have dropped the deprecated test methods and added a new exception, InvalidTypeDefinitionException, to avoid throwing a verify failure. Please take a look when available, thanks a lot!

@hantangwangd hantangwangd force-pushed the preserve_case_when_cast branch 2 times, most recently from 3453667 to a7a560a on April 11, 2024 16:37

Contributor

@rschlussel rschlussel left a comment

question about the default config value, and a small comment, but generally the change looks good. Also, I feel this needs a highlight in the release notes and not just a regular release note, as it is a behavior change that could cause query failures.

@hantangwangd
Member Author

> question about the default config value, and a small comment, but generally the change looks good. Also, I feel this needs a highlight in the release notes and not just a regular release note, as it is a behavior change that could cause query failures.

@rschlussel thanks for your review! I'm open on the default config value, as it affects the behavior users currently see. @tdcmeehan What is your opinion?

A highlight in the release notes sounds great!

@tdcmeehan
Contributor

We can target this PR for 0.287 with it disabled by default, then change the default in advance of the cut for 0.288.

@hantangwangd hantangwangd force-pushed the preserve_case_when_cast branch 2 times, most recently from 534a263 to 3866468 on April 14, 2024 05:28
@hantangwangd
Member Author

@rschlussel @tdcmeehan @steveburnett I have changed the default value of the config property legacy-json-cast to true for now, and modified the docs accordingly. Please take a look when convenient, thanks!

rschlussel previously approved these changes Apr 15, 2024

Contributor

@steveburnett steveburnett left a comment

Nice work on the docs! I made some minor suggestions and nits about formatting and phrasing. Let me know what you think!

* **Type:** ``boolean``
* **Default value:** ``true``

When casting from ``JSON`` to ``ROW``, by default ignore the case of field names in RowType when casting from ``JSON`` to ``ROW`` (for legacy support), the matching would be always case-insensitive.
Contributor

Suggested change
- When casting from ``JSON`` to ``ROW``, by default ignore the case of field names in RowType when casting from ``JSON`` to ``ROW`` (for legacy support), the matching would be always case-insensitive.
+ When casting from ``JSON`` to ``ROW``, ignore the case of field names in ``RowType`` when casting from ``JSON`` to ``ROW`` for legacy support so that the matching is case-insensitive.

Member Author

I have removed the second 'when casting from JSON to ROW'. It seems a little redundant; is that OK?

@hantangwangd
Member Author

@steveburnett Thanks for your detailed suggestions. They have all been applied, with only one minor modification. Please take a look when available.

Contributor

@steveburnett steveburnett left a comment

Thanks for the quick revision! One tiny nit I should have caught earlier, but everything else looks great.

Preserve the case of double quoted field names in RowType. When casting
from ``JSON`` to ``ROW``, treat the double quoted field name as case
sensitive and unquoted field name as case insensitive in field
matching.

Add a config property `legacy_json_cast` to control and support the
legacy behavior, which does not enforce case when matching.
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local build of docs, everything looks good. Thanks!

Contributor

@tdcmeehan tdcmeehan left a comment

LGTM % the error handling can be cleaned up.

@@ -171,6 +173,10 @@ private static ErrorCode toErrorCode(Throwable throwable)
return SLICE_TOO_LARGE.toErrorCode();
}

if (throwable instanceof InvalidTypeDefinitionException) {
Contributor

Rather than catch and redefine, let's just directly throw a PrestoException with INVALID_TYPE_DEFINITION.

Contributor

you can't throw a PrestoException from presto-common, as it's defined in presto-spi

Contributor

Ah, missed that.
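
The pattern discussed in this thread, where a low-level module throws its own exception and a higher layer translates it into an error code, can be sketched in Java as follows. The types below are simplified stand-ins written for illustration, not the real Presto classes: `ErrorTranslationSketch` and the two-value `ErrorCode` enum are hypothetical, and only the names `InvalidTypeDefinitionException`, `INVALID_TYPE_DEFINITION`, and `toErrorCode` are taken from the diff and comments above.

```java
// Hypothetical sketch: simplified stand-ins, not the real Presto classes.
final class ErrorTranslationSketch {
    // Stand-in for an exception defined in a low-level module that cannot
    // depend on the module where the engine-wide exception type lives.
    static final class InvalidTypeDefinitionException extends RuntimeException {
        InvalidTypeDefinitionException(String message) {
            super(message);
        }
    }

    // Stand-in for the engine's error codes.
    enum ErrorCode { INVALID_TYPE_DEFINITION, GENERIC_INTERNAL_ERROR }

    // Translates at the boundary: the low-level exception becomes a
    // user-facing error code, anything unrecognized stays an internal error.
    static ErrorCode toErrorCode(Throwable throwable) {
        if (throwable instanceof InvalidTypeDefinitionException) {
            return ErrorCode.INVALID_TYPE_DEFINITION;
        }
        return ErrorCode.GENERIC_INTERNAL_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(toErrorCode(new InvalidTypeDefinitionException("duplicate field name")));
        System.out.println(toErrorCode(new IllegalStateException("unexpected")));
    }
}
```

The translation step keeps the dependency direction intact: the low-level module only needs a plain `RuntimeException` subclass, and the mapping to an error code lives in the layer that already knows about error codes.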
