SNOW-1374896 unify structured types string representation #1882

Open
wants to merge 34 commits into
base: master

Contributor

@sfc-gh-mkubik sfc-gh-mkubik commented Sep 2, 2024

Overview

SNOW-1374896

Build string representations of Snowflake structured types recursively, reusing the existing converter design for specific logical types (e.g. timestamps and binary).

This change replaces the existing structured type converters, which called the native getObject method, with an implementation that reads the field vectors inside the structured type and runs the appropriate converter on each nested type. The Array, Map and Struct converters are changed, helper methods are added to the ArrowVectorConverter interface, and new ArrowStringRepresentationBuilder classes abstract away the logic of building a string out of an Arrow structured type.


Follow ups:

  • pretty print - currently the builders don't add new lines or tabs to the string representation, which I think keeps the code more readable. The downside is some divergence between ARROW and JSON (which is pretty printed). A potential solution is to add a setting that enables pretty printing and to convert the string once it is built (to avoid passing the depth to recursive toString calls).
  • a recursive call of the ARROW converters returns null where JSON returns undefined. This is also a divergence, but not necessarily something to fix, as ARROW's null seems more reasonable.
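The pretty-print follow-up (formatting the compact string after it is built, so no depth has to be threaded through the recursive toString calls) could look roughly like the sketch below. This is only an illustration: `CompactJsonPrettyPrinter` and `prettyPrint` are hypothetical names that do not appear in the PR, the two-space indent is an assumption, and the sketch does not try to match the JSON result format's number formatting.

```java
/** Hypothetical post-processing step: indents an already built compact JSON-like string. */
final class CompactJsonPrettyPrinter {

  static String prettyPrint(String compact) {
    StringBuilder out = new StringBuilder();
    int depth = 0;
    boolean inString = false;
    for (int i = 0; i < compact.length(); i++) {
      char c = compact.charAt(i);
      if (inString) {
        out.append(c);
        if (c == '\\' && i + 1 < compact.length()) {
          out.append(compact.charAt(++i)); // copy the escaped character verbatim
        } else if (c == '"') {
          inString = false; // closing quote ends the string literal
        }
        continue;
      }
      switch (c) {
        case '"':
          inString = true;
          out.append(c);
          break;
        case '[':
        case '{':
          out.append(c);
          depth++;
          newLine(out, depth);
          break;
        case ']':
        case '}':
          depth--;
          newLine(out, depth);
          out.append(c);
          break;
        case ',':
          out.append(c);
          newLine(out, depth);
          break;
        case ':':
          out.append(": ");
          break;
        default:
          out.append(c);
      }
    }
    return out.toString();
  }

  /** Starts a new line indented by two spaces per nesting level (assumed indent width). */
  private static void newLine(StringBuilder out, int depth) {
    out.append('\n');
    for (int i = 0; i < depth; i++) {
      out.append("  ");
    }
  }
}
```

Because the depth is tracked while scanning the finished string, the builders themselves can stay unaware of formatting. Commas and brackets inside quoted strings are left untouched.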

example for SELECT [12, 10, 5, NULL]::ARRAY(DOUBLE)

JSON                                                         | ARROW
[                                                            | [12.0,10.0,5.0,null]
  1.200000000000000e+01,
  1.000000000000000e+01,
  5.000000000000000e+00,
  undefined
]
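The compact ARROW output in the example can be reproduced with a small recursive formatter. The sketch below is a minimal illustration rather than the PR's code: `CompactStructuredFormatter` and `format` are hypothetical names, and plain Java lists stand in for Arrow vectors and their per-type converters.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: builds a compact JSON-like string from nested Java values. */
final class CompactStructuredFormatter {

  /** Recursively formats a value: lists become [..], strings are quoted, null stays null. */
  static String format(Object value) {
    if (value == null) {
      return "null"; // ARROW-style null rather than JSON's "undefined"
    }
    if (value instanceof List) {
      StringBuilder sb = new StringBuilder("[");
      boolean first = true;
      for (Object element : (List<?>) value) {
        if (!first) {
          sb.append(',');
        }
        sb.append(format(element)); // recurse into nested elements
        first = false;
      }
      return sb.append(']').toString();
    }
    if (value instanceof String) {
      return "\"" + value + "\""; // textual values are quoted (no escaping in this sketch)
    }
    return String.valueOf(value); // numbers and booleans are emitted unquoted
  }

  public static void main(String[] args) {
    // Mirrors SELECT [12, 10, 5, NULL]::ARRAY(DOUBLE)
    System.out.println(format(Arrays.asList(12.0, 10.0, 5.0, null)));
  }
}
```

Nested lists go through the same `format` call, which is the recursive idea the overview describes for structured types.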

Pre-review self checklist

  • PR branch is updated with all the changes from master branch
  • The code is correctly formatted (run mvn -P check-style validate)
  • New public API is not unnecessarily exposed (run mvn verify and inspect target/japicmp/japicmp.html)
  • The pull request name is prefixed with SNOW-XXXX:
  • Code is in compliance with internal logging requirements

@sfc-gh-mkubik sfc-gh-mkubik requested a review from a team as a code owner September 2, 2024 07:49
Collaborator

@sfc-gh-astachowski sfc-gh-astachowski left a comment

It might be worth adding a custom toString for vectors, in case vector ever accepts element types other than int and float.

@sfc-gh-mkubik
Contributor Author

expected:<[FALSE]> but was:<[false]>

It seems I haven't changed all occurrences of uppercase booleans in the tests; I will fix this in the next commit.

Base automatically changed from init-converters-refactor to master September 2, 2024 14:01
sfc-gh-mkubik and others added 9 commits September 4, 2024 13:06
Move prefix and suffix configuration to the constructor of base builder, remove unnecessary comments, extract shouldQuote check to a super method, make valueType a constructor parameter for Array toString builder, fix tests failing due to the lowercase booleans
Add helper ArrowStringRepresentationBuilders that take care of converting recursive toString results into a valid json, taking logical type into account. Extract fetching logical type from field metadata to a separate static function, change boolean string representations to lowercase, add tests.
Move prefix and suffix configuration to the constructor of base builder, remove unnecessary comments, extract shouldQuote check to a super method, make valueType a constructor parameter for Array toString builder, fix tests failing due to the lowercase booleans
@sfc-gh-pbulawa sfc-gh-pbulawa dismissed their stale review September 4, 2024 13:04

Comments to be addressed

@@ -21,6 +24,25 @@ public Object toObject(int index) throws SFException {

  @Override
  public String toString(int index) throws SFException {
-   return vector.getObject(index).toString();
+   FieldVector vectorUnpacked = vector.getChildrenFromFields().get(0);

Contributor

Are we sure that there must be at least one child inside? Is get(0) safe? Is it checked somewhere before?

Collaborator

It's a given for any ListVector

FieldVector vectorUnpacked = vector.getChildrenFromFields().get(0);

FieldVector keys = vectorUnpacked.getChildrenFromFields().get(0);
FieldVector values = vectorUnpacked.getChildrenFromFields().get(1);

Contributor

Should we verify here that the children set contains key-children and value-children?

Collaborator

I believe this should always work for a map vector, but I'll verify it for an empty one

Collaborator

Ok, so there might exist an object of MapVector class that does not have these children, but it seems to be a very weird case. We could either try and verify that it won't happen here (which is probably the case), or simply add a check just to be extra safe.

Collaborator

After verification, this shouldn't be empty if used properly, so we are good

Contributor Author

According to the MapVector docs it seems that we're good, as we're checking isSet and the docs state:

The MapVector is nullable, but if a map is set at a given index, there must be an entry.

}

for (int i = vector.getElementStartIndex(index); i < vector.getElementEndIndex(index); i++) {
builder.appendKeyValue(

Contributor

What about keyLogicalType? I know the key can currently only be a String, but this will have to change in the future because the database could also return Integer keys.

Contributor Author

The logical type is used only to decide whether the value should be quoted. The output string is JSON-like, so the key is always quoted, even for Integers.
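To make the quoting rule concrete, here is a minimal sketch of a map builder in which the key is always quoted and only the value's logical type drives the quoting decision. Everything here is illustrative: `MapStringSketch`, its `LogicalType` enum, the `QUOTED` set and this `appendKeyValue` signature are assumptions, not the PR's actual classes or the driver's real type enum.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a JSON-like map string builder. */
final class MapStringSketch {

  /** Hypothetical subset of logical types; the real driver has its own type enum. */
  enum LogicalType { TEXT, DATE, FIXED, REAL, BOOLEAN }

  /** Assumption: textual and date-like values are quoted, numbers and booleans are not. */
  private static final Set<LogicalType> QUOTED = EnumSet.of(LogicalType.TEXT, LogicalType.DATE);

  private final StringBuilder sb = new StringBuilder("{");
  private final LogicalType valueType;
  private boolean first = true;

  MapStringSketch(LogicalType valueType) {
    this.valueType = valueType;
  }

  /** Appends "key":value; the key is always quoted, the value only if its type requires it. */
  MapStringSketch appendKeyValue(String key, String value) {
    if (!first) {
      sb.append(',');
    }
    first = false;
    sb.append('"').append(key).append("\":");
    boolean quote = value != null && QUOTED.contains(valueType);
    sb.append(quote ? "\"" + value + "\"" : String.valueOf(value)); // null is emitted unquoted
    return this;
  }

  @Override
  public String toString() {
    return sb + "}";
  }
}
```

With integer keys rendered as strings, `new MapStringSketch(MapStringSketch.LogicalType.FIXED).appendKeyValue("1", "10")` yields a quoted key and an unquoted value, which is why no separate key logical type is needed for this decision.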

}

public ArrowStringRepresentationBuilderBase appendValue(String value) {
addCommaIfNeeded();

Contributor

Did you weigh the pros and cons of using StringJoiner?

Contributor Author

I only considered the two extremes of building the string manually (chosen) and using an abstraction like JSONObject (rejected), and didn't consider StringJoiner, which is an option in between, so I'll take a look at it as well
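For comparison, the StringJoiner option raised in this thread might look like the sketch below. `java.util.StringJoiner` takes the delimiter, prefix and suffix in its constructor, so the manual comma bookkeeping disappears. `JoinerArraySketch` and `toArrayString` are hypothetical names, and the elements are assumed to be already rendered (and quoted where needed) by the nested converters.

```java
import java.util.StringJoiner;

/** Hypothetical sketch: joins already rendered element strings into an array string. */
final class JoinerArraySketch {

  static String toArrayString(String[] renderedElements) {
    // Delimiter, prefix and suffix replace the manual addCommaIfNeeded-style logic.
    StringJoiner joiner = new StringJoiner(",", "[", "]");
    for (String element : renderedElements) {
      joiner.add(String.valueOf(element)); // a null element becomes the literal "null"
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    System.out.println(toArrayString(new String[] {"12.0", "10.0", "5.0", null}));
  }
}
```

One trade-off worth weighing: StringJoiner handles separators and the empty case for free, while a hand-written builder keeps full control over per-element decisions such as quoting.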

import org.junit.AfterClass;
import org.junit.BeforeClass;

public abstract class BaseWiremockTest {

Contributor

This class is out of scope, right?

Contributor Author

Yes, it appeared here after the rebase for some reason, but I believe that change should already be merged


@RunWith(Parameterized.class)
@Category(TestCategoryResultSet.class)
public class StructuredTypesGetStringArrowJsonCompatibilityIT

Contributor

Can we add a test for the AllTypesClass structure?

Contributor Author

Sure, will add


4 participants