Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Schema::field_names method #5186

Closed
wants to merge 5 commits into from
Closed

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Dec 7, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

Introducing Schema::field_names method similar to Spark https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html#fieldNames()

The method returns list of column names, including nested fields.

Later we can use this code to improve schema output to be more user-friendly
or create toDDL() method, (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructField.html#toDDL--) etc

What changes are included in this PR?

Bunch of extensions to query recursively fields and schema. Also changed an error message to have more details when schema and data mismatches

Are there any user-facing changes?

No

@comphead comphead requested a review from viirya December 7, 2023 16:56
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Dec 7, 2023
@comphead comphead changed the title Introduce Schema::fieldNames method Introduce Schema::field_names method Dec 7, 2023
@comphead
Copy link
Contributor Author

Hi @tustvold @viirya any plans to review this PR
It would be also very helpful for DF as when DF faces the error like apache/datafusion#8402 (comment)

select rank() over (order by null) from (select 1 a union all select 2 a) q;
Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

it will give more context without debugging

@tustvold
Copy link
Contributor

Sorry it is on my list, but I'm wondering if we want a more general mechanism to walk these fields as opposed to adding lots of utility methods. I will have a play over the coming days

@comphead
Copy link
Contributor Author

Sorry it is on my list, but I'm wondering if we want a more general mechanism to walk these fields as opposed to adding lots of utility methods. I will have a play over the coming days

Thanks @tustvold I reused 1 inner method instead of 2.
Tried to avoid changing pub methods as it can affect downstream. The generalization approach is great, let me know your thoughts whenever you have time


// Recursively concatenate child names
for child in f.nested_fields() {
nested_field_names_inner(child, format!("{}.", current_name), buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of . as a separator makes me uncomfortable, there have been a lot of bugs in the past resulting from things incorrectly treating . as a nesting delimiter...

This is something I'll try to address when I iterate on the visitor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @tustvold
do you prefer another separator or another approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards a different approach, I don't think there is an unambiguous separator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of approach? perhaps I can help here

Copy link
Contributor

@tustvold tustvold Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vague thought was something similar to how we display parquet schema, where it might be something like

struct batch {
   optional int64 column1
    optional struct nested {
        required float64 f64
    }
}

But implemented as some FieldVisitor abstraction so people can easily do something different should they wish to.

I want to take some time to get my thoughts on this together, and will then get back to you

@comphead comphead marked this pull request as draft January 6, 2024 20:04
@tustvold
Copy link
Contributor

tustvold commented Mar 8, 2024

Apologies for the delay, coming back to this and honestly a little confused as to the use-case. The updated error message uses the new field_names method, but the RecordBatch method itself is not concerned with nested fields, and so printing the nested fields is actually just confusing? It should only print the first level of field names, as that is all RecordBatch is concerned with?

@comphead
Copy link
Contributor Author

Apologies for the delay, coming back to this and honestly a little confused as to the use-case. The updated error message uses the new field_names method, but the RecordBatch method itself is not concerned with nested fields, and so printing the nested fields is actually just confusing? It should only print the first level of field names, as that is all RecordBatch is concerned with?

Thanks @tustvold for having the time for this. The PR is planned to make debugging easier and improve the error text which is currently too unclear.

Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

The idea is to show which fields exists, and which are missing in the schema. But if the field is nested we should display it in user-friendly way. You are correct about using dots to separate parent field and child fields, this is not very good idea as dots also used in catalog/schema/table qualifiers.

@tustvold tustvold closed this Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants