Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Schema::field_names method #5186

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion arrow-array/src/record_batch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -149,9 +149,10 @@ impl RecordBatch {
// check that number of fields in schema match column length
if schema.fields().len() != columns.len() {
return Err(ArrowError::InvalidArgumentError(format!(
"number of columns({}) must match number of fields({}) in schema",
"Mismatch between columns [{}] and schema fields [{}].\nKnown schema fields[{}]",
columns.len(),
schema.fields().len(),
schema.field_names(false).join(","),
)));
}

Expand Down
2 changes: 1 addition & 1 deletion arrow-flight/tests/encode_decode.rs
Original file line number Diff line number Diff line change
Expand Up @@ -287,7 +287,7 @@ async fn test_mismatched_record_batch_schema() {
let err = result.unwrap_err();
assert_eq!(
err.to_string(),
"Arrow(InvalidArgumentError(\"number of columns(1) must match number of fields(2) in schema\"))"
"Arrow(InvalidArgumentError(\"Mismatch between columns [1] and schema fields [2].\\nKnown schema fields[i,f]\"))"
);
}

Expand Down
70 changes: 65 additions & 5 deletions arrow-schema/src/field.rs
Original file line number Diff line number Diff line change
Expand Up @@ -328,20 +328,80 @@ impl Field {
/// within `self` contained within this field (including `self`)
pub(crate) fn fields(&self) -> Vec<&Field> {
let mut collected_fields = vec![self];
collected_fields.append(&mut Field::_fields(&self.data_type));
collected_fields.append(&mut Field::_fields(&self.data_type, true));

collected_fields
}

fn _fields(dt: &DataType) -> Vec<&Field> {
/// Returns a [`Vec`] direct children [`Field`]s
/// within `self`
pub(crate) fn nested_fields(&self) -> Vec<&Field> {
Field::_fields(&self.data_type, false)
}

/// Return self and direct children field names of the [`Field`]
///
/// ```
/// # use arrow_schema::*;
/// let field = Field::new("nested",
/// DataType::Struct(
/// Fields::from(
/// vec![
/// Field::new("inner",
/// DataType::Struct(
/// Fields::from(
/// vec![
/// Field::new("a", DataType::Int32, true)
/// ])), true)])), true
/// );
///
/// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
/// ```
pub fn children_names(&self) -> Vec<String> {
fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
let current_name = format!("{}{}", parent_name, f.name());

// Push the concatenated name to the result vector
buffer.push(current_name.clone());

// Recursively concatenate child names
for child in f.nested_fields() {
nested_field_names_inner(child, format!("{}.", current_name), buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of . as a separator makes me uncomfortable, there have been a lot of bugs in the past resulting from things incorrectly treating . as a nesting delimiter...

This is something I'll try to address when I iterate on the visitor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @tustvold
do you prefer another separator or another approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards a different approach, I don't think there is an unambiguous separator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of approach? perhaps I can help here

Copy link
Contributor

@tustvold tustvold Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vague thought was something similar to how we display parquet schema, where it might be something like

struct batch {
   optional int64 column1
    optional struct nested {
        required float64 f64
    }
}

But implemented as some FieldVisitor abstraction so people can easily do something different should they wish to.

I want to take some time to get my thoughts on this together, and will then get back to you

}
}

if !&self.data_type().is_nested() {
vec![]
} else {
let mut result: Vec<String> = Vec::new();
nested_field_names_inner(self, "".to_string(), &mut result);
result
}
}

// Return inner fields
// flatten - if inner fields needs to be flattened
fn _fields(dt: &DataType, flatten: bool) -> Vec<&Field> {
match dt {
DataType::Struct(fields) => fields.iter().flat_map(|f| f.fields()).collect(),
DataType::Union(fields, _) => fields.iter().flat_map(|(_, f)| f.fields()).collect(),
DataType::Struct(fields) => {
if flatten {
fields.iter().flat_map(|f| f.fields()).collect()
} else {
fields.iter().map(|f| f.as_ref()).collect()
}
}
DataType::Union(fields, _) => {
if flatten {
fields.iter().flat_map(|(_, f)| f.fields()).collect()
} else {
fields.iter().map(|f| f.1.as_ref()).collect()
}
}
DataType::List(field)
| DataType::LargeList(field)
| DataType::FixedSizeList(field, _)
| DataType::Map(field, _) => field.fields(),
DataType::Dictionary(_, value_field) => Field::_fields(value_field.as_ref()),
DataType::Dictionary(_, value_field) => Field::_fields(value_field.as_ref(), flatten),
_ => vec![],
}
}
Expand Down
48 changes: 48 additions & 0 deletions arrow-schema/src/schema.rs
Original file line number Diff line number Diff line change
Expand Up @@ -408,6 +408,29 @@ impl Schema {
pub fn remove(&mut self, index: usize) -> FieldRef {
self.fields.remove(index)
}

/// Returns a field names from schema with references to fields
/// * `nested` - include nested fields.
#[inline]
pub fn field_names(&self, nested: bool) -> Vec<String> {
if nested {
self.fields
.iter()
.flat_map(|f| {
if f.data_type().is_nested() {
f.children_names()
} else {
vec![f.name().to_string()]
}
})
.collect::<Vec<_>>()
} else {
self.fields
.iter()
.map(|f| f.name().clone())
.collect::<Vec<_>>()
}
}
}

impl fmt::Display for Schema {
Expand Down Expand Up @@ -957,4 +980,29 @@ mod tests {
assert_eq!(out.metadata["k"], "v");
assert_eq!(out.metadata["key"], "value");
}

#[test]
fn test_schema_field_names() {
use crate::Field;
let mut builder = SchemaBuilder::new();
builder.push(Field::new("a", DataType::Int32, false));
builder.push(Field::new("b", DataType::Utf8, false));
builder.push(Field::new(
"nested",
DataType::Struct(Fields::from(vec![Field::new(
"inner",
DataType::Struct(Fields::from(vec![Field::new("a", DataType::Int32, true)])),
true,
)])),
true,
));

let schema = builder.finish();
assert_eq!(schema.field_names(false), vec!["a", "b", "nested"]);

assert_eq!(
schema.field_names(true),
vec!["a", "b", "nested", "nested.inner", "nested.inner.a"]
);
}
}
Loading