Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AvroReader: Add union resolving for nested struct arrays #12686

Merged
merged 6 commits into from
Oct 2, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -573,7 +573,7 @@ impl<'a, R: Read> AvroArrowArrayReader<'a, R> {
// extract list values, with non-lists converted to Value::Null
let array_item_count = rows
.iter()
.map(|row| match row {
.map(|row| match maybe_resolve_union(row) {
Value::Array(values) => values.len(),
_ => 1,
})
Expand Down Expand Up @@ -1643,6 +1643,93 @@ mod test {
assert_batches_eq!(expected, &[batch]);
}

#[test]
fn test_avro_nullable_struct_array() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this test without the changes in this PR and it still passes 🤔 Thus I don't think it covers the code change

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ git diff
diff --git a/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs b/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs
index 24bc4a61c..8735eac5f 100644
--- a/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs
+++ b/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs
@@ -573,7 +573,7 @@ impl<'a, R: Read> AvroArrowArrayReader<'a, R> {
                 // extract list values, with non-lists converted to Value::Null
                 let array_item_count = rows
                     .iter()
-                    .map(|row| match maybe_resolve_union(row) {
+                    .map(|row| match row {
                         Value::Array(values) => values.len(),
                         _ => 1,
                     })

And then I ran

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo test --lib  -p datafusion --all-features -- avro_to_arrow
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.12s
     Running unittests src/lib.rs (target/debug/deps/datafusion-3bc79b97e7a0ad1b)

running 16 tests
test datasource::avro_to_arrow::schema::test::test_invalid_avro_schema ... ok
test datasource::avro_to_arrow::schema::test::test_non_record_schema ... ok
test datasource::avro_to_arrow::schema::test::test_alias ... ok
test datasource::avro_to_arrow::schema::test::test_plain_types_schema ... ok
test datasource::avro_to_arrow::schema::test::test_external_props ... ok
test datasource::avro_to_arrow::schema::test::test_nested_schema ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_avro_read_nested_list ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_time_avro_milliseconds ... ok
test datasource::avro_to_arrow::reader::tests::test_avro_basic ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_avro_read_list ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_avro_iterator ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_avro_nullable_struct ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_avro_nullable_struct_array ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_complex_list ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_deep_nullable_struct ... ok
test datasource::avro_to_arrow::arrow_array_reader::test::test_complex_struct ... ok

test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 729 filtered out; finished in 0.00s

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment. Yes the test rather checks the happy path. I will add a test for the index error tomorrow and update the formating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the test. It now fails without the change:

thread 'datasource::avro_to_arrow::arrow_array_reader::test::test_avro_nullable_struct_array' panicked at /Users/JONSchmi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-52.2.0/src/util/bit_util.rs:55:5:
index out of bounds: the len is 1 but the index is 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- I double checked and it does fail!

let schema = apache_avro::Schema::parse_str(
r#"
{
"type": "record",
"name": "r1",
"fields": [
{
"name": "col1",
"type": [
"null",
{
"type": "array",
"items": {
"type": [
"null",
{
"type": "record",
"name": "Item",
"fields": [
{
"name": "id",
"type": "long"
}
]
}
]
}
}
],
"default": null
}
]
}"#,
)
.unwrap();
let jv1 = serde_json::json!({
"col1": [
{
"id": 234
},
{
"id": 345
}
]
});
let r1 = apache_avro::to_value(jv1)
.unwrap()
.resolve(&schema)
.unwrap();
let r2 = apache_avro::to_value(serde_json::json!({ "col1": null }))
.unwrap()
.resolve(&schema)
.unwrap();

let mut w = apache_avro::Writer::new(&schema, vec![]);
for _i in 0..5 {
w.append(r1.clone()).unwrap();
}
w.append(r2).unwrap();
let bytes = w.into_inner().unwrap();

let mut reader = ReaderBuilder::new()
.read_schema()
.with_batch_size(20)
.build(std::io::Cursor::new(bytes))
.unwrap();
let batch = reader.next().unwrap().unwrap();
assert_eq!(batch.num_rows(), 6);
assert_eq!(batch.num_columns(), 1);

let expected = [
"+------------------------+",
"| col1 |",
"+------------------------+",
"| [{id: 234}, {id: 345}] |",
"| [{id: 234}, {id: 345}] |",
"| [{id: 234}, {id: 345}] |",
"| [{id: 234}, {id: 345}] |",
"| [{id: 234}, {id: 345}] |",
"| |",
"+------------------------+",
];
assert_batches_eq!(expected, &[batch]);
}

#[test]
fn test_avro_iterator() {
let reader = build_reader("alltypes_plain.avro", 5);
Expand Down
Loading