Nested list levels calculated incorrectly if list has 0 length element #282

Closed
nevi-me opened this issue May 11, 2021 · 2 comments
Labels: bug, development-process (Related to development process of arrow-rs), parquet (Changes to the parquet crate)

nevi-me commented May 11, 2021

Describe the bug

First documented in #270 (comment).

Writing certain combinations of nested Arrow data to Parquet triggers a bounds error in the level calculations.
The most likely cause is that we are not correctly distinguishing an empty list slot from a null list slot, because the error is raised around the logic that handles this distinction.
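
To make the empty-versus-null distinction concrete, here is a minimal sketch of how definition and repetition levels could be derived for a single nullable list of nullable integers. This is not the arrow-rs implementation: the function name `list_levels` and the simplified one-level schema are assumptions made purely for illustration. It assumes the standard three-level Parquet list layout, where definition level 0 means a null list, 1 an empty list, 2 a null element, and 3 a present element.

```rust
/// Hypothetical helper (not part of arrow-rs): computes
/// (definition level, repetition level) pairs for a nullable list of
/// nullable i32 values, assuming a maximum definition level of 3.
fn list_levels(rows: &[Option<Vec<Option<i32>>>]) -> Vec<(i16, i16)> {
    let mut levels = Vec::new();
    for row in rows {
        match row {
            // Null list slot: nothing is defined, and it starts a new row.
            None => levels.push((0, 0)),
            // Empty list slot: the list is defined but has no elements.
            Some(items) if items.is_empty() => levels.push((1, 0)),
            // Non-empty list: one level pair per element.
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // 3 = element present, 2 = element is null.
                    let def = if item.is_some() { 3 } else { 2 };
                    // 0 = first element of a new row, 1 = same list continues.
                    let rep = if i == 0 { 0 } else { 1 };
                    levels.push((def, rep));
                }
            }
        }
    }
    levels
}
```

For the rows `[[1, null], null, [], [2]]`, this produces five level pairs from only three child slots. A null or empty list still emits one level pair although it contributes no child values, which is where a length mismatch in the level buffers could plausibly arise.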

To Reproduce

Run the test below:

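use std::fs::File;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

#[test]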
fn test_write_ipc_nested_lists() {
    let fields = vec![Field::new(
        "list_a",
        DataType::List(Box::new(Field::new(
            "list_b",
            DataType::List(Box::new(Field::new(
                "struct_c",
                DataType::Struct(vec![
                    Field::new("prim_d", DataType::Boolean, true),
                    Field::new(
                        "list_e",
                        DataType::LargeList(Box::new(Field::new(
                            "string_f",
                            DataType::LargeUtf8,
                            true,
                        ))),
                        false,
                    ),
                ]),
                true,
            ))),
            false,
        ))),
        true,
    )];
    let schema = Arc::new(Schema::new(fields));
    // making this nullable guarantees that one of the list items will be empty, triggering the error
    let batch = arrow::util::data_gen::create_random_batch(schema, 3, 0.35, 0.6).unwrap();

    // write ipc (to read in pyarrow, and write parquet from pyarrow)
    let file = File::create("arrow_nested_random.arrow").unwrap();
    let mut writer =
        arrow::ipc::writer::FileWriter::try_new(file, batch.schema().as_ref()).unwrap();
    writer.write(&batch).unwrap();
    writer.finish().unwrap();

    let file = File::create("arrow_nested_random_rust.parquet").unwrap();
    let mut writer =
        ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None)
            .expect("Unable to write file");

    // this will trigger the error in question
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}

Expected behavior

The Parquet file should be written successfully, and pyarrow or Spark should be able to read the data back correctly.

Additional context

Not sure

@nevi-me nevi-me added parquet Changes to the parquet crate bug labels May 11, 2021
@nevi-me nevi-me changed the title Nested list levels not calculated correctly if list has 0 length element Nested list levels calculated incorrectly if list has 0 length element May 11, 2021

alamb commented Jul 29, 2022

I wonder if this is still an issue after the recent work from @tustvold and others to clean up nested struct / null handling?

nevi-me commented Jul 29, 2022

Reminder to self to test and close this if it's no longer an issue (very likely)

@tustvold tustvold added the development-process Related to development process of arrow-rs label Mar 16, 2023