Allow to read parquet binary column as UTF8 type #6539

Merged 2 commits on Oct 10, 2024

goldmedal

Which issue does this PR close?

No related issue.

Rationale for this change

While working on apache/datafusion#12788 (comment) in DataFusion, I found that a Parquet binary column can't be read as a string type (Utf8, LargeUtf8, or Utf8View) through ArrowReaderOptions::with_schema. I think it makes sense to allow reading such a column as strings when the user can guarantee that the binary values are valid strings.

What changes are included in this PR?

I added some matching rules in apply_hint in parquet/src/arrow/schema/primitive.rs to handle the binary-to-string cases.
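To illustrate the kind of rule described above, here is a minimal, std-only sketch. The `DataType` enum below is a hypothetical stand-in for arrow's type, and the exact set of binary-to-string pairs is an assumption based on the types named in the rationale (Utf8, LargeUtf8, Utf8View), not a copy of the merged code.

```rust
// Hypothetical, simplified stand-in for arrow's `DataType`; only the variants
// needed for this illustration are modeled.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Binary,
    LargeBinary,
    Utf8,
    LargeUtf8,
    Utf8View,
    Int32,
}

/// Returns the type a column should be read as: the hint when the pairing is
/// allowed, otherwise the type inferred from the Parquet schema.
fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
    match (&parquet, &hint) {
        // Rules visible in the diff context: widen to the large variants.
        (DataType::Utf8, DataType::LargeUtf8) => hint,
        (DataType::Binary, DataType::LargeBinary) => hint,
        // Assumed shape of the new rules: binary may be read as a string type.
        (DataType::Binary, DataType::Utf8)
        | (DataType::Binary, DataType::LargeUtf8)
        | (DataType::Binary, DataType::Utf8View) => hint,
        // Any other pairing ignores the hint and keeps the Parquet type.
        _ => parquet,
    }
}
```

In this sketch a hint that is not explicitly allowed is ignored rather than rejected, so `apply_hint(DataType::Int32, DataType::Utf8)` falls back to `DataType::Int32`.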

Are there any user-facing changes?

no

cc @alamb

.build()
.expect("reader with schema");

arrow_reader.next();

Suggested change
- arrow_reader.next();
+ arrow_reader.next().unwrap_err();

As this should error, given the data isn't actually UTF-8
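A std-only sketch of why such a read has to fail: viewing bytes as a string requires UTF-8 validation, and that validation rejects arbitrary binary data. The helper name `binary_values_as_str` is hypothetical, and the error wording only borrows the "Invalid UTF8 sequence at" prefix from the test in this PR; the reader's actual error text and code path may differ.

```rust
/// Tries to view each binary value as a string slice, failing on the first
/// value that is not valid UTF-8. Hypothetical helper for illustration only.
fn binary_values_as_str<'a>(values: &[&'a [u8]]) -> Result<Vec<&'a str>, String> {
    values
        .iter()
        .enumerate()
        .map(|(row, &bytes)| {
            // `from_utf8` validates without copying; invalid input is an error.
            std::str::from_utf8(bytes)
                .map_err(|e| format!("Invalid UTF8 sequence at row {row}: {e}"))
        })
        .collect()
}
```

Values such as `b"foo"` pass through unchanged, while a value containing bytes like `0xff` produces an `Err`, which is the behavior the suggested `unwrap_err()` asserts on the reader.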

Comment on lines 3132 to 3136
.column(0)
.as_any()
.downcast_ref::<StringArray>()
.expect("downcast to string")
.iter()

Suggested change
- .column(0)
- .as_any()
- .downcast_ref::<StringArray>()
- .expect("downcast to string")
- .iter()
+ .column(0)
+ .as_string::<i32>()
+ .iter()

And the same below

@@ -57,6 +57,11 @@ fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
(DataType::Utf8, DataType::LargeUtf8) => hint,
(DataType::Binary, DataType::LargeBinary) => hint,

// Read as Utf8

❤️

}

#[test]
#[should_panic(expected = "Invalid UTF8 sequence at")]

❤️

@alamb left a comment

Thank you @goldmedal and @tustvold -- this looks great to me

@tustvold merged commit 89075a7 into apache:master on Oct 10, 2024
16 checks passed
@goldmedal deleted the feature/read-as-string branch on October 10, 2024 17:27
@goldmedal

Thanks @alamb @tustvold !

Labels
parquet Changes to the parquet crate