Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Reading and Writing Extension FileTypes #8667

Closed
wants to merge 5 commits into from

Conversation

devinjdangelo
Copy link
Contributor

@devinjdangelo devinjdangelo commented Dec 28, 2023

Which issue does this PR close?

Closes #8657
Closes #8637
Closes #8345

Rationale for this change

DataFusion is built to be as extensible as possible. We currently have a FileType enum which prevents external crates adding support for reading/writing new FileFormats in a ListingTable.

What changes are included in this PR?

Creates a new FileType::Exension(Box<dyn ExtensionFileType>) variant.

Are these changes tested?

Partially by existing tests, additional likely needed

Are there any user-facing changes?

Ability to extend DataFusion to support reading/writing additional file formats more easily

@github-actions github-actions bot added the core Core DataFusion crate label Dec 28, 2023
Copy link
Contributor Author

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left built in FileTypes as concrete enum variants, while supporting externally defined FileTypes via a boxed trait object contained in a FileType variant. I think this is the path of least resistance towards supporting externally defined FileFormats along the read and write path in a ListingTable.

We could potentially also refactor the built in types (parquet, csv, ect) to be externally defined structs which implement ExtensionFileType and remove the FileType enum entirely. This would entail a more significant refactor of FileTypeWriterOptions as well.

@@ -41,7 +43,8 @@ pub trait GetExt {
}

/// Readable file type
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[allow(clippy::derived_hash_with_manual_eq)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the manual and auto traits should be compatible, so silencing this is OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably also manually implement Hash pretty easily as well, but I agree it isn't a bad thing

/// For extension [FileType]s, FileTypeWriterOptions simply stores
/// any passed StatementOptions to be handled later by any custom
/// physical plans (Such as a FileFormat::create_writer_physical_plan)
Extension(Option<StatementOptions>),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileTypeWriterOptions does not handle any parsing or validation for externally defined FileTypes. It simply acts as a container for the options passed in.

@@ -891,6 +891,9 @@ impl TryFrom<&FileTypeWriterOptions> for protobuf::FileTypeWriterOptions {
FileTypeWriterOptions::Arrow(ArrowWriterOptions {}) => {
return not_impl_err!("Arrow file sink protobuf serialization")
}
FileTypeWriterOptions::Extension(_) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to support serializing this since it is ultimately just a Vec<(String, String)>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also punt it into the FileType::Extension (e.g. add a serialize(&self) -> Result<Vec<u8>> type method)

@@ -605,6 +605,7 @@ impl DefaultPhysicalPlanner {
FileType::JSON => Arc::new(JsonFormat::default()),
FileType::AVRO => Arc::new(AvroFormat {} ),
FileType::ARROW => Arc::new(ArrowFormat {}),
FileType::Extension(_ext) => return not_impl_err!("Extension FileTypes not supported in Copy To statements.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The built in Copy plans can't support external types (since we can't initialize them in DataFusion), but it should be not too difficult to build a custom COPY PhysicalPlan that uses a custom FileFormat, but otherwise is the same and relies on the build in Copy logical plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have the FileTypeExtension have a method that returned Arc<dyn FileFormat> ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this (as well as combining ExtensionFileType trait into FileFormat), but that would require adding lots of dependencies between the DataFusion crates. FileType is in common and used in most of DataFusion while FileFormat is only in core.

This could work but I think the consequence would be DataFusion becomes one big crate. I think the only way to keep the independent crates is to have at least two File related abstractions. One simple one for logical planning in Common and one heavy duty one for Execution plans in core.

Even linking them is a challenge because then common depends on core 🤔

@devinjdangelo devinjdangelo marked this pull request as draft December 28, 2023 16:10
@devinjdangelo devinjdangelo changed the title WIP: Support Reading and Writing Extension FileTypes Support Reading and Writing Extension FileTypes Dec 28, 2023
@alamb
Copy link
Contributor

alamb commented Dec 29, 2023

We could potentially also refactor the built in types (parquet, csv, ect) to be externally defined structs which implement ExtensionFileType and remove the FileType enum entirely. This would entail a more significant refactor of FileTypeWriterOptions as well.

I think this would make an excellent follow on PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devinjdangelo -- this is a very neat PR.

As you say, I think making FileType itself a trait would be the ideal solution, but doing so would require refactoring a bunch more refactoring (likely FileTypeWriterOptions a trait for example)

I wonder if you have time to try that alternate approach and see how invasive it turns out to be 🤔

@@ -98,7 +98,7 @@ jobs:
with:
rust-version: stable
- name: Run tests (excluding doctests)
run: cargo test --lib --tests --bins --features avro,json,backtrace
run: RUST_MIN_STACK=504857600 cargo test --lib --tests --bins --features avro,json,backtrace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the new stack size needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Locally I did not need this. To get CI tests to pass I needed to boost stack size.

@@ -41,7 +43,8 @@ pub trait GetExt {
}

/// Readable file type
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[allow(clippy::derived_hash_with_manual_eq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably also manually implement Hash pretty easily as well, but I agree it isn't a bad thing

@@ -1716,6 +1719,9 @@ mod tests {
)
.await?;
}
FileType::Extension(_) => {
panic!("Extension file type not implemented in write path.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead return an NotImplement exception?

@@ -891,6 +891,9 @@ impl TryFrom<&FileTypeWriterOptions> for protobuf::FileTypeWriterOptions {
FileTypeWriterOptions::Arrow(ArrowWriterOptions {}) => {
return not_impl_err!("Arrow file sink protobuf serialization")
}
FileTypeWriterOptions::Extension(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also punt it into the FileType::Extension (e.g. add a serialize(&self) -> Result<Vec<u8>> type method)

@@ -605,6 +605,7 @@ impl DefaultPhysicalPlanner {
FileType::JSON => Arc::new(JsonFormat::default()),
FileType::AVRO => Arc::new(AvroFormat {} ),
FileType::ARROW => Arc::new(ArrowFormat {}),
FileType::Extension(_ext) => return not_impl_err!("Extension FileTypes not supported in Copy To statements.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have the FileTypeExtension have a method that returned Arc<dyn FileFormat> ?

@devinjdangelo
Copy link
Contributor Author

Thank you for the review @alamb. It should be achievable to move the built in FileTypes to use the same trait interface and retire the concrete enum types. This would serve as a built in example of how to implement the trait for extension types as well and would even allow pulling out the "built in" files supported to their own crates.

I'll try to find some time within the next week to try this.

@alamb
Copy link
Contributor

alamb commented Dec 31, 2023

well and would even allow pulling out the "built in" files supported to their own crates.

I think this would be an ideal long term outcome (e.g. have a datafusion-parquet and datafusion-json, datafusion-arrow crates that came with the specific format support

@devinjdangelo
Copy link
Contributor Author

I did not get back to this when Id hoped. It is still on my list, but I probably will not get to it until this weekend.

@devinjdangelo
Copy link
Contributor Author

I've been thinking about this more, and I think we need a similar interface to how we register a custom TableProvider at the SessionContext level. The idea being we'd have a SessionContext::register_file_type which accepts an Arc<dyn FileType> and stores it in a DashMap<String, Arc<dyn FileType>>. At planning and execution time, all the places we are doing a match FileType currently, would be replaced with a lookup on the map of registered FileTypes + a method call on the trait object.

The above is based on my understanding having studied a bit how we handle external TableProviders. @alamb, does the above sound reasonable to you?

My main uncertainty to the above is how we would handle serialization. Perhaps the FileType trait could have a method which returns a Vec<u8> based serialization of itself. I'm not sure how that would work on the receiving end exactly...

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Jan 6, 2024
@alamb
Copy link
Contributor

alamb commented Jan 8, 2024

The above is based on my understanding having studied a bit how we handle external TableProviders. @alamb, does the above sound reasonable to you?

Yes.

My main uncertainty to the above is how we would handle serialization. Perhaps the FileType trait could have a method which returns a Vec<u8> based serialization of itself. I'm not sure how that would work on the receiving end exactly...

I think the way serialization is handled today is via "ExtensionCodec"

Which do exactly as you suggest and serialize table providers to/from bytes

maybe @thinkharderdev has some more thoughts to share

@thinkharderdev
Copy link
Contributor

Yeah, the extension codecs seem like a logical place to handle this.

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 14, 2024
@github-actions github-actions bot closed this Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner Stale PR has not had any activity for some time
Projects
None yet
3 participants