Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change FileScanConfig.table_partition_cols from (String, DataType) to Fields #7890

Merged
merged 5 commits into from
Oct 22, 2023

Conversation

NGA-TRAN
Copy link
Contributor

@NGA-TRAN NGA-TRAN commented Oct 20, 2023

Which issue does this PR close?

Closes #7875

Rationale for this change

Currently, FileScanConfig.table_partition_cols has data type Vec<(String, DataType)> to store only columns name and its data type. A column can include many more information such as nullable and extra meta data. Thus, when we convert table_partition_cols to Fields here, all other information of a field will either empty or default.

We want the data type of table_partition_cols a vector of Fields in the first place so when we need to store a Field, we won't lose any information.

FYI: IOx needs this requirement.

What changes are included in this PR?

Replace data type of FileScanConfig.table_partition_cols from Vec<(String, DataType)> to Vec`

Are these changes tested?

Yes

Are there any user-facing changes?

The API to create FileScanConfig needs a vector of Fields for table_partition_cols. Most of the places it is an empty vector means it is not used.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 20, 2023
@NGA-TRAN
Copy link
Contributor Author

@alamb and @crepererum
Can you help review this? I do not have permission to set the label to api-change

@@ -101,7 +101,7 @@ pub struct FileScanConfig {
/// all records after filtering are returned.
pub limit: Option<usize>,
/// The partitioning columns
pub table_partition_cols: Vec<(String, DataType)>,
pub table_partition_cols: Vec<Field>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the key change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to use FieldRef not Field

@@ -135,8 +135,7 @@ impl FileScanConfig {
table_cols_stats.push(self.statistics.column_statistics[idx].clone())
} else {
let partition_idx = idx - self.file_schema.fields().len();
let (name, dtype) = &self.table_partition_cols[partition_idx];
table_fields.push(Field::new(name, dtype.to_owned(), false));
table_fields.push(self.table_partition_cols[partition_idx].to_owned());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this where we convert table_partition_cols to Field

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NGA-TRAN -- I think the idea of passing a real Field as the partition column makes a lot of sense and that this PR does it very nicely 👍

I had a few code improvement suggestions, but nothing I think is required to merge this.

Thanks again

datafusion/core/src/datasource/physical_plan/avro.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/physical_plan/csv.rs Outdated Show resolved Hide resolved
)
}

fn config_for_proj_with_field_tab_part(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this name confusing given the three letter abbreviations and I don't think this is common elsewhere in the DataFusion codebase.

How about something like

Suggested change
fn config_for_proj_with_field_tab_part(
fn config_for_projection_with_partition_fields(

Or maybe instead you could change config_for_projection to take table_partition_cols: Vec<Field>, and make a function like

/// Convert all 
fn partition_cols(  table_partition_cols: Vec<(&str, DataType)>) -> Vec<Field> {
         table_partition_cols
            .iter()
            .map(|(name, dtype)| Field::new(name, dtype.clone(), false))
            .collect::<Vec<_>>()
}

And then convert the call sites of config_for_projection to be config_for_projection(.., partition_cols(..))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented your second suggestion @alamb . Thanks

@alamb alamb changed the title feat: make data type of FileScanConfig.table_partition_cols a vector of Fields Change FileScanConfig.table_partition_cols from (String, DataType) to Fields Oct 20, 2023
@NGA-TRAN
Copy link
Contributor Author

I have addressed all the comments. Thanks @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN

@Dandandan Dandandan merged commit ca5dc8c into apache:main Oct 22, 2023
22 checks passed
@Dandandan
Copy link
Contributor

Thanks @NGA-TRAN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make data type of FileScanConfig.table_partition_cols a vector of Field
4 participants