Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify parquet statistics generation #5183

Merged
merged 1 commit into from
Dec 8, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Dec 7, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

Follow up to #5181

What changes are included in this PR?

Are there any user-facing changes?

// If only computing chunk-level statistics compute them here, page-level statistics
// are computed in [`Self::write_mini_batch`] and used to update chunk statistics in
// [`Self::add_data_page`]
if self.statistics_enabled == EnabledStatistics::Chunk
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major change, rather than computing the chunk statistics here, we always compute statistics per-page, unless completely disabled, and just skip writing them to the page as per the change in #5181.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defiantly simpler than #5181 IMO. Having less places that do update_min/update_max(...) on the column's stats makes it much easier to follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any performance implications of always computing the statistics even when they aren't written to the page?

Though it seems like if we are writing column chunk level statistics anyways we would have to calculate the min/max across all values anyways, so tracking per page shouldn't be any more expensive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially for byte arrays you will have additional allocations per page, where previously you would have them per chunk. In practice this is extremely unlikely to matter.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this and it makes sense to me, though I am not a super expert in this area of code

// If only computing chunk-level statistics compute them here, page-level statistics
// are computed in [`Self::write_mini_batch`] and used to update chunk statistics in
// [`Self::add_data_page`]
if self.statistics_enabled == EnabledStatistics::Chunk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any performance implications of always computing the statistics even when they aren't written to the page?

Though it seems like if we are writing column chunk level statistics anyways we would have to calculate the min/max across all values anyways, so tracking per page shouldn't be any more expensive

}
}
};
if let Some(min) = min {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This min is only passed in when precomputed statistics are used (write_batch_with_statistics). Is that correct? I couldn't find another callsite where min and max were non zero

@alamb
Copy link
Contributor

alamb commented Dec 8, 2023

Thank you @tustvold

@tustvold tustvold merged commit 7e28913 into apache:master Dec 8, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants