PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding #192

Merged 1 commit into apache:master on Feb 10, 2023

wgtmac
Member

@wgtmac wgtmac commented Feb 10, 2023

This PR proposes to state explicitly that no padding is allowed within a data page. This makes it easier for a BYTE_STREAM_SPLIT decoder to decode a page that contains nulls: it can simply compute the number of encoded values as total_length_encoded_stream / K (K = 4 for FLOAT and 8 for DOUBLE). Otherwise, it has to decode the definition/repetition levels to get the exact number of non-null values.
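To illustrate the point, here is a minimal Python sketch, not taken from this PR or from any Parquet implementation. The function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and the sketch assumes little-endian IEEE 754 values and an input buffer that contains only the encoded streams. Under the no-padding rule the value count falls out of the buffer length alone:

```python
import struct

def byte_stream_split_encode(values, fmt="f"):
    """Hypothetical helper: split each value into K bytes and emit K streams.

    fmt is a struct code: "f" (K = 4, FLOAT) or "d" (K = 8, DOUBLE).
    """
    k = struct.calcsize("<" + fmt)
    raw = [struct.pack("<" + fmt, v) for v in values]
    # Stream j holds byte j of every value; streams are concatenated in order.
    return b"".join(bytes(r[j] for r in raw) for j in range(k))

def byte_stream_split_decode(data, fmt="f"):
    """Hypothetical helper: recover values using only the buffer length."""
    k = struct.calcsize("<" + fmt)
    if len(data) % k != 0:
        # With no padding allowed, the length must be an exact multiple of K.
        raise ValueError("encoded length is not a multiple of K")
    n = len(data) // k  # number of encoded (non-null) values
    return [
        # Byte j of value i sits at offset j * n + i.
        struct.unpack("<" + fmt, bytes(data[j * n + i] for j in range(k)))[0]
        for i in range(n)
    ]

encoded = byte_stream_split_encode([1.0, 2.5, -3.0], "f")
decoded = byte_stream_split_decode(encoded, "f")
```

If trailing padding were permitted, `len(data) // k` would no longer be the value count, and a decoder would have to fall back to the definition/repetition levels to find it.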

@wgtmac
Member Author

wgtmac commented Feb 10, 2023

@pitrou
Member

pitrou commented Feb 10, 2023

cc @wjones127

@mapleFU
Member

mapleFU commented Feb 10, 2023

Should we check that no padding is added in any implementation? At least C++, Rust, and parquet-mr don't seem to add padding at the end of the data.

@emkornfield
Contributor

Seems OK to me.

@shangxinli shangxinli merged commit 230711f into apache:master Feb 10, 2023
5 participants