Support Parquet Byte Stream Split Encoding #5293

Merged
merged 14 commits into apache:master on Jan 12, 2024

Conversation

mwlon (Contributor) commented Jan 10, 2024

Which issue does this PR close?

Closes #4102.

This builds on the previous attempt at a PR: #4183

Rationale for this change

I believe this brings us up to speed with the full set of Parquet encodings. It will also be important for the likely addition of f16 and fixed-length byte arrays to the byte stream split encoding.

What changes are included in this PR?

  • implemented byte stream split encoding (see the sketch after this list)
  • benchmark suite for encodings
    • Measured the performance as 3x faster than the previous PR's implementation. There are definitely more potential performance wins to be had with SIMD though.
  • tests for the new encoding, including a test file in parquet-testing created via pyarrow.
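
For context, the transform behind the encoding is simple: for N values of K bytes each, output stream j collects byte j of every value, so similar bytes (for example the exponent bytes of floats) end up adjacent and compress better under a general-purpose codec afterwards. The following is a minimal illustrative sketch of the encode direction for f32; it is not the code added in this PR.

// Illustrative sketch only, not the crate's implementation.
fn byte_stream_split_f32(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (j, byte) in v.to_le_bytes().iter().enumerate() {
            // byte j of value i lands in stream j, at position j * n + i
            out[j * n + i] = *byte;
        }
    }
    out
}

Decoding is the inverse gather back into K-byte values.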

Are there any user-facing changes?

No API additions

The github-actions bot added the parquet label (Changes to the parquet crate) on Jan 10, 2024
mwlon (Contributor, Author) commented Jan 10, 2024

@tustvold

ggreco left a comment

I'm not part of the Apache foundation, but I will try to use your code ASAP in my application. I reviewed it and it seems good.

@@ -0,0 +1,76 @@
use crate::basic::{Encoding, Type};

You need to add the Apache license header to get it approved.

@@ -0,0 +1,104 @@
use std::marker::PhantomData;

You need to add the Apache license header to get it approved.

pitrou (Member) commented Jan 10, 2024

> Measured the performance as 3x faster than the previous PR's implementation.

FTR, what are the actual numbers?

.downcast_ref::<Float64Array>()
.unwrap();

// This file contains floats from a standard normal distribution
Member:

You can probably do an exact comparison on a few values as well, given that the file isn't going to change :-)

],
vec![f32::from_le_bytes([0xA3, 0xB4, 0xC5, 0xD6])],
];
test_byte_stream_split_decode::<FloatType>(data);
Member:

Also add a test for DoubleType?

mwlon (Author):

done 👍
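
For illustration, the DoubleType case can mirror the FloatType call above; the bytes below are made up, and the test actually added in the PR may differ.

// Same shape as the FloatType data above, but with 8-byte little-endian values.
let data = vec![
    vec![f64::from_le_bytes([0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF])],
    vec![f64::from_le_bytes([0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10])],
];
test_byte_stream_split_decode::<DoubleType>(data);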

mwlon (Contributor, Author) commented Jan 10, 2024

> FTR, what are the actual numbers?

On my machine it was about 17 µs for floats and 36 µs for doubles, with roughly the same speed for encoding and decoding. These all come in just short of 4 GB/s; the implementation I built from was around 1.2 GB/s.

tustvold (Contributor) left a comment

The code makes sense to me and looks well tested. Thank you.

for batch in record_reader {
    let batch = batch.unwrap();
    row_count += batch.num_rows();
    let f32_col = batch

mwlon (Author) commented Jan 12, 2024

done 👍

tustvold changed the title from "Byte stream split" to "Support Parquet Byte Stream Split Encoding" on Jan 11, 2024
tustvold (Contributor):

Perhaps we could update the README to no longer say that we don't support this encoding.

mwlon (Contributor, Author) commented Jan 12, 2024

Updated the README. Is this ready to merge now?

tustvold merged commit 4c3e9be into apache:master on Jan 12, 2024
17 checks passed
tustvold (Contributor):

Thanks again

ggreco commented Jan 16, 2024

@tustvold I see the v50 release is from 4 days ago and assumed it would contain this change, but when I updated to v50 and enabled BYTE_STREAM_SPLIT, I got this error:

thread 'upload::tests::test_upload_sensor' panicked at 'called `Result::unwrap()` on an `Err` value: NYI("Encoding BYTE_STREAM_SPLIT is not supported")', /Users/gabry/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-50.0.0/src/column/writer/mod.rs:249:58

tustvold (Contributor) commented Jan 16, 2024

The release was cut prior to this being merged; releases go through an ASF-mandated RC process that takes at least 3 days. You can check the changelog and/or the git history to see what made it into a particular release.
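
For reference, once a parquet release that includes this change is available, opting every column into the encoding would look roughly like the sketch below; set_column_encoding can be used instead to target a single column. With parquet 50.0.0 this same configuration is what produces the NYI error above at write time.

use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;

fn main() {
    // Request BYTE_STREAM_SPLIT as the default encoding for all columns;
    // the properties are then passed to an ArrowWriter or SerializedFileWriter.
    let props = WriterProperties::builder()
        .set_encoding(Encoding::BYTE_STREAM_SPLIT)
        .build();
    let _ = props;
}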

alamb (Contributor) commented Mar 2, 2024

Tracking the next release in #5453.
