-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update thrift v0.16 and vendor parquet-format (#2502) #2626
Conversation
f327b3a
to
bebbd30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
I didn't review the generated thift code.
I also could not verify I could re-create them with the instructions in CONTRIBUTING.md
- I think those instructions should be updated while this is still fresh in our minds
r/src/arrowExports.cpp linguist-generated=true | ||
r/man/*.Rd linguist-generated=true | ||
|
||
parquet/src/format.rs linguist-generated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
bytes = { version = "1.1", default-features = false, features = ["std"] } | ||
thrift = { version = "0.13", default-features = false } | ||
thrift = { version = "0.16", default-features = false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -61,6 +61,9 @@ macro_rules! experimental { | |||
pub mod errors; | |||
pub mod basic; | |||
|
|||
#[allow(clippy::derivable_impls, clippy::match_single_binding)] | |||
pub mod format; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub mod format; | |
/// Automatically generated code for reading parquet thrift definition. | |
// see parquet/CONTRIBUTING.md for instructions on regenerating | |
pub mod format; |
parquet/src/basic.rs
Outdated
parquet::Type::DOUBLE => Type::DOUBLE, | ||
parquet::Type::BYTE_ARRAY => Type::BYTE_ARRAY, | ||
parquet::Type::FIXED_LEN_BYTE_ARRAY => Type::FIXED_LEN_BYTE_ARRAY, | ||
_ => return Err(general_err!("unexpected type: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => return Err(general_err!("unexpected type: {}", value.0)), | |
_ => return Err(general_err!("unexpected parquet type: {}", value.0)), |
parquet/src/basic.rs
Outdated
parquet::FieldRepetitionType::REQUIRED => Repetition::REQUIRED, | ||
parquet::FieldRepetitionType::OPTIONAL => Repetition::OPTIONAL, | ||
parquet::FieldRepetitionType::REPEATED => Repetition::REPEATED, | ||
_ => return Err(general_err!("unexpected repetition type: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => return Err(general_err!("unexpected repetition type: {}", value.0)), | |
_ => return Err(general_err!("unexpected parquet repetition type: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am suggesting adding parquet
to all the errors thinking of the poor end user who will see this without the potential stack to know what the context is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential stack to know what the context is
If only there was a well established solution to this conundrum... grumble 😆
parquet/src/basic.rs
Outdated
parquet::Encoding::DELTA_BYTE_ARRAY => Encoding::DELTA_BYTE_ARRAY, | ||
parquet::Encoding::RLE_DICTIONARY => Encoding::RLE_DICTIONARY, | ||
parquet::Encoding::BYTE_STREAM_SPLIT => Encoding::BYTE_STREAM_SPLIT, | ||
_ => return Err(general_err!("unexpected encoding: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => return Err(general_err!("unexpected encoding: {}", value.0)), | |
_ => return Err(general_err!("unexpected parquet encoding: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious about the other values (as in what variants are _
matching) so I commented it out:
Compiling parquet v22.0.0 (/Users/alamb/Software/arrow-rs/parquet)
error[E0004]: non-exhaustive patterns: `Encoding(i32::MIN..=-1_i32)`, `Encoding(1_i32)` and `Encoding(10_i32..=i32::MAX)` not covered
--> parquet/src/basic.rs:769:18
|
769 | Ok(match value {
| ^^^^^ patterns `Encoding(i32::MIN..=-1_i32)`, `Encoding(1_i32)` and `Encoding(10_i32..=i32::MAX)` not covered
|
In case any other reviewer was interestd
parquet/src/basic.rs
Outdated
parquet::PageType::INDEX_PAGE => PageType::INDEX_PAGE, | ||
parquet::PageType::DICTIONARY_PAGE => PageType::DICTIONARY_PAGE, | ||
parquet::PageType::DATA_PAGE_V2 => PageType::DATA_PAGE_V2, | ||
_ => return Err(general_err!("unexpected page type: {}", value.0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => return Err(general_err!("unexpected page type: {}", value.0)), | |
_ => return Err(general_err!("unexpected parquet page type: {}", value.0)), |
parquet/CONTRIBUTING.md
Outdated
# docker build just builds a docker image with thrift dependencies | ||
$ docker build -t thrift build/docker/ubuntu-bionic | ||
# build/docker/scripts/cmake.sh actually compiles thrift | ||
$ docker run -v $(pwd):/thrift/src -it thrift build/docker/scripts/cmake.sh && wget https://github.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift && ./cmake_build/compiler/cpp/bin/thrift --gen rs parquet.thrift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command did not complete successfully for me.
...
441: ----------------------------------------------------------------------
441: Ran 2 tests in 0.012s
441:
441: OK
441: Traceback (most recent call last):
441: File "/thrift/src/test/py/TestServer.py", line 403, in <module>
441: from thrift.TMultiplexedProcessor import TMultiplexedProcessor
441: File "/thrift/src/lib/py/build/lib.linux-x86_64-2.7/thrift/TMultiplexedProcessor.py", line 20, in <module>
441: from thrift.Thrift import TProcessor, TMessageType
441: ImportError: No module named Thrift
441: t.py
441: ----
441: ----------------
441: Executing Client/Server tests with various generated code directories
441: Servers to be tested: TSimpleServer, TThreadedServer, TThreadPoolServer, TNonblockingServer, THttpServer, TProcessPoolServer, TForkingServer
441: Directories to be tested: gen-py-default, gen-py-slots, gen-py-oldstyle, gen-py-no_utf8strings, gen-py-dynamic, gen-py-dynamicslots
441: Protocols to be tested: accel, accelc, binary, compact, json, header
441: Options to be tested: ZLIB(yes/no), SSL(yes/no)
441: ----------------
441:
441: Test run #0: (includes gen-py-default) Server=TSimpleServer, Proto=accel, zlib=False, SSL=False
441: Testing server TSimpleServer: /usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer
441: FAIL: Server process (/usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer) failed with retcode 1
441: Traceback (most recent call last):
441: File "/thrift/src/test/py/RunClientServer.py", line 323, in <module>
441: sys.exit(main())
441: File "/thrift/src/test/py/RunClientServer.py", line 315, in main
441: tests.test_feature('gendir', generated_dirs)
441: File "/thrift/src/test/py/RunClientServer.py", line 230, in test_feature
441: if self.run(conf, test_count):
441: File "/thrift/src/test/py/RunClientServer.py", line 219, in run
441: runServiceTest(self.libdir, self.genbase, genpydir, try_server, try_proto, self.port, with_zlib, with_ssl, self.verbose)
441: File "/thrift/src/test/py/RunClientServer.py", line 157, in runServiceTest
441: ensureServerAlive()
441: File "/thrift/src/test/py/RunClientServer.py", line 140, in ensureServerAlive
441: % (server_class, ' '.join(server_args)))
441: Exception: Server subprocess TSimpleServer died, args: /usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer
441/441 Test #441: python_test .......................***Failed 28.49 sec
99% tests passed, 6 tests failed out of 441
Total Test time (real) = 250.09 sec
The following tests FAILED:
382 - TInterruptTest (Failed)
401 - TNonblockingSSLServerTest (Failed)
403 - SecurityTest (Failed)
404 - SecurityFromBufferTest (Failed)
433 - PythonTestSSLSocket (Failed)
441 - python_test (Failed)
Errors while running CTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I ended up just using the version vended by Arch linux... Perhaps I will just update the instructions to do that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- maybe docker run a arch linux
image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, although of course the formatting is now different... But at least we now have a one-liner
To generate the parquet format (thrift definitions) code run from the repository root run | ||
|
||
``` | ||
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://github.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && mv parquet.rs parquet/src/format.rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this and it works great:
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-rs$ diff parquet.rs parquet/src/format.rs
26d25
< use thrift::server::TProcessor;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I only managed to skim through it but overall looks great. Thanks @tustvold !
Benchmark runs are scheduled for baseline = 2f360e1 and contender = d88ed6a. d88ed6a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
🎉 |
Which issue does this PR close?
Closes #2502
Rationale for this change
See ticket
What changes are included in this PR?
Vendors the generated thrift code, and updates to thrift v0.16. Unfortunately thrift no longer generates enumerations, instead generating constants. This requires some churn in the conversion logic, and making some things fallible that previously weren't.
Are there any user-facing changes?
Yes, although only to the low-level interfaces.