Set io implementation as provided for smb and parquet #4857
// parquet-avro depends on avro 1.10.x
"org.apache.avro" % "avro",
"org.apache.avro" % "avro-compiler"
Libs are marked as provided in me.lyh:parquet-avro.
"org.apache.avro" % "avro" % avroVersion % Provided,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion % Provided,
"org.apache.parquet" % "parquet-avro" % parquetVersion % Provided excludeAll ("org.apache.avro" % "avro"),
"org.apache.parquet" % "parquet-column" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-common" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-hadoop" % parquetVersion % Provided,
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion % Provided,
For SMB, I think it is safe to assume people will also have scio-avro, scio-parquet, or scio-tensorflow as a dependency. We should not pull everything in here.
hm I do think it's fairly common to have scio-smb but not scio-parquet in your dependencies; I think this change makes sense, but let's definitely call it out in the migration guide.
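For the migration guide, the user-facing consequence could be sketched as the build fragment below. It assumes a project that previously relied on scio-smb to bring the Parquet libraries in transitively; `scioVersion` and its value are placeholders, not taken from this PR.

```scala
// build.sbt (sketch). `scioVersion` is a placeholder: set it to the Scio release you use.
val scioVersion = "x.y.z"

libraryDependencies ++= Seq(
  "com.spotify" %% "scio-smb" % scioVersion,
  // With the IO libraries marked Provided in scio-smb, the Parquet
  // implementation has to be declared explicitly by the project using it.
  "com.spotify" %% "scio-parquet" % scioVersion
)
```

Projects that use the Avro or TensorFlow SMB implementations would presumably declare scio-avro or scio-tensorflow in the same way.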
Can we detect this at runtime and issue a reasonable warning? E.g. if a user uses an instance of ParquetAvroSortedBucketIO but scio-parquet is not available, we fail. Maybe try loading some standard or sentinel class/value.
The ClassNotFoundException should be a big enough clue.
I'll update the SMB part of the doc to make sure we mention that.
In theory we can wrap the ClassNotFoundException with more meaningful instructions. Even with good docs, there will be confused users writing to support channels.
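A minimal sketch of that wrapping idea, assuming a sentinel-class lookup: `ProvidedDependencyCheck` and `requireClass` are hypothetical names for illustration, not part of scio, and this is not necessarily how the PR implements its checks.

```scala
// Hypothetical helper (not scio API): fail fast with an actionable message
// when a class from a Provided dependency is missing at runtime.
object ProvidedDependencyCheck {
  def requireClass(className: String, module: String): Unit =
    try {
      // initialize = false: only check that the class can be located,
      // without running its static initializers.
      Class.forName(className, false, getClass.getClassLoader)
      ()
    } catch {
      case e: ClassNotFoundException =>
        // Keep the original exception as the cause so the stack trace survives.
        throw new IllegalStateException(
          s"$className is not on the classpath; add $module to your dependencies.",
          e
        )
    }
}
```

An SMB implementation could call something like `ProvidedDependencyCheck.requireClass("org.apache.parquet.hadoop.ParquetWriter", "scio-parquet")` from its static or object initializer; the choice of sentinel class here is an assumption.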
I added some checks in the static class/object initialization of the various SMB implementations.
@@ -1068,7 +1064,8 @@ lazy val `scio-parquet`: Project = project
"org.apache.parquet" % "parquet-hadoop" % parquetVersion,
"org.scala-lang.modules" %% "scala-collection-compat" % scalaCollectionCompatVersion,
"org.slf4j" % "slf4j-api" % slf4jVersion,
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion,
// provided
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion % Provided,
👍
Codecov Report
@@ Coverage Diff @@
## main #4857 +/- ##
==========================================
+ Coverage 62.46% 62.53% +0.06%
==========================================
Files 282 281 -1
Lines 10412 10433 +21
Branches 776 781 +5
==========================================
+ Hits 6504 6524 +20
- Misses 3908 3909 +1
9bcb25b to 1b4b791
d7620d2 to f7ec1a1
Fix #4788
Fix #4327