Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set io implementation as provided for smb and parquet #4857

Merged
merged 4 commits into from
Jun 8, 2023

Conversation

RustedBones
Copy link
Contributor

@RustedBones RustedBones commented Jun 5, 2023

Fix #4788
Fix #4327

Comment on lines -1051 to -1053
// parquet-avro depends on avro 1.10.x
"org.apache.avro" % "avro",
"org.apache.avro" % "avro-compiler"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Libs are marked as provided in me.lyh:parquet-avro

Comment on lines +1356 to +1362
"org.apache.avro" % "avro" % avroVersion % Provided,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion % Provided,
"org.apache.parquet" % "parquet-avro" % parquetVersion % Provided excludeAll ("org.apache.avro" % "avro"),
"org.apache.parquet" % "parquet-column" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-common" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-hadoop" % parquetVersion % Provided,
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion % Provided,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For smb, I think it is safe to assume pple will also have scio-avro scio-parquet or scio-tensorflow as depenedecy. Do not pull everything here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm I do think it's fairly common to have scio-smb but not scio-parquet in your dependencies; I think this change makes sense, but let's definitely call it out in the migration guide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we detect at runtime & issue a reasonable warning, e.g. if a user uses an instance of ParquetAvroSortedBucketIO but scio-parquet is not available we fail? Maybe try some loading some standard or sentinel class/value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ClassNotFoundException should be a big enough clue.
I'll update the SMB part of the doc to make sure we mention that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory we can wrap ClassNotFoundException with more meaningful instructions. Even with good docs, there will be confused users writing to support channels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some checks in the static class/object initialization of the various SMB impl

@@ -1068,7 +1064,8 @@ lazy val `scio-parquet`: Project = project
"org.apache.parquet" % "parquet-hadoop" % parquetVersion,
"org.scala-lang.modules" %% "scala-collection-compat" % scalaCollectionCompatVersion,
"org.slf4j" % "slf4j-api" % slf4jVersion,
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion,
// provided
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion % Provided,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +1356 to +1362
"org.apache.avro" % "avro" % avroVersion % Provided,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion % Provided,
"org.apache.parquet" % "parquet-avro" % parquetVersion % Provided excludeAll ("org.apache.avro" % "avro"),
"org.apache.parquet" % "parquet-column" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-common" % parquetVersion % Provided,
"org.apache.parquet" % "parquet-hadoop" % parquetVersion % Provided,
"org.tensorflow" % "tensorflow-core-api" % tensorFlowVersion % Provided,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm I do think it's fairly common to have scio-smb but not scio-parquet in your dependencies; I think this change makes sense, but let's definitely call it out in the migration guide.

@codecov
Copy link

codecov bot commented Jun 5, 2023

Codecov Report

Merging #4857 (8d74d4c) into main (de5f3f2) will increase coverage by 0.06%.
The diff coverage is 60.65%.

❗ Current head 8d74d4c differs from pull request most recent head f7ec1a1. Consider uploading reports for the commit f7ec1a1 to get more accurate results

@@            Coverage Diff             @@
##             main    #4857      +/-   ##
==========================================
+ Coverage   62.46%   62.53%   +0.06%     
==========================================
  Files         282      281       -1     
  Lines       10412    10433      +21     
  Branches      776      781       +5     
==========================================
+ Hits         6504     6524      +20     
- Misses       3908     3909       +1     
Impacted Files Coverage Δ
...scala/com/spotify/scio/datastore/DatastoreIO.scala 9.09% <0.00%> (-7.58%) ⬇️
...m/spotify/scio/jdbc/syntax/SCollectionSyntax.scala 33.33% <25.00%> (-66.67%) ⬇️
...sdk/extensions/smb/ParquetTypeFileOperations.scala 96.55% <50.00%> (-3.45%) ⬇️
...m/spotify/scio/jdbc/syntax/ScioContextSyntax.scala 80.00% <66.66%> (-20.00%) ⬇️
.../src/main/scala/com/spotify/scio/jdbc/JdbcIO.scala 72.72% <75.00%> (+45.22%) ⬆️
...tify/scio/datastore/syntax/SCollectionSyntax.scala 100.00% <100.00%> (ø)
...tify/scio/datastore/syntax/ScioContextSyntax.scala 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

@RustedBones RustedBones added this to the 0.13.0 milestone Jun 5, 2023
@RustedBones RustedBones merged commit 74e7afb into main Jun 8, 2023
@RustedBones RustedBones deleted the provided-io-impl branch June 8, 2023 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make tensorflow dependencies provided in scio-parquet? SMB module is pulling all storage implementations
4 participants