#787 - Move encoder implementation details to external shim library (not dependent on the Spark 4 release) #800

Open: wants to merge 73 commits into master

Conversation

chris-twiner (Contributor)

Per #787 and #300: the key files and code that have changed since the 2.4->3 migration are abstracted out and moved to arity-based shims and helper functions (this includes Spark 4 snapshot changes; see below). If the approach and scope of shim usage within Frameless is OK, I'll push out an actual 0.0.1 cut (excluding Spark 4); in the meantime the snapshots are on central.

Frameless' use of the internal APIs for extending or mixing in is thus far (aside from the lit pushdown issue) isolated to the shim interfaces for encoding usage (and creating AnalysisException). As such, a version compiled from the shimmed 0.16 against 3.5.0 allows encoding to be used on 14.3 LTS (3.5 with the 4.0 StaticInvoke) down to 9.1 LTS (3.1.3) without issue.

Of note: I've run the Quality tests against the Spark 4 SNAPSHOT proper (with the Frameless 3.5 build), and in a local build of Frameless against Spark 4 all tests pass, although the cats tests sometimes freeze when run directly (on "inner pairwise monoid"). (I've also run the Quality tests against Frameless built against 4.)

Lit and UDF on the base Expression are (as of 29th Feb) stable API-wise against Spark 4, so there is no need for shims there.

NB: I've only pushed a Spark 4 shim_runtime against 2.13 into snapshots; I won't push any full versions until RCs drop.

If the approach / change scope is OK I'll push release versions of shim out and remove the resolvers:

// needed for shim_runtime snapshots
resolvers in Global += MavenRepository(
  "sonatype-s01-snapshots",
  Resolver.SonatypeS01RepositoryRoot + "/snapshots"
)
// needed for 4.0 snapshots
resolvers in Global += MavenRepository(
  "apache_snaps",
  "https://repository.apache.org/content/repositories/snapshots"
)

Per #787: let me know if I should shim something else (I don't see Dataset/SparkSession etc. as being useful, but I'd be happy to move the Scala reflection stuff over to shim, not that it's currently needed).

NB: Spark 4 changes:

build.sbt needs the correct shim_runtime_4.0.0.oss_4.0 dependency and the 2.13 main Scala version (as well as JDK 17/21). Comments have the versions.
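
For illustration only, a rough sketch of that wiring; the organisation, Scala patch version and snapshot version below are assumptions, the real coordinates are in the build.sbt comments:

scalaVersion := "2.13.12" // 2.13 main Scala version; exact patch version assumed
// hypothetical shim coordinates -- check the build.sbt comments for the real ones
libraryDependencies +=
  "com.sparkutils" %% "shim_runtime_4.0.0.oss_4.0" % "0.0.1-SNAPSHOT"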

Source (changes compatible with 0.16 builds):

  1. Swapped FramelessInternals to use a shim to create an AnalysisException (different args on Spark 4); see the sketch after this list.
  2. TypedColumn gets actual imports from org.apache.spark.sql.catalyst.expressions (there is a new With expression in Spark 4).
  3. Pushdown tests need to use the previous currentTimestamp code (Spark 4 removed it; this could be shimmed if preferred).
  4. SchemaTests.structToNonNullable resets the metadata; Spark 4 sets metadata, so the properties don't hold and you'll get:
Expected StructType(StructField(_1,LongType,false),StructField(_2,LongType,false)) but got StructType(StructField(_1,LongType,false),StructField(_2,LongType,false))
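
A rough sketch of the shape behind item 1 above; the trait and method names are assumed for illustration and are not the actual shim API:

import org.apache.spark.sql.AnalysisException

// hypothetical interface: Frameless asks the shim for an AnalysisException
// rather than calling the version-specific constructor itself; the runtime
// shim jar supplies the Spark 3 or Spark 4 variant.
trait AnalysisExceptionShim {
  def analysisException(message: String): AnalysisException
}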

To run tests on JDK 17/21 you'll need to add the following to the VM args (an sbt sketch follows the list):

--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
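
One way to wire those flags into forked sbt test runs; a sketch only, using the same in-style sbt settings as the resolvers above:

// sketch: pass the --add-opens flags above to the forked test JVM
fork in Test := true
javaOptions in Test ++= Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
  // ... the remaining flags from the list above ...
  "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED"
)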



codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 96.81529%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 95.60%. Comparing base (0fb9c58) to head (986891a).
Report is 1 commit behind head on master.

❗ Current head 986891a differs from the pull request's most recent head 25cc5c3. Consider uploading reports for commit 25cc5c3 to get more accurate results.

Files                                                   Patch %   Lines
...taset/src/main/scala/frameless/RecordEncoder.scala   92.85%    5 Missing ⚠️
...taset/src/main/scala/frameless/functions/Udf.scala   93.10%    4 Missing ⚠️
.../src/main/scala/frameless/FramelessInternals.scala   96.42%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
+ Coverage   95.46%   95.60%   +0.13%     
==========================================
  Files          67       65       -2     
  Lines        1257     1341      +84     
  Branches       42       52      +10     
==========================================
+ Hits         1200     1282      +82     
- Misses         57       59       +2     
Flag                 Coverage Δ
2.12-root-spark33    95.30% <95.54%> (-0.16%) ⬇️
2.12-root-spark34    ?
2.12-root-spark35    95.52% <96.49%> (+0.06%) ⬆️
2.13-root-spark35    96.08% <97.10%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@cchantep (Collaborator) commented Mar 2, 2024

Is there an issue with the existing build? Otherwise I'm not sure we want to introduce such an upstream dependency.

@chris-twiner (Contributor, Author)

> Is there an issue with the existing build? Otherwise I'm not sure we want to introduce such an upstream dependency.

As described in #787, the 14.2 and 14.3 LTS Databricks Runtimes cannot use Frameless 0.16 because they backport Spark 4 changes. The core of the encoding derivation logic, at least, has been essentially identical since the 2.4 days (2020, when I started making commits to support newer versions/runtimes); what changes is internal Spark API usage.

This PR aims for a hot-swappable, jar-based solution to such changes and would reduce the burden on the core of Frameless (including committers) to support such runtime differences, i.e. focus on OSS major releases only* and push the runtime compatibility issues out to another library.

* This is not strictly required for encoding either; per the above, a 0.16+shim Frameless base can encode on all versions from 3.1.3 through to the 4.0 nightlies/14.3 DBR by swapping the runtime jar.

@pomadchin (Member)

Yes, feels like we def need some kind of shims to simplify the life of the DB folks who use frameless.

@chris-twiner (Contributor, Author)

Per b880261, the proper fixes for #803 and #804 are confirmed as working on all LTS versions of Databricks, Spark 4, and the latest 15.0 runtime; the test combinations are documented here.

@chris-twiner (Contributor, Author)

NB: the 25cc5c3 test cases are able to run on a full 14.3 and 15.0 cluster; most failures are now due to randomness rather than actual test code issues.

@pomadchin (Member) commented Jun 18, 2024

omg that's huge; I'll try to get back to you soonish!

@pomadchin (Member)

RE: Spark 4: if Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains the play2.9 libs.

RE: DBR compat layer: this PR is hard to review 🤔 we need to figure something out with fmt.

@pomadchin (Member) left a comment


I started doing some reviews, fmt is not our best friend, but the commit history is! 🎉

Comment on lines +40 to +43
implicit class seqToRdd[A: ClassTag](
seq: Seq[A]
)(implicit
sc: SC) {
Member
I wonder, is this how scalafmt formats it?

Contributor (Author)

The builds won't run unless scalafmt has been run on the files, so yes, those settings lead to some hideous-looking code.

Comment on lines +138 to +139
val expectedSum =
if (seq.isEmpty) None else Some(Foldable[List].fold(seq))
Member

fmt is super weird

Comment on lines -1 to -7
package frameless

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MapGroups => SMapGroups}

object MapGroups {
Member

Q. What makes it impossible to keep these MapGroups here, and use the override from the shim in the client code?

The client code will pick up the version from shim, and it will enforce the MapGroups implementation from the dependency.

We can play around with it, that's for sure, but I am curious how far we can push this to the user side, not enforcing behavior / shim usage by the lib implementation.

Member

^ Also, given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps).

Contributor (Author)

> ^ Also, given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps).

That's the point of the shim lib. When a user wants an upgrade (and the shim is there) they use the newer shim; there is no need to change the frameless version used. The complexity is there, it's just a question of how to manage it, and I'd personally prefer it be in a single lib rather than in client code. Currently only the testing is version-specific.

You could, per the other point, move the call to the impl out into a typeclass; this wouldn't impact the user's source all that much if a default impl passing through to shim existed (a recompile would be needed, of course, for the extra implicit). That could give the best of both worlds: point-location fixes and a bunch of default fixes. See the sketch below.
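
A rough sketch of the shape such a typeclass could take; the name MapGroupsBuilder and the signature below are hypothetical, loosely mirroring the removed frameless.MapGroups wrapper rather than any actual API:

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// hypothetical typeclass: call sites take a builder implicitly, a default
// instance delegates to shim, and a user can supply their own plan
// construction without changing Frameless itself.
trait MapGroupsBuilder {
  def mapGroups[K: Encoder, T: Encoder, U: Encoder](
      func: (K, Iterator[T]) => TraversableOnce[U],
      groupingAttributes: Seq[Attribute],
      dataAttributes: Seq[Attribute],
      child: LogicalPlan
    ): LogicalPlan
}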

@chris-twiner (Contributor, Author)

> RE: Spark 4: if Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains the play2.9 libs.

This just pushes the problem out again and keeps frameless focused on Spark differences. Along comes 4.1.0 and breaks more, or 5, or ...

> RE: DBR compat layer: this PR is hard to review 🤔 we need to figure something out with fmt.

Yeah, there should have been a reformat-all-files PR immediately after adding it in; ah well, hindsight etc.

@pomadchin (Member)

@chris-twiner

> This just pushes the problem out again and keeps frameless focused on Spark differences. Along comes 4.1.0 and breaks more, or 5, or ...

This library is dependent on Spark, so I'd say we do the breaking change and move frameless to Spark 4, and keep maintenance releases if users ask for them.

Maintaining cross-major-version releases via a shims lib could be a bit too much.

BTW, the same concerns apply to the DB runtime:

  • how many versions are we gonna support?
  • who supports them?
  • can users support their own DB version runtime that is not yet published?

Maybe we could have some abstraction that users could override themselves, so we shift this responsibility to the user?

@chris-twiner (Contributor, Author)

> This library is dependent on Spark, so I'd say we do the breaking change and move frameless to Spark 4, and keep maintenance releases if users ask for them.
>
> Maintaining cross-major-version releases via a shims lib could be a bit too much.
>
> BTW, the same concerns apply to the DB runtime:
>
>   • how many versions are we gonna support?
>   • who supports them?
>   • can users support their own DB version runtime that is not yet published?
>
> Maybe we could have some abstraction that users could override themselves, so we shift this responsibility to the user?

I'll answer the last first: if it's typeclass-based they can throw their own in, and each location gets it (not sure this works for all of them off the top of my head, but I could try it out).

Version-wise, frameless should only support its usage interface; if someone (e.g. me) requires a funny version of frameless to work with an obscurity of Databricks, then let them provide the shim PR (or, per the above, a custom typeclass) and externalise the problem. The only things I'm aware of where this approach isn't easy are changes which stop udfs or lits working (the final gencode springs to mind, as does the foldable pushdown issue) and, of course, the whole agnostic encoder mess.

If you think the idea of using typeclasses to isolate this sounds reasonable I'm happy to give it a go.
