#385 document approach for chaining via thrush (#729)
* #385 - document chaining with and without using a thrush

* #385 - document chaining with and without using a thrush - wrong example copied...

* #385 - document chaining with and without using a thrush - cut not copied...

* Update FeatureOverview.md

---------

Co-authored-by: Cédric Chantepie <cchantep@users.noreply.github.com>
chris-twiner and cchantep committed Sep 9, 2023
1 parent 180eaf6 commit 737225b
Showing 2 changed files with 30 additions and 9 deletions.
3 changes: 2 additions & 1 deletion build.sbt
@@ -184,7 +184,8 @@ lazy val docs = project
addCompilerPlugin(
"org.typelevel" % "kind-projector" % "0.13.2" cross CrossVersion.full
),
scalacOptions += "-Ydelambdafy:inline"
scalacOptions += "-Ydelambdafy:inline",
libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
)
.dependsOn(dataset, cats, ml)

36 changes: 28 additions & 8 deletions docs/FeatureOverview.md
@@ -59,6 +59,7 @@ val aptTypedDs2 = aptDs.typed
```

## Typesafe column referencing

This is how we select a particular column from a `TypedDataset`:

```scala mdoc
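// The concrete example is elided by this diff view; as a hedged sketch,
// assuming the aptTypedDs dataset of apartments defined earlier in the doc:
aptTypedDs.select(aptTypedDs('city)).show(2).run()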
@@ -389,7 +390,6 @@ c.select(c.colMany('_1, 'city), c('_2)).show(2).run()

### Working with collections


```scala mdoc
import frameless.functions._
import frameless.functions.nonAggregate._
@@ -416,7 +416,6 @@ in Frameless `explode()` is part of `TypedDataset` and not a function of a column.
This provides additional safety since more than one `explode()` applied in a single
statement results in a runtime error in vanilla Spark.


```scala mdoc
val t2 = cityRatio.select(cityRatio('city), lit(List(1,2,3,4)))
val flattened = t2.explode('_2): TypedDataset[(String, Int)]
@@ -434,8 +433,6 @@ to a single column at a time.
}
```



### Collecting data to the driver

In Frameless all Spark actions (such as `collect()`) are safe.
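The concrete examples are elided in this diff view. As a hedged sketch (assuming the `Apartment` case class and the `aptTypedDs` dataset defined earlier in the document), an action such as `collect()` only returns a `Job` describing the work, and nothing executes until `run()` is called:

```scala
import frameless.Job

// Describing the action does not touch the cluster yet.
val collectJob: Job[Seq[Apartment]] = aptTypedDs.collect()

// Only now is the Spark action executed and the data brought to the driver.
val apartments: Seq[Apartment] = collectJob.run()
```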
@@ -462,7 +459,6 @@ cityBeds.limit(4).collect().run()

## Sorting columns


Only column types that can be sorted are allowed to be selected for sorting.

```scala mdoc
@@ -478,7 +474,6 @@ aptTypedDs.orderBy(
).show(2).run()
```
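The body of the call above is elided by the diff; as a hedged sketch (assuming the `Apartment` columns used throughout this document), a sorted selection looks like:

```scala
// Sort by city ascending, then by surface descending; only columns whose
// type has a CatalystOrdered instance are accepted here.
aptTypedDs.orderBy(
  aptTypedDs('city).asc,
  aptTypedDs('surface).desc
).show(2).run()
```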


## User Defined Functions

Frameless supports lifting any Scala function (up to five arguments) to the
@@ -577,7 +572,6 @@ In a DataFrame, if you just ignore types, this would equivalently be written as:
bedroomStats.dataset.toDF().filter($"AvgPriceBeds2".isNotNull)
```


### Entire TypedDataset Aggregation

We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
@@ -611,7 +605,6 @@ aptds.agg(
).show().run()
```
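The aggregation itself is elided by the diff; as a hedged sketch (assuming `aptds` is the `TypedDataset` alias used in the surrounding examples, and the aggregate functions come from `frameless.functions.aggregate._`):

```scala
import frameless.functions.aggregate._

// Aggregate over the entire dataset, no groupBy() needed.
aptds.agg(
  avg(aptds('price)),
  max(aptds('surface))
).show().run()
```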


## Joins

```scala mdoc:silent
@@ -646,6 +639,33 @@ withCityInfo.select(
).as[AptPriceCity].show().run
```

### Chained Joins

Joins, or any similar operation, may be chained using a thrush combinator, removing the need for intermediate values. Instead of:

```scala mdoc
val withBedroomInfoInterim = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
val withBedroomInfo = withBedroomInfoInterim
.joinLeft(bedroomStats)( withBedroomInfoInterim.col('_1).field('city) === bedroomStats('city) )

withBedroomInfo.show().run()
```

You can use `thrush` from [mouse](https://github.com/typelevel/mouse):

```scala
libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
```

```scala mdoc
import mouse.all._

val withBedroomInfoChained = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
.thrush( interim => interim.joinLeft(bedroomStats)( interim.col('_1).field('city) === bedroomStats('city) ) )

withBedroomInfoChained.show().run()
```
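For intuition, `thrush` is simply reversed function application: `a.thrush(f)` evaluates `f(a)`, which is what lets the interim join result be consumed inline without naming it. A minimal, self-contained illustration (not part of the Frameless docs):

```scala
import mouse.all._

// a.thrush(f) == f(a): the value on the left is fed to the function on the right.
val six: Int = 5.thrush((x: Int) => x + 1) // 6
```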

```scala mdoc:invisible
spark.stop()
```
