#385 document approach for chaining via thrush (#729)
* #385 - document chaining with and without using a thrush

* #385 - document chaining with and without using a thrush - wrong example copied...

* #385 - document chaining with and without using a thrush - cut not copied...

* Update FeatureOverview.md

---------

Co-authored-by: Cédric Chantepie <cchantep@users.noreply.github.com>
chris-twiner and cchantep committed Sep 9, 2023
1 parent 180eaf6 commit 737225b
Showing 2 changed files with 30 additions and 9 deletions.
3 changes: 2 additions & 1 deletion build.sbt
@@ -184,7 +184,8 @@ lazy val docs = project
addCompilerPlugin(
"org.typelevel" % "kind-projector" % "0.13.2" cross CrossVersion.full
),
scalacOptions += "-Ydelambdafy:inline"
scalacOptions += "-Ydelambdafy:inline",
libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
)
.dependsOn(dataset, cats, ml)

36 changes: 28 additions & 8 deletions docs/FeatureOverview.md
@@ -59,6 +59,7 @@ val aptTypedDs2 = aptDs.typed
```

## Typesafe column referencing

This is how we select a particular column from a `TypedDataset`:

```scala mdoc
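// The concrete example is elided by this diff view; as a hedged sketch,
// assuming the aptTypedDs dataset of apartments defined earlier in the doc:
aptTypedDs.select(aptTypedDs('city)).show(2).run()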
@@ -389,7 +390,6 @@ c.select(c.colMany('_1, 'city), c('_2)).show(2).run()

### Working with collections


```scala mdoc
import frameless.functions._
import frameless.functions.nonAggregate._
@@ -416,7 +416,6 @@ in Frameless `explode()` is part of `TypedDataset` and not a function of a column.
This provides additional safety since more than one `explode()` applied in a single
statement results in a runtime error in vanilla Spark.


```scala mdoc
val t2 = cityRatio.select(cityRatio('city), lit(List(1,2,3,4)))
val flattened = t2.explode('_2): TypedDataset[(String, Int)]
@@ -434,8 +433,6 @@ to a single column at a time.
}
```



### Collecting data to the driver

In Frameless all Spark actions (such as `collect()`) are safe.
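The concrete examples are elided in this diff view. As a hedged sketch (assuming the `Apartment` case class and the `aptTypedDs` dataset defined earlier in the document), an action such as `collect()` only returns a `Job` describing the work, and nothing executes until `run()` is called:

```scala
import frameless.Job

// Describing the action does not touch the cluster yet.
val collectJob: Job[Seq[Apartment]] = aptTypedDs.collect()

// Only now is the Spark action executed and the data brought to the driver.
val apartments: Seq[Apartment] = collectJob.run()
```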
@@ -462,7 +459,6 @@ cityBeds.limit(4).collect().run()

## Sorting columns


Only column types that can be sorted are allowed to be selected for sorting.

```scala mdoc
@@ -478,7 +474,6 @@ aptTypedDs.orderBy(
).show(2).run()
```
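The body of the call above is elided by the diff; as a hedged sketch (assuming the `Apartment` columns used throughout this document), a sorted selection looks like:

```scala
// Sort by city ascending, then by surface descending; only columns whose
// type has a CatalystOrdered instance are accepted here.
aptTypedDs.orderBy(
  aptTypedDs('city).asc,
  aptTypedDs('surface).desc
).show(2).run()
```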


## User Defined Functions

Frameless supports lifting any Scala function (up to five arguments) to the
@@ -577,7 +572,6 @@ In a DataFrame, if you just ignore types, this would equivalently be written as:
bedroomStats.dataset.toDF().filter($"AvgPriceBeds2".isNotNull)
```


### Entire TypedDataset Aggregation

We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
@@ -611,7 +605,6 @@ aptds.agg(
).show().run()
```
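The aggregation itself is elided by the diff; as a hedged sketch (assuming `aptds` is the `TypedDataset` alias used in the surrounding examples, and the aggregate functions come from `frameless.functions.aggregate._`):

```scala
import frameless.functions.aggregate._

// Aggregate over the entire dataset, no groupBy() needed.
aptds.agg(
  avg(aptds('price)),
  max(aptds('surface))
).show().run()
```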


## Joins

```scala mdoc:silent
@@ -646,6 +639,33 @@ withCityInfo.select(
).as[AptPriceCity].show().run
```

### Chained Joins

Joins, or any similar operation, may be chained using a thrush combinator, removing the need for intermediate values. Instead of:

```scala mdoc
val withBedroomInfoInterim = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
val withBedroomInfo = withBedroomInfoInterim
.joinLeft(bedroomStats)( withBedroomInfoInterim.col('_1).field('city) === bedroomStats('city) )

withBedroomInfo.show().run()
```

You can use `thrush` from [mouse](https://github.com/typelevel/mouse):

```scala
libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
```

```scala mdoc
import mouse.all._

val withBedroomInfoChained = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
.thrush( interim => interim.joinLeft(bedroomStats)( interim.col('_1).field('city) === bedroomStats('city) ) )

withBedroomInfoChained.show().run()
```
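For intuition, `thrush` is simply reversed function application: `a.thrush(f)` evaluates `f(a)`, which is what lets the interim join result be consumed inline without naming it. A minimal, self-contained illustration (not part of the Frameless docs):

```scala
import mouse.all._

// a.thrush(f) == f(a): the value on the left is fed to the function on the right.
val six: Int = 5.thrush((x: Int) => x + 1) // 6
```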

```scala mdoc:invisible
spark.stop()
```
