Sort columns #236

frosforever · 2018-01-29T02:49:58Z

Connects to #231 & #225
As per #231 (comment), this PR splits out the sorting of dataset work from #225.
This PR leaves struct and UserDefinedTypes for a later submission.

This PR takes a slightly different approach from #225: it reuses the existing CatalystOrdered type class rather than introducing a new CatalystRowOrdered. This presents an issue, however, because some types are row-orderable but not comparable, so with the current Spark 2.2 invalid comparisons are now allowed. That was the original motivation for introducing a separate type class for row ordering. See the conversation around https://github.com/typelevel/frameless/pull/225/files#r158530019.

This also introduces broken support for comparing optional columns. Such comparisons don't blow up but instead always return null. This differs from how scala.math.Ordering[Option[A]] works and results in a failing property test. It also means that

frameless/dataset/src/main/scala/frameless/TypedColumn.scala

Lines 520 to 530 in f71e0a1

    
    implicit class OrderedTypedColumnSyntax[T, U: CatalystOrdered](col: TypedColumn[T, U]) {
      def <(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped < other.untyped).typed
      def <=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped <= other.untyped).typed
      def >(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped > other.untyped).typed
      def >=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped >= other.untyped).typed

      def <(other: U): TypedColumn[T, Boolean] = (col.untyped < lit(other)(col.uencoder).untyped).typed
      def <=(other: U): TypedColumn[T, Boolean] = (col.untyped <= lit(other)(col.uencoder).untyped).typed
      def >(other: U): TypedColumn[T, Boolean] = (col.untyped > lit(other)(col.uencoder).untyped).typed
      def >=(other: U): TypedColumn[T, Boolean] = (col.untyped >= lit(other)(col.uencoder).untyped).typed
    }

can now be used on an optional column and return TypedColumn[T, Boolean] rather than the more appropriate TypedColumn[T, Option[Boolean]]. This could possibly be fixed with a LowPriorityImplicit pattern.
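As a rough illustration of the low-priority implicit idea mentioned above, here is a minimal sketch in plain Scala with no Spark or frameless dependency. `Col`, `LowPrioritySyntax`, `OrderedOps` and `OptionalOrderedOps` are hypothetical stand-ins invented for this example and are not frameless APIs; the sketch only shows how a more specific syntax class for `Option` columns can take precedence over a general one.

```scala
object LowPriorityDemo {
  // Hypothetical stand-in for a typed column: wraps a plain value, not a Spark expression.
  final case class Col[A](value: A)

  // General syntax, placed in a parent trait so that it loses to anything
  // defined directly in the object that extends it.
  trait LowPrioritySyntax {
    implicit class OrderedOps[A](col: Col[A])(implicit ord: Ordering[A]) {
      def <(other: Col[A]): Col[Boolean] = Col(ord.lt(col.value, other.value))
    }
  }

  // Syntax for optional columns: a missing operand yields a missing result,
  // so the result type is Option[Boolean] rather than Boolean.
  object syntax extends LowPrioritySyntax {
    implicit class OptionalOrderedOps[A](col: Col[Option[A]])(implicit ord: Ordering[A]) {
      def <(other: Col[Option[A]]): Col[Option[Boolean]] =
        Col(for { a <- col.value; b <- other.value } yield ord.lt(a, b))
    }
  }

  def main(args: Array[String]): Unit = {
    import syntax._
    // Non-optional column: resolved through OrderedOps.
    println(Col(1) < Col(2))
    // Optional column: both syntax classes apply, the more specific one is chosen.
    println(Col(Option(1)) < Col(Option.empty[Int]))
  }
}
```

Whether the same layering fits CatalystOrdered and TypedColumn in practice is an open question; the sketch only demonstrates the resolution mechanism.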

Alternatively, we can leave out Option support and remove the descNoneFirst etc. APIs along with the ability to sort datasets by Option columns, which would effectively be #231. In that case we might as well merge that one.

frosforever · 2018-01-29T02:51:03Z

dataset/src/test/scala/frameless/ColumnTests.scala

@@ -45,6 +45,22 @@ class ColumnTests extends TypedDatasetSuite {
    check(forAll(prop[SQLTimestamp] _))
    check(forAll(prop[String] _))
    check(forAll(prop[Instant] _))
+
+      /*
+      Fails at runtime!!


This currently fails in Spark 2.2, though it seems to pass when tested against Spark 2.3-SNAPSHOT. Not sure how to handle this in the meantime.

frosforever · 2018-01-29T02:51:31Z

dataset/src/test/scala/frameless/ColumnTests.scala

+//    check(forAll(prop[List[Int]] _)) //Fails at runtime!!
+
+    /*
+    Fails test!!


Introducing Option support for comparable is not as straightforward.

@frosforever How about we keep this PR focused on sorting basic columns?

We can always do the Option comparison in another PR. Same for collections.
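To make the mismatch described in the PR description concrete, the following plain-Scala sketch contrasts the standard library's ordering of Option values with a null-propagating comparison. `sqlLessThan` is a hypothetical toy model of the "always returns null" behaviour reported above, not Spark or frameless code.

```scala
object OptionOrderingDemo {
  // Scala's built-in Ordering for Option treats None as smaller than any Some(_).
  val sortedOptions: List[Option[Int]] =
    List[Option[Int]](Some(2), None, Some(1)).sorted

  // Toy model of a null-propagating comparison: if either side is missing,
  // the result is missing as well.
  def sqlLessThan(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for { x <- a; y <- b } yield x < y

  def main(args: Array[String]): Unit = {
    println(sortedOptions)                            // List(None, Some(1), Some(2))
    println(Ordering[Option[Int]].lt(None, Some(1)))  // true
    println(sqlLessThan(None, Some(1)))               // None
  }
}
```

A property test that compares results against scala.math.Ordering[Option[A]] expects a definite true or false for every pair, which a null-propagating comparison cannot provide when one side is missing.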

codecov-io · 2018-01-29T13:58:00Z

Codecov Report

Merging #236 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   96.96%   96.99%   +0.02%     
==========================================
  Files          52       52              
  Lines         858      866       +8     
  Branches       12       10       -2     
==========================================
+ Hits          832      840       +8     
  Misses         26       26

Impacted Files	Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala	`100% <ø> (ø)`	⬆️
dataset/src/main/scala/frameless/TypedColumn.scala	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 576eb67...cc288aa. Read the comment docs.

frosforever · 2018-01-29T14:00:51Z

@imarios I've removed the option and collection sorting which makes this a near copy of #231 with the slight addition of the default ascending sort order. Feel free to reject this in favor of #231 if you think it more appropriate.
Would love to hear your input on tackling Option sorting!
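For readers unfamiliar with how a default ascending sort order can be supplied, one common approach is an implicit conversion in the companion object of the sorted-column type. The sketch below is a toy model in plain Scala; `Column`, `SortedColumn`, `Direction` and `orderBy` are hypothetical names for this example, and `Ordering` stands in for an ordering type class such as CatalystOrdered.

```scala
import scala.language.implicitConversions

object DefaultAscendingDemo {
  sealed trait Direction
  case object Asc extends Direction
  case object Desc extends Direction

  // Hypothetical unsorted column; asc/desc pick an explicit direction.
  final case class Column[A](name: String) {
    def asc: SortedColumn = SortedColumn(name, Asc)
    def desc: SortedColumn = SortedColumn(name, Desc)
  }

  final case class SortedColumn(name: String, direction: Direction)

  object SortedColumn {
    // Lives in the companion, so it is found without an import.
    // Only columns whose element type has an Ordering are converted.
    implicit def defaultAscending[A: Ordering](c: Column[A]): SortedColumn = c.asc
  }

  // Accepts sorted columns only; plain columns are converted to ascending.
  def orderBy(cols: SortedColumn*): List[SortedColumn] = cols.toList

  def main(args: Array[String]): Unit = {
    println(orderBy(Column[Int]("age"), Column[String]("name").desc))
  }
}
```

Because the conversion requires an `Ordering` instance, passing a column of a type without one would be rejected at compile time in this model.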

imarios · 2018-01-30T05:38:56Z

dataset/src/main/scala/frameless/TypedDataset.scala

+      TypedDataset.create[T](sorted)
+    }
+  }
+


@frosforever, can we do something similar to what we did with orderByMany for sortWithinPartitions? Maybe even call it sortWithinPartitionsMany and have sortWithinPartitions overloaded for up to 3 args.
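The shape of that suggestion can be sketched as fixed-arity overloads that delegate to a single variadic method. This is a simplified model in plain Scala: `Dataset` and `SortedColumn` here are hypothetical toy types, and the real frameless methods may well be implemented differently.

```scala
object SortOverloadsDemo {
  // Hypothetical sort key: a column name plus a direction flag.
  final case class SortedColumn(name: String, ascending: Boolean)

  // Toy dataset that only records which sort keys were requested.
  final class Dataset(val sortKeys: List[SortedColumn]) {
    // General entry point taking any number of sort columns.
    def sortWithinPartitionsMany(cols: SortedColumn*): Dataset =
      new Dataset(cols.toList)

    // Convenience overloads for one to three columns.
    def sortWithinPartitions(a: SortedColumn): Dataset =
      sortWithinPartitionsMany(a)

    def sortWithinPartitions(a: SortedColumn, b: SortedColumn): Dataset =
      sortWithinPartitionsMany(a, b)

    def sortWithinPartitions(a: SortedColumn, b: SortedColumn, c: SortedColumn): Dataset =
      sortWithinPartitionsMany(a, b, c)
  }

  def main(args: Array[String]): Unit = {
    val ds = new Dataset(Nil)
    val sorted = ds.sortWithinPartitions(SortedColumn("a", true), SortedColumn("b", false))
    println(sorted.sortKeys)
  }
}
```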

imarios · 2018-01-30T05:40:48Z

dataset/src/main/scala/frameless/TypedColumn.scala

+    */
+  def desc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
+    new SortedTypedColumn[T, U](withExpr {
+      SortOrder(expr, Descending)


I might prefer this (FramelessInternals.expr(untyped.desc)) slightly more since it uses the externally facing desc method rather than the "internal" expression SortOrder. What do you think?

imarios · 2018-01-30T05:46:01Z

@frosforever no way, this looks better for sure :).

the merge conflict must be quite easy to resolve. let me know if you need any help. Let me think about sorting for Options/Collections.

imarios · 2018-02-07T17:31:07Z

@frosforever I am putting my project manager hat on :D and dare to ask about fixing test coverage and merge conflicts. this is blocking 0.5 haha (let me know if you need any help, I can probably work out of your branch)

frosforever · 2018-02-07T18:33:12Z

@imarios Gah! So sorry for the delay on this, I've been pretty busy recently. I'll do my very best to get on this by today, if not the end of this weekend. If there's a need for this sooner, I'm happy to let someone else finish it.

At the very least, I've fixed the merge conflicts.

frosforever · 2018-02-08T14:46:15Z

@imarios fixed code coverage and added requested methods

imarios · 2018-02-08T16:01:59Z

dataset/src/main/scala/frameless/TypedColumn.scala

+    * apache/spark
+    */
+  def asc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
+    new SortedTypedColumn[T, U](FramelessInternals.expr(untyped.asc))


This one might work: new SortedTypedColumn[T, U](untyped.asc)

imarios · 2018-02-08T16:04:43Z

dataset/src/main/scala/frameless/TypedColumn.scala

+
+object SortedTypedColumn {
+  implicit def defaultAscending[T, U : CatalystOrdered](typedColumn: TypedColumn[T, U]): SortedTypedColumn[T, U] =
+      new SortedTypedColumn[T, U](new Column(SortOrder(typedColumn.expr, Ascending)))(typedColumn.uencoder)


If this works and looks simpler:
new SortedTypedColumn[T, U](typedColumn.untyped.asc)(typedColumn.uencoder)

how frustrating. sorry. fixed.

frosforever · 2018-02-08T16:36:48Z

@imarios done.
I highly recommend a squash merge when it's time. This PR has gone all over the place.

imarios · 2018-02-09T03:46:47Z

@frosforever great work! I love it

frosforever mentioned this pull request Jan 29, 2018

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

Closed

imarios reviewed Jan 30, 2018

imarios added this to the 0.5-release milestone Jan 30, 2018

frosforever added 6 commits February 7, 2018 13:31

Add dataset orderBy

e2c4b9e

flush out tests a bit

4448951

fail compile test on unsortable

66e5029

add comment about currently failling tests with CatalystOrdered

cf40f4b

clean up

c434660

gut option and collection sorting

5d82a06

frosforever force-pushed the sort-columns branch from dd6ec63 to 5d82a06 Compare February 7, 2018 18:33

frosforever added 4 commits February 7, 2018 13:41

use frameless internals rather than explicit SortOrder

1be4288

add sortWithinPartitionN and Many

7df2ae6

add sortWithinPartitionsTest

22d432b

add tests for default ordering in Many sorts

5102003

imarios reviewed Feb 8, 2018

clean up untyped expr

e866091

and another removal of explicit SortOrder

cc288aa

imarios merged commit ba9abbe into typelevel:master Feb 9, 2018

frosforever deleted the sort-columns branch February 9, 2018 18:48

This was referenced Feb 10, 2018

Window dense rank #248

Closed

[Final] More examples to highlight new 0.5 features. #241

Merged

frosforever mentioned this pull request Apr 10, 2018

Allow column operations on 'Option[X]' that are valid for 'X' #204

Closed
