Sort columns #236
@@ -45,6 +45,22 @@ class ColumnTests extends TypedDatasetSuite {
check(forAll(prop[SQLTimestamp] _))
check(forAll(prop[String] _))
check(forAll(prop[Instant] _))

/*
Fails at runtime!!
This would currently fail in Spark 2.2, though testing on Spark 2.3-SNAPSHOT seems to pass. Not sure how to handle this in the meantime.
// check(forAll(prop[List[Int]] _)) //Fails at runtime!!

/*
Fails test!!
Introducing Option support for comparable is not as straightforward.
@frosforever How about we keep this PR focused on sorting basic columns?
We can always do the Option comparison in another PR. Same for collections.
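For context on why Option comparison is harder than sorting basic columns, here is a minimal plain-Scala sketch of the mismatch. The standard library ordering places None before every Some, whereas a null-propagating comparison produces no answer at all when either side is missing. The names OptionOrderingDemo and sqlLessThan are hypothetical, and the null-propagating function only models the behavior described in this thread; it is not Spark or frameless code.

```scala
object OptionOrderingDemo {
  // Standard library ordering: None sorts before every Some.
  val sortedAsc: List[Option[Int]] =
    List[Option[Int]](Some(3), None, Some(1)).sorted

  // With scala.math.Ordering, comparing None to Some gives a definite answer.
  val noneBeforeSome: Boolean =
    Ordering[Option[Int]].lt(None, Some(1))

  // Hypothetical model of a null-propagating comparison:
  // if either operand is missing, the result is missing too.
  def sqlLessThan(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for { x <- a; y <- b } yield x < y
}
```

A property test that checks a column comparison against Ordering[Option[A]] would therefore disagree whenever one side is None.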
Codecov Report
@@ Coverage Diff @@
## master #236 +/- ##
==========================================
+ Coverage 96.96% 96.99% +0.02%
==========================================
Files 52 52
Lines 858 866 +8
Branches 12 10 -2
==========================================
+ Hits 832 840 +8
Misses 26 26
TypedDataset.create[T](sorted)
}
}
@frosforever, can we do something similar to what we did with orderByMany for sortWithinPartitions? Maybe even call it sortWithinPartitionsMany and have some sortWithinPartitions overloaded for up to 3 args.
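The shape of that suggestion can be sketched with toy types: one general "Many" entry point, plus fixed-arity overloads for one to three arguments that delegate to it. Everything below is hypothetical and self-contained. SortKey stands in for a sorted column, the methods return a description string instead of a dataset, and Seq is a simplification that makes no claim about how the real orderByMany accepts its arguments.

```scala
object SortOverloadsSketch {
  // Toy stand-in for a sorted column (hypothetical, not a frameless type).
  final case class SortKey(column: String, descending: Boolean = false)

  // General entry point: accepts any number of sort keys.
  def sortWithinPartitionsMany(keys: Seq[SortKey]): String =
    keys
      .map(k => if (k.descending) s"${k.column} DESC" else s"${k.column} ASC")
      .mkString(", ")

  // Convenience overloads for up to three keys, all delegating to the general version.
  def sortWithinPartitions(a: SortKey): String =
    sortWithinPartitionsMany(Seq(a))

  def sortWithinPartitions(a: SortKey, b: SortKey): String =
    sortWithinPartitionsMany(Seq(a, b))

  def sortWithinPartitions(a: SortKey, b: SortKey, c: SortKey): String =
    sortWithinPartitionsMany(Seq(a, b, c))
}
```

Keeping the logic in the "Many" variant means the overloads stay one-liners and cannot drift apart.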
*/
def desc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](withExpr {
SortOrder(expr, Descending)
I might prefer this (FramelessInternals.expr(untyped.desc)) slightly more since it uses the externally facing desc method rather than the "internal" expression SortOrder. What do you think?
@frosforever no way, this looks better for sure :). the merge conflict must be quite easy to resolve. let me know if you need any help. Let me think about sorting for Options/Collections.
@frosforever I am putting my project manager hat on :D and dare to ask about fixing test coverage and merge conflicts. this is blocking 0.5 haha (let me know if you need any help, I can probably work out of your branch)
@imarios Gah! So sorry for the delay on this, I've been pretty busy recently. I'll do my very best to get on this by today, if not the end of this weekend. If there's a need for this sooner, I'm happy to let someone else finish it. At the very least, I've fixed the merge conflicts.
dd6ec63 to 5d82a06
@imarios fixed code coverage and added requested methods
* apache/spark
*/
def asc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](FramelessInternals.expr(untyped.asc))
This one might work: new SortedTypedColumn[T, U](untyped.asc)
object SortedTypedColumn {
implicit def defaultAscending[T, U : CatalystOrdered](typedColumn: TypedColumn[T, U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](new Column(SortOrder(typedColumn.expr, Ascending)))(typedColumn.uencoder)
If this works and looks simpler:
new SortedTypedColumn[T, U](typedColumn.untyped.asc)(typedColumn.uencoder)
^
^
how frustrating. sorry. fixed.
@imarios done.
@frosforever great work! I love it
Connects to #231 & #225
As per #231 (comment), this PR splits out the sorting of dataset work from #225.
This PR leaves struct and UserDefinedTypes for a later submission.

This PR takes a slightly different approach from #225 by leveraging the existing CatalystOrdered type class rather than introducing a new CatalystRowOrdered. However, this presents an issue, as there are things that are row orderable that are not comparable. This means that currently (Spark 2.2) invalid comparisons are now allowed. This was the original motivation for introducing a separate type class for row ordering. See the conversation around https://github.com/typelevel/frameless/pull/225/files#r158530019.

This also introduces broken support for comparing optional columns. These don't blow up but rather always return null. This is different from how scala.math.Ordering[Option[A]] works and results in a failing property test. It also means that the code at frameless/dataset/src/main/scala/frameless/TypedColumn.scala (lines 520 to 530 in f71e0a1) returns TypedColumn[T, Boolean] rather than the more appropriate Option[Boolean]. This can possibly be fixed with a LowPriorityImplicit pattern.

Alternatively, we can leave out Option support, remove the descNoneFirst etc. APIs and the ability to sort datasets using Option columns, which would effectively be #231. Might as well merge that one.
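As a rough illustration of the low-priority implicit idea mentioned above, the sketch below selects a Boolean result type by default and an Option[Boolean] result type when the compared type is an Option. The type class ComparisonResult and all of its members are hypothetical names invented for this example; they are not part of frameless, and the sketch only demonstrates the implicit-priority mechanism in plain Scala.

```scala
// Hypothetical type class: picks the result type of comparing values of type U.
trait ComparisonResult[U] {
  type Out
  def describe: String
}

// Fallback instance lives in a parent trait, which gives it lower priority
// than instances defined directly in the companion object.
trait LowPriorityComparisonResult {
  implicit def nonOptional[U]: ComparisonResult[U] { type Out = Boolean } =
    new ComparisonResult[U] {
      type Out = Boolean
      val describe = "Boolean"
    }
}

object ComparisonResult extends LowPriorityComparisonResult {
  // Preferred whenever U is an Option: comparisons may have no answer.
  implicit def optional[A]: ComparisonResult[Option[A]] { type Out = Option[Boolean] } =
    new ComparisonResult[Option[A]] {
      type Out = Option[Boolean]
      val describe = "Option[Boolean]"
    }

  // Helper that reports which instance the compiler selected.
  def resultTypeOf[U](implicit r: ComparisonResult[U]): String = r.describe
}
```

Under these assumptions, ComparisonResult.resultTypeOf[Int] resolves to the fallback instance, while ComparisonResult.resultTypeOf[Option[Int]] resolves to the Option-specific one without any ambiguity error.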