Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort columns #236

Merged
merged 12 commits into from
Feb 9, 2018
Merged

Sort columns #236

merged 12 commits into from
Feb 9, 2018

Conversation

frosforever
Copy link
Contributor

@frosforever frosforever commented Jan 29, 2018

Connects to #231 & #225
As per #231 (comment), this PR splits out the sorting of dataset work from #225.
This PR leaves stuct and UserDefinedTypes for a later submission.

This PR takes a slightly different approach from #225 by leveraging the existing CatalystOrdered type class rather than introducing a new CatalystRowOrdered. However, this presents an issue as there are things that are row orderable, that are not comparable. This means that currently (spark 2.2) invalid comparisons are now allowed. This was the original motivation for introducing a separate type class for row ordering. See the conversation around https://github.com/typelevel/frameless/pull/225/files#r158530019.

This also introduces broken support for comparing optional columns. These don't blow up but rather always returns null. This is different from how scala.math.Ordering[Option[A]] works and results in failing property test. It also means that

implicit class OrderedTypedColumnSyntax[T, U: CatalystOrdered](col: TypedColumn[T, U]) {
def <(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped < other.untyped).typed
def <=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped <= other.untyped).typed
def >(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped > other.untyped).typed
def >=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = (col.untyped >= other.untyped).typed
def <(other: U): TypedColumn[T, Boolean] = (col.untyped < lit(other)(col.uencoder).untyped).typed
def <=(other: U): TypedColumn[T, Boolean] = (col.untyped <= lit(other)(col.uencoder).untyped).typed
def >(other: U): TypedColumn[T, Boolean] = (col.untyped > lit(other)(col.uencoder).untyped).typed
def >=(other: U): TypedColumn[T, Boolean] = (col.untyped >= lit(other)(col.uencoder).untyped).typed
}
can now be used on an optional column and return TypedColumn[T, Boolean] rather than the more appropriate Option[Boolean]. This can possibly be fixed with a LowPriorityImplicit pattern.

Alternatively, we can leave out Option support, remove the descNoneFirst etc apis and the ability to sort datasets using Option columns which would effectively be #231. Might as well merge that one.

@@ -45,6 +45,22 @@ class ColumnTests extends TypedDatasetSuite {
check(forAll(prop[SQLTimestamp] _))
check(forAll(prop[String] _))
check(forAll(prop[Instant] _))

/*
Fails at runtime!!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would currently fail in spark 2.2 though testing on spark 2.3-SNAPSHOT seems to pass. Not sure how to handle this in the meantime.

// check(forAll(prop[List[Int]] _)) //Fails at runtime!!

/*
Fails test!!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing Option support for comparable is not as straight forward.

Copy link
Contributor

@imarios imarios Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frosforever How about we keep this PR focused on sorting basic columns?

Copy link
Contributor

@imarios imarios Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always do the Option comparison in another PR. Same for collections.

@codecov-io
Copy link

codecov-io commented Jan 29, 2018

Codecov Report

Merging #236 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   96.96%   96.99%   +0.02%     
==========================================
  Files          52       52              
  Lines         858      866       +8     
  Branches       12       10       -2     
==========================================
+ Hits          832      840       +8     
  Misses         26       26
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <ø> (ø) ⬆️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 576eb67...cc288aa. Read the comment docs.

@frosforever
Copy link
Contributor Author

@imarios I've removed the option and collection sorting which makes this a near copy of #231 with the slight addition of the default ascending sort order. Feel free to reject this in favor of #231 if you think it more appropriate.
Would love to hear your input on tackling Option sorting!

TypedDataset.create[T](sorted)
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frosforever, we can do something similar to what we did with orderByMany for sortWithinPartitions? Maybe even call it sortWithinPartitionsMany and have some sortWithinPartitions overloaded for up to 3 args.

*/
def desc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](withExpr {
SortOrder(expr, Descending)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might prefer this (FramelessInternals.expr(untyped.desc)) slightly more since it uses the externally facing desc method rather than the "internal" expression SortOrder. What do you think?

@imarios
Copy link
Contributor

imarios commented Jan 30, 2018

@frosforever no way, this looks better for sure :).

the merge conflict must be quite easy to resolve. let me know if you need any help. Let me think about sorting for Options/Collections.

@imarios imarios added this to the 0.5-release milestone Jan 30, 2018
@imarios
Copy link
Contributor

imarios commented Feb 7, 2018

@frosforever I am putting my project manager hat on :D and dare to ask about fixing test coverage and merge conflicts. this is blocking 0.5 haha (let me know if you need any help, I can probably work out of your branch)

@frosforever
Copy link
Contributor Author

frosforever commented Feb 7, 2018

@imarios Gah! So sorry for the delay on this, I've been pretty busy recently. I'll do my very best to get on this by today, if not the end of this weekend. If there's a need for this sooner, I'm happy to let someone else finish it.

At the very least, I've fixed the merge conflicts.

@frosforever
Copy link
Contributor Author

@imarios fixed code coverage and added requested methods

* apache/spark
*/
def asc(implicit catalystOrdered: CatalystOrdered[U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](FramelessInternals.expr(untyped.asc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one might work: new SortedTypedColumn[T, U](untyped.asc)


object SortedTypedColumn {
implicit def defaultAscending[T, U : CatalystOrdered](typedColumn: TypedColumn[T, U]): SortedTypedColumn[T, U] =
new SortedTypedColumn[T, U](new Column(SortOrder(typedColumn.expr, Ascending)))(typedColumn.uencoder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works and looks simpler:
new SortedTypedColumn[T, U](typedColumn.untyped.asc)(typedColumn.uencoder)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how frustrating. sorry. fixed.

@frosforever
Copy link
Contributor Author

@imarios done.
I highly recommend a squash merge when it's time. This PR has gone all over the place.

@imarios imarios merged commit ba9abbe into typelevel:master Feb 9, 2018
@imarios
Copy link
Contributor

imarios commented Feb 9, 2018

@frosforever great work! I love it

@frosforever frosforever deleted the sort-columns branch February 9, 2018 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants