Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added functions for join-filtering. #202

Merged
merged 1 commit into from
Jul 26, 2016
Merged

Conversation

andrewsmartin
Copy link
Contributor

@andrewsmartin andrewsmartin commented Jul 24, 2016

@nevillelyh @rav RFC please. I often see code to filter an SCollection[(K, V)] by another one of type SCollection[K], where (k, v) pairs are filtered out if k is not present on the right hand side. It usually looks like this:

val s1: SCollection[(K, V)] = getIt()
val s2: SCollection[K] = getThat()

val filtered: SCollection[(K, V)] = 
    s1.join(s2.distinct.map(_, ())).map{ case(k, (v, _)) => (k, v)}

I find this pattern doesn't really speak to the high level intent of the operation, and personally it always takes me a bit of staring to understand what's happening when I see it. So I thought it might be useful to include in the core API. The equivalent of the above would look like this:

val s1: SCollection[(K, V)] = getIt()
val s2: SCollection[K] = getThat()

val filtered: SCollection[(K, V)] = s1.filterJoin(s2)

Do you consider this useful at all? Is there a more obvious way of doing this that I'm not thinking about? If you do think it's useful, any suggestions on naming? filterJoin is the first thing that came to mind since I'm implementing this on top of join but there's probably a better name. Similar functionality is achieved in a plain SCollection[K] via intersect, which internally converts both collections to a dummy SCollection[(K, V)], and inner join is the equivalent for SCollection[(K, V)], SCollection[(K, W)]. Maybe keyIntersect is a better name.

@andrewsmartin andrewsmartin force-pushed the andrew/filter-join branch 3 times, most recently from 992adc7 to c3f53c4 Compare July 24, 2016 01:29
@codecov-io
Copy link

codecov-io commented Jul 24, 2016

Current coverage is 75.26% (diff: 100%)

Merging #202 into master will increase coverage by 0.03%

@@             master       #202   diff @@
==========================================
  Files            61         61          
  Lines          2160       2163     +3   
  Methods        1998       1977    -21   
  Messages          0          0          
  Branches        162        186    +24   
==========================================
+ Hits           1625       1628     +3   
  Misses          535        535          
  Partials          0          0          

Powered by Codecov. Last update 72b4a31...3da9919

@ravwojdyla
Copy link
Contributor

ravwojdyla commented Jul 25, 2016

Would not use join in the name (the implementation should most likely use cogroup instead of joins), some variation of intersect (intersectByKey) makes more sense.

}
}

it should "support intersectByKey() with duplicate keys" in {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to test on empty left&right hand side?

@ravwojdyla
Copy link
Contributor

+1 pls squash commits

@nevillelyh nevillelyh merged commit a429a91 into master Jul 26, 2016
nevillelyh pushed a commit that referenced this pull request Jul 26, 2016
@nevillelyh nevillelyh deleted the andrew/filter-join branch July 29, 2016 14:45
nevillelyh pushed a commit that referenced this pull request Oct 21, 2016
nevillelyh pushed a commit that referenced this pull request Jan 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants