WIP: Recursively handle pdist's diagonal chunks #90

jakirkham · 2017-10-09T05:24:13Z

Follow-up to PR #84.
Fixes #92

Use cdist to compute the bulk of the results for pdist. Drop all duplicate chunks on the opposite side of the diagonal, and keep all chunks on the right side of the diagonal.

As for any chunks that straddle the diagonal, recursively break them up into smaller pieces. If the pieces are on the right side of the diagonal, they are trivially handled with cdist. If they are on or beyond the diagonal, they are trivially dropped. If they still land on the diagonal, repeat the process by calling into pdist again until they are resolved in one of these two ways.
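
A minimal sketch of the three cases, assuming the points are chunked the same way along both axes of the full distance matrix so that diagonal blocks are square. The function name `classify_blocks` and the labels are hypothetical, not taken from the PR.

```python
def classify_blocks(num_chunks):
    """Label every (row_chunk, col_chunk) block of the full distance matrix.

    Hypothetical helper for illustration only.
    """
    labels = {}
    for i in range(num_chunks):
        for j in range(num_chunks):
            if j > i:
                # Block lies on the kept side of the diagonal.
                labels[(i, j)] = "cdist"
            elif j < i:
                # Mirror image of a kept block, so it is a duplicate.
                labels[(i, j)] = "drop"
            else:
                # Block straddles the diagonal and needs further splitting.
                labels[(i, j)] = "recurse"
    return labels
```

With 3 chunks this gives 3 blocks for cdist, 3 dropped blocks, and 3 diagonal blocks to recurse into.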

Since pdist returns its results in vector form, the recursive portion uses squareform to convert them back into square matrices, which are easier to work with. The results are then unraveled again according to the constraints of pdist, so the brief restructuring with squareform is merely a convenience that allows recursive calls of pdist to proceed without issues.
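
A rough sketch of the recursion described above, using plain Python lists in place of Dask arrays and a single split in half instead of real chunk boundaries. All names here (`cdist`, `to_square`, `pdist_via_square`, `leaf`) are stand-ins written for this illustration, not the code in this PR.

```python
import math


def cdist(a, b):
    """Distances between every point in a and every point in b."""
    return [[math.dist(p, q) for q in b] for p in a]


def to_square(condensed, n):
    """Expand a condensed (vector-form) result into an n x n matrix."""
    square = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j] = square[j][i] = condensed[k]
            k += 1
    return square


def pdist_via_square(points, leaf=2):
    """Pairwise distances in vector form; leaf must be at least 1."""
    n = len(points)
    if n <= leaf:
        # Small enough to resolve directly.
        return [math.dist(points[i], points[j])
                for i in range(n) for j in range(i + 1, n)]
    mid = n // 2
    top, bottom = points[:mid], points[mid:]
    # Diagonal block: recurse, then restructure into a square matrix.
    top_square = to_square(pdist_via_square(top, leaf), mid)
    # Block beside the diagonal: plain cdist.
    cross = cdist(top, bottom)
    out = []
    for i in range(mid):
        # Unravel row by row to restore the ordering pdist requires.
        out.extend(top_square[i][i + 1:])
        out.extend(cross[i])
    out.extend(pdist_via_square(bottom, leaf))
    return out
```

The square form makes it easy to take "row i, columns after i" from the recursive result before appending the matching cdist row.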

jakirkham · 2017-10-09T05:29:48Z

This ends up being slower than the non-recursive implementation in PR #84, though this is likely a consequence of squareform's current implementation: it does a poor job of preserving chunks (in particular along one dimension). Reworking squareform to keep chunks together as much as possible may help.

Use `cdist` to compute the bulk of the results for `pdist`. Drop all duplicate chunks on the opposite side of the diagonal, and keep all chunks on the right side of the diagonal. For any chunks that land on the diagonal, break them up into pieces containing the non-duplicated pairs and pass those through `cdist` as well. Flatten all of these results so that they can be concatenated into one big result.
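
One possible reading of the non-recursive handling of a diagonal chunk, sketched with plain Python lists. It assumes the "pieces" are single-row strips to the right of the diagonal; the actual split used in the code may differ, and `pdist_diagonal_block` is a hypothetical name.

```python
import math


def cdist(a, b):
    """Distances between every point in a and every point in b."""
    return [[math.dist(p, q) for q in b] for p in a]


def pdist_diagonal_block(points):
    """Resolve one diagonal block using only cdist on non-duplicated pairs."""
    flat = []
    for i in range(len(points) - 1):
        # Strip for row i: only columns strictly after i, so no duplicates
        # and no zero self-distances.
        strip = cdist(points[i:i + 1], points[i + 1:])
        # Flatten the strip and append it to the running result.
        flat.extend(strip[0])
    return flat
```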

jakirkham · 2017-10-09T20:08:26Z

Even after a significant boost to squareform in PR #91, this approach appears to be negligibly affected. Something else is slowing this approach down, but it is unclear what. Unfortunately can't explore this further ATM.

jakirkham · 2017-10-10T19:57:11Z

Have reworked this implementation to not use squareform at all with recursive pdist calls. This significantly improves the performance of this implementation relative to where it was with squareform. However it remains too slow compared to master and remains slower than the non-recursive implementation in PR #84.

Instead of using `squareform` to restructure the result in `pdist` from each recursive call, adjust the recursive strategy to work with the sparse `pdist` result. This should cut a significant amount of overhead out of the recursive `pdist` diagonal optimization strategy.
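
To illustrate working with the vector-form result directly, here is a small sketch that slices one row's worth of pairs out of it without rebuilding a square matrix. It relies on the row-major upper-triangle ordering that `pdist` uses; the helper names `row_start` and `condensed_row` are made up for this example and plain lists stand in for Dask arrays.

```python
def row_start(n, i):
    """Offset of row i in the vector-form result for n points.

    Rows 0..i-1 contribute (n - 1) + (n - 2) + ... + (n - i) entries.
    """
    return i * n - i * (i + 1) // 2


def condensed_row(condensed, n, i):
    """Distances from point i to points i+1..n-1, sliced straight from
    the vector-form result."""
    start = row_start(n, i)
    return condensed[start:start + (n - 1 - i)]
```

Each row slice can then be joined with the matching `cdist` row, which is the step that previously went through the square matrix.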

jakirkham mentioned this pull request Oct 9, 2017

WIP: Optimize pdist's handling of diagonal chunks #84

Open

jakirkham changed the title ~~Recursively handle pdist's diagonal chunks~~ WIP: Recursively handle pdist's diagonal chunks Oct 9, 2017

jakirkham force-pushed the opt_gen_pdist_diag_recrsv branch from 0af1da3 to c31a83a Compare October 9, 2017 15:48

jakirkham added 4 commits October 9, 2017 11:48

Try to restore previous chunking for squareform

1049849

In recursive calls to `pdist`, try to rechunk the result to match the original chunking before it was split further. This is done in an effort to ensure that `squareform` handles it well.

Revert "Try to restore previous chunking for squareform"

ae0b43f

This reverts commit 1049849.

jakirkham force-pushed the opt_gen_pdist_diag_recrsv branch from a759f0f to ae0b43f Compare October 9, 2017 20:04

jakirkham mentioned this pull request Oct 9, 2017

Optimizing away pdist's diagonal excess #92

Open

jakirkham force-pushed the opt_gen_pdist_diag_recrsv branch from fbc27fd to ace1e58 Compare October 10, 2017 20:05

jakirkham force-pushed the opt_gen_pdist_diag_recrsv branch from ace1e58 to 72cfba4 Compare October 10, 2017 20:07

jakirkham added 4 commits October 10, 2017 16:15

Use concatenate to break chunks for recursion

f607865

Instead of explicitly setting the chunking for recursive calls to `pdist`, simply slice each piece and use `concatenate` to join them back together. This has basically the same effect as rechunking, but appears to be a little bit faster.

Slice pdist recursive blocks once

a83f02e

Combine two calls to `getitem` on the blocks that `pdist` acts on recursively so there is only one call to `getitem`. Should make things a bit more efficient.

Unwrap some short lines in pdist

9eb5679

Drop unused empty array

d2408ee

The empty array was really just filler at this point. Not to mention it doesn't make much sense now that we are using flattened results from `pdist` and the empty array is not being flattened at all.
