Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Recursively handle pdist's diagonal chunks #90

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

jakirkham
Copy link
Owner

@jakirkham jakirkham commented Oct 9, 2017

Follow-up to PR ( #84 ).
Fixes #92

Make use of cdist for computing the bulk of the results for pdist. However drop all duplicate chunks on the opposite side of the diagonal. Though keep all chunks on the right side of the diagonal.

As for any chunks that straddle the diagonal, recursively break them up into smaller pieces. If the pieces are on the right side of the diagonal, they are trivially handled with cdist. If they are on or beyond the diagonal, they are trivially dropped. If they still land on the diagonal, repeat the process by calling into pdist again until they are resolved one of these two ways.

Since pdist returns its results in vector form, the recursive portion needs to make use of squareform to convert them back into square matrices that can be more easily worked with. Though the results are again unraveled according to the constraints of pdist. So the brief restructuring with squareform is a mere convenience to allow recursive calls of pdist to proceed without issues.

@jakirkham
Copy link
Owner Author

This ends up being slower than the non-recursive implementation in PR ( #84 ). Though this is likely a consequence of squareform's current implementation. Namely it does a poor job of preserving chunks (in particular along one dimension). A reworking of squareform to keep chunks together as much as possible may help.

@jakirkham jakirkham changed the title Recursively handle pdist's diagonal chunks WIP: Recursively handle pdist's diagonal chunks Oct 9, 2017
Make use of `cdist` for computing the bulk of the results for `pdist`.
However drop all duplicate chunks on the opposite side of the diagonal.
Though keep all chunks on the right side of the diagonal. As for any
chunks that land on the diagonal, break them up into pieces that include
non-duplicated pairs and pass those through `cdist` as well. Take all of
these results and flatten them so that they can be concatenated into one
big result.
Make use of `cdist` for computing the bulk of the results for `pdist`.
However drop all duplicate chunks on the opposite side of the diagonal.
Though keep all chunks on the right side of the diagonal.

As for any chunks that straddle the diagonal, recursively break them up
into smaller pieces. If the pieces are on the right side of the
diagonal, they are trivially handled with `cdist`. If they are on or
beyond the diagonal, they are trivially dropped. If they still
land on the diagonal, repeat the process by calling into `pdist` again
until they are resolved one of these two ways.

Since `pdist` returns its results in vector form, the recursive portion
needs to make use of `squareform` to convert them back into square
matrices that can be more easily worked with. Though the results are
again unraveled according to the constraints of `pdist`. So the brief
restructuring with `squareform` is a mere convenience to allow recursive
calls of `pdist` to proceed without issues.
In recursive calls to `pdist`, try to rechunk the result to match the
original chunking before it was split further. This is done in an effort
to ensure that `squareform` handles it well.
@jakirkham
Copy link
Owner Author

Even after a significant boost to squareform, in PR ( #91 ), it appears this approach was negligibly affected. Something else is slowing this approach down, but it is unclear what. Unfortunately can't explore this further ATM.

@jakirkham
Copy link
Owner Author

Have reworked this implementation to not use squareform at all with recursive pdist calls. This significantly improves the performance of this implementation relative to where it was with squareform. However it remains to slow compared to master and remains slower than the non-recursive implementation in PR ( #84 ).

Instead of using `squareform` to restructure the result in `pdist` from
each recursive call, adjust the recursive strategy to work with the
sparse `pdist` result. This should cut a significant amount of overhead
out of the recursive `pdist` diagonal optimization strategy.
Instead of explicitly setting the chunking for recursive calls to
`pdist`. Simply slice each piece and use `concatenate` to join them back
together. This has basically the same effect as rechunking, but appears
to be a little bit faster.
Combine to calls to `getitem` on the blocks that `pdist` acts on
recursively so there is only one call to `getitem`. Should make things a
bit more efficient.
The empty array was really just filler at this point. Not to mention it
doesn't make much sense now that we are using flattened results from
`pdist` and the empty array is not being flattened at all.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimizing away pdist's diagonal excess
1 participant