Add tree walking functions #199

Open. Wants to merge 6 commits into base: master.

Conversation

dgleich commented Jul 7, 2024

This is an initial take on how to set up the tree-walking code.

This addresses #194.

I need to add more documentation still, but this should be enough to get some initial feedback before writing more docs.

dgleich (Author) commented Jul 9, 2024

I switched the names to ...

leafpoints, leaf_points_indices, treeindex ... I'm also thinking that using treeroot would be more consistent with these than root...

KristofferC (Owner) commented Jul 9, 2024

Random thought: would it make sense to implement the AbstractTrees.jl interface (https://github.com/JuliaCollections/AbstractTrees.jl) for the trees in this package?

dgleich (Author) commented Jul 9, 2024

Good question. Let me think....

dgleich (Author) commented Jul 9, 2024

On a related note, currently, we can implement a parent function for BallTree nodes because it stores the associated regions explicitly, but I don't see an obvious way to do this for KDTree nodes without storing the min and max values for the dimension that was split in the tree.

There are some key advantages to having a parent function, e.g. then you can do tree traversal and iteration without a stack. (And I think some of the abstract trees methods would need this...)

On the other hand, for pure NearestNeighbors functions, storing this extra information is unneeded.

How amenable would you be to adding a split_minmax value to the KDTree struct that stores that information?

This would just store a tuple of values for the boundaries of the dimension that is split so they can be restored via a parent call.
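
To make the split_minmax idea concrete, here is a minimal sketch on toy types. ToyBox, child_box, and parent_box are hypothetical names for illustration, not the structs or functions in this package; the sketch only assumes that a child's region differs from its parent's along the split dimension.

```julia
# Hypothetical toy types; not the NearestNeighbors.jl structs.
struct ToyBox
    mins::Vector{Float64}
    maxes::Vector{Float64}
end

# Descend to a child: clamp one side of the box at the split value.
function child_box(box::ToyBox, dim::Int, split::Float64, left::Bool)
    mins, maxes = copy(box.mins), copy(box.maxes)
    if left
        maxes[dim] = split
    else
        mins[dim] = split
    end
    return ToyBox(mins, maxes)
end

# Ascend to the parent: the child's box no longer knows the parent's extent
# along `dim`, so it has to come from a stored (min, max) tuple.
function parent_box(box::ToyBox, dim::Int, minmax::Tuple{Float64,Float64})
    mins, maxes = copy(box.mins), copy(box.maxes)
    mins[dim], maxes[dim] = minmax
    return ToyBox(mins, maxes)
end

root_box = ToyBox([0.0, 0.0], [4.0, 2.0])
split_dim, split_val = 1, 1.5

# This tuple is what a split_minmax field would store for the split node.
split_minmax = (root_box.mins[split_dim], root_box.maxes[split_dim])

left_box = child_box(root_box, split_dim, split_val, true)
restored_box = parent_box(left_box, split_dim, split_minmax)
```

Without the stored tuple, the left child only knows its own upper bound (the split value), which is why the parent's box cannot be recovered from the child alone.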

src/tree_ops.jl Outdated
@@ -12,6 +12,175 @@ function show(io::IO, tree::NNTree{V}) where {V}
print(io, " Reordered: ", tree.reordered)
end

struct NNTreeNode{T <: NNTree, R}
index::Int
tree::T

Owner

I have a general "philosophy" that storing something big (a full KD Tree) in something that is conceptually small (a tree node) is often a mistake.

As you traverse the tree you will create all these nodes that will all contain the same tree. What do you think about dropping the tree field and instead requiring the user to provide the tree as an argument to the traverse functions?

Author

This is a good question and good rationale. My own experience has been that Julia is very good at optimizing the codes when the types are immutable, so I doubt it is really creating different copies if you use it in a function.

My argument for the current organization is that node ids are tied to the tree, so storing the tree means you don't have an additional argument hanging around everywhere. It makes it easy and simple to write code that does the right thing and gets the right answer. But as I said, I hadn't considered your particular perspective here.

Is there a test we could do to resolve whether this is an issue? (i.e., to convince me that your perspective is correct, or for me to convince you that it isn't a problem to store the tree and the compiler really is smart enough?)

Maybe vectors of nodes would be bad for including the tree? But do we ever actually need them?

Another argument for keeping it linked is that the AbstractTrees interface is 'node' oriented, so you define children, parent, etc. on a node level; which would require keeping the tree as part of the struct.

Author

Okay -- you are right -- this does make a big difference. I wrote a trivial benchmark that walks the tree and adds up the sizes of the leaves, so it only measures the traversal (100k points in total).

Storing the NNTree in each node takes ~131 μs. Making raw calls with node ids and passing the tree as a parameter to the function takes ~29 μs. But... if I store a ref to the tree rather than the full tree structure, then I get all the functionality and it takes ~45 μs.

I think the latter is worth doing, so I'll implement that and update the pull request. Note that all of this skips the region computations for the KDTree, so including them will shrink the relative difference.
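
The two calling styles being compared can be sketched on a toy tree as follows. CountTree, TreeCarryingNode, count_carrying, and count_passing are hypothetical stand-ins, and the heap-style layout (children of node i at 2i and 2i + 1) is an assumption of the sketch; this is not the benchmark code that produced the timings.

```julia
# Hypothetical toy tree: nodes 1:n_internal are internal, the rest are leaves.
struct CountTree
    leaf_sizes::Vector{Int}   # number of points stored in each leaf
    n_internal::Int
end

is_leaf_index(t::CountTree, i::Int) = i > t.n_internal
leaf_size(t::CountTree, i::Int) = t.leaf_sizes[i - t.n_internal]

# Style 1: every node value carries the tree with it.
struct TreeCarryingNode
    index::Int
    tree::CountTree
end

function count_carrying(n::TreeCarryingNode)
    t = n.tree
    is_leaf_index(t, n.index) && return leaf_size(t, n.index)
    left = TreeCarryingNode(2 * n.index, t)
    right = TreeCarryingNode(2 * n.index + 1, t)
    return count_carrying(left) + count_carrying(right)
end

# Style 2: raw node ids, with the tree passed as a separate argument.
function count_passing(t::CountTree, i::Int)
    is_leaf_index(t, i) && return leaf_size(t, i)
    return count_passing(t, 2 * i) + count_passing(t, 2 * i + 1)
end

count_tree = CountTree([3, 1, 4, 1], 3)   # 3 internal nodes, 4 leaves
total_carrying = count_carrying(TreeCarryingNode(1, count_tree))
total_passing = count_passing(count_tree, 1)
```

Both styles compute the same total; the difference under discussion is only how much data each node value drags along. A toy this small is unlikely to reproduce the timings above, since the real tree structs carry more fields.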

Owner

> But... if I store a ref to the tree rather than the full tree structure

I don't fully understand what that means.

Author

Here's the updated structure. This stores a pointer to the tree information instead of a copy of all the information.

struct NNTreeNode{TreeRef <: Ref{<:NNTree}, R}
    index::Int
    treeref::TreeRef
    region::R
end
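
As a rough illustration of what storing a ref means here, the sketch below wraps a toy struct in Base's Ref and shares that one wrapper between nodes. BigThing and RefNode are hypothetical names, not types from this package; only Ref and the `[]` dereference are standard Julia.

```julia
# Hypothetical stand-in for a tree struct with several fields.
struct BigThing
    data::Vector{Float64}
    a::Int
    b::Int
    c::Int
end

# The node holds a reference to the tree instead of the tree value itself.
struct RefNode{TR <: Ref{<:BigThing}}
    index::Int
    treeref::TR
end

big = BigThing(collect(1.0:10.0), 1, 2, 3)
shared_ref = Ref(big)            # one wrapper, created once

node_a = RefNode(1, shared_ref)
node_b = RefNode(2, shared_ref)

# `[]` dereferences the Ref to reach the wrapped tree.
first_value = node_a.treeref[].data[1]

# Each node is just an index plus one reference, however large BigThing is.
node_bytes = sizeof(typeof(node_a))
```

Every node created during a traversal then points at the same wrapper, so constructing a node does not depend on how many fields the tree struct has.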

Author

The issue I have with the iterate analogy is that iterate is designed to execute within a single function context -- and has some nice syntax to hide the complexity and different types of objects -- whereas most of the tree walking functions are designed to execute recursively, where there is no such affordance that I know of. So you'll have to pass the tree structure to any subfunction -- as well as the node structure.

The current design is just meant to be easy to use; it's also feasible to adapt it to the AbstractTrees.jl interface (although I haven't done that yet...), where they do the same thing with parent/children/etc. functions.

But it seems like you are still leaning against it even though there is minimal overhead, is that correct?

Author

Just to be precise, the interface you would like is:

children(T::Tree, n::Node) -> (nl::Node, nr::Node)
parent(T::Tree, n::Node) -> (p::Node)
region(n::Node) -> 
leaf_points(T::Tree, n::Node) -> something that iterates over points in the leaf node
etc...

where Node is something simple like:

struct Node{R} 
  index::Int
  region::R
end 
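
A minimal sketch of how that interface could look in use, on a one-dimensional toy tree. SplitTree, IntervalNode, toy_root, toy_children, and leaf_regions are hypothetical names, and the heap-style indexing is an assumption of the sketch; the point is only that the node stays small and the tree is passed alongside it.

```julia
# Hypothetical 1-D toy tree: split value of each internal node, heap order.
struct SplitTree
    splits::Vector{Float64}
end

# The node is just an index plus its region; it does not hold the tree.
struct IntervalNode{R}
    index::Int
    region::R
end

toy_root(t::SplitTree, lo::Float64, hi::Float64) = IntervalNode(1, (lo, hi))
toy_isleaf(t::SplitTree, n::IntervalNode) = n.index > length(t.splits)

# children(T, n) -> (nl, nr): the tree supplies the split, the node its region.
function toy_children(t::SplitTree, n::IntervalNode)
    s = t.splits[n.index]
    lo, hi = n.region
    return IntervalNode(2 * n.index, (lo, s)), IntervalNode(2 * n.index + 1, (s, hi))
end

# A recursive walk has to thread the tree through every call.
function leaf_regions(t::SplitTree, n::IntervalNode, out = Tuple{Float64,Float64}[])
    if toy_isleaf(t, n)
        push!(out, n.region)
    else
        nl, nr = toy_children(t, n)
        leaf_regions(t, nl, out)
        leaf_regions(t, nr, out)
    end
    return out
end

split_tree = SplitTree([5.0, 2.0, 8.0])
regions = leaf_regions(split_tree, toy_root(split_tree, 0.0, 10.0))
```

The extra `t` argument on every call is the cost being weighed against the smaller node.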

Author

Just a quick nudge on this question of interface. I would love to get this wrapped up in the next week or so, before some obligations for school start.

Author

Okay, since I had a moment, I just implemented the interface above. As a check, we can do non-recursive exploration of the tree using the current children, parent, and next/prev sibling structure; see, e.g., the points iterator...
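
For readers unfamiliar with the technique, non-recursive exploration via parent and sibling moves can be sketched on raw indices like this. The helper names and the heap-style layout (node i is internal iff i <= n_internal, children at 2i and 2i + 1) are assumptions of the sketch, not the points iterator in this pull request.

```julia
# Hypothetical index helpers for a heap-style layout.
left_child(i::Int) = 2 * i
parent_index(i::Int) = div(i, 2)
is_left_child(i::Int) = iseven(i)

# Visit the leaves left to right using only child, parent, and sibling moves,
# with no recursion and no explicit stack.
function leaf_order(n_internal::Int)
    order = Int[]
    i = 1
    while true
        # Descend to the leftmost leaf below the current node.
        while i <= n_internal
            i = left_child(i)
        end
        push!(order, i)
        # Climb while the current node is a right child.
        while i != 1 && !is_left_child(i)
            i = parent_index(i)
        end
        i == 1 && break   # climbed back to the root: every leaf was visited
        i += 1            # move from a left child to its right sibling
    end
    return order
end

order_small = leaf_order(3)   # 3 internal nodes, leaves 4, 5, 6, 7
order_single = leaf_order(0)  # the tree is a single leaf
```

The parent move is what replaces the stack, which is why a parent function matters for this style of traversal.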

Owner

Hi, sorry for the slow response here and sorry for being a bit "annoying" with trying to figure out the "best" interface to use.

A reason for this is that this is my first Julia package so it holds a bit of a special place in my heart and I have also worked quite a bit to reduce memory footprint and improve performance.

I can add your package so it is tested as part of the CI here (and you could then at any time also implement whatever tree walking interface you want there and it will not be broken, or at least it can be updated if changes are made here that would be incompatible with it).

KristofferC (Owner) commented Jul 11, 2024

As a check of the functionality here it would be nice to reimplement https://github.com/KristofferC/NearestNeighbors.jl/blob/master/examples/balltree_illustration.ipynb using these official traverse functions. Doesn't strictly have to be done here but it would serve as somewhat of a use case check.

2 participants