-
Notifications
You must be signed in to change notification settings - Fork 124
Add HDBSCAN from HorseML.jl #273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
MommaWatasu
wants to merge
47
commits into
JuliaStats:master
Choose a base branch
from
MommaWatasu:master
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+391
−0
Open
Changes from all commits
Commits
Show all changes
47 commits
Select commit
Hold shift + click to select a range
23bbed1
[add function or file]add hdbscan
MommaWatasu fa44398
[test]add test for hdbscan
MommaWatasu 4e64fdf
[fix]change Int64 to Int
MommaWatasu e294565
[fix]change all Int64 into Int
MommaWatasu b851997
[change]change usage and remove extra code
MommaWatasu 7822a7c
[test]update test
MommaWatasu 8901cfe
[docs]add docs for HDBSCAN
MommaWatasu 6de5d02
[fix]export HdbscanCluster
MommaWatasu 85c5644
[docs]merge docs of HDBSCAN with DBSCAN.md
MommaWatasu edcc70a
[clean]refactoring HDBSCANGraph
MommaWatasu 939ce65
[docs]fix docs
MommaWatasu 2f67d07
[clean]refactoring
MommaWatasu 61463f2
[clean]refactoring
MommaWatasu 09ed174
[test]update test
MommaWatasu 6039c0c
[fix]change isnothing into ===
MommaWatasu 3cc7689
[fix]remove println
MommaWatasu 1798148
[docs]update docs
MommaWatasu a0a819e
[docs]fix docs
MommaWatasu bf38eb6
[fix]fix docstring
MommaWatasu 380acf1
[fix]add isnoise to the list of exprted function
MommaWatasu 6fdebe6
core_distances() tweaks
alyst ff98806
Hdbscan: simplify results generation
alyst 64a202a
remove heappush!: unused
alyst 661edbc
hdbscan test: small tweaks
alyst 082c5e6
fixup hdbscan assignments
alyst c9db368
hdbscan: further opt core_dists
alyst c205575
hdbscan: optimize edges generation
alyst bdf1aec
HDBSCANGraph -> HdbscanGraph
alyst 9aa9841
HdbscanEdge
alyst c9b9374
HdbscanGraph: rename edges to adj_edges
alyst 092ac40
MSTEdge remove no-op expand() method
alyst c972caa
refactor HdbscanMSTEdge
alyst 57b05e6
hdbscan: fix graph vertices, remove add_edges!
alyst f616795
hdbscan_minspantree(): refactor
alyst 0f6e992
prune_clusters!(): cleanup
alyst e01609b
hdbscan: fix 1.0 compat
alyst de6b83a
[docs]fix docstring
MommaWatasu 38212ca
[clean]rename and refactoring
MommaWatasu 159858d
[test]add test for unionfind
MommaWatasu a03c224
hdbscan_minspantree: fix edges sorting
alyst 3a577ea
hdbscan_clusters(): fix cost type
alyst 744af22
hdbscan_clusters: improve MST iteration
alyst 8803573
Merge branch 'master' of https://github.com/MommaWatasu/Clustering.jl
MommaWatasu b94de25
[clean]rename the result structure
MommaWatasu 7078223
[hotfix]apply hot fix
MommaWatasu b0791d9
[test]add test about min_size
MommaWatasu dc5dd40
[add function or file]support ClusteringResult
MommaWatasu File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,232 @@ | ||
# HDBSCAN Graph edge: target vertex and mutual reachability distance. | ||
HdbscanEdge = Tuple{Int, Float64} | ||
|
||
# HDBSCAN Graph | ||
struct HdbscanGraph | ||
adj_edges::Vector{Vector{HdbscanEdge}} | ||
|
||
HdbscanGraph(nv::Integer) = new([HdbscanEdge[] for _ in 1 : nv]) | ||
end | ||
|
||
# Edge of the minimum spanning tree for HDBScan algorithm | ||
HdbscanMSTEdge = NamedTuple{(:v1, :v2, :dist), Tuple{Int, Int, Float64}} | ||
|
||
mutable struct HdbscanNode | ||
parent::Int | ||
children::Vector{Int} | ||
points::Vector{Int} | ||
λp::Vector{Float64} | ||
stability::Float64 | ||
children_stability::Float64 | ||
|
||
function HdbscanNode(points::Vector{Int}; λp::Vector{Float64}=Float64[], children::Vector{Int}=Int[], children_stability::Float64=0.0) | ||
noise = isempty(λp) | ||
return new(0, children, points, λp, noise ? -1 : 0, children_stability) | ||
end | ||
end | ||
|
||
cluster_len(c::HdbscanNode) = length(c.points) | ||
|
||
""" | ||
HdbscanCluster | ||
|
||
A cluster generated by the [hdbscan](@ref) method, part of [HdbscanResult](@ref). | ||
- `points`: vector of points which belongs to the cluster | ||
- `stability`: stability of the cluster (-1 for noise clusters) | ||
|
||
The stability represents how much the cluster is "reasonable". So, a cluster which has a bigger stability is better. | ||
The noise cluster is determined not to belong to any cluster. So, you can ignore them. | ||
See also: [isnoise](@ref) | ||
""" | ||
struct HdbscanCluster | ||
points::Vector{Int} | ||
stability::Float64 | ||
end | ||
|
||
""" | ||
isnoise(c::HdbscanCluster) | ||
|
||
This function returns whether the cluster is the noise or not. | ||
""" | ||
isnoise(c::HdbscanCluster) = c.stability == -1 | ||
isnoise(c::HdbscanNode) = c.stability == -1 | ||
isstable(c::HdbscanNode) = c.stability != 0 | ||
function compute_stability!(c::HdbscanNode, λbirth) | ||
c.stability = sum(c.λp) - length(c.λp) * λbirth | ||
return c.stability | ||
end | ||
|
||
""" | ||
HdbscanResult | ||
|
||
Result of the [hdbscan](@ref) clustering. | ||
- `clusters`: vector of clusters | ||
- `assignments`: vectors of assignments for each points | ||
|
||
See also: [HdbscanCluster](@ref) | ||
""" | ||
struct HdbscanResult <: ClusteringResult | ||
clusters::Vector{HdbscanCluster} | ||
assignments::Vector{Int} | ||
counts::Vector{Int} | ||
end | ||
|
||
""" | ||
hdbscan(points::AbstractMatrix, ncore::Integer, min_cluster_size::Integer; | ||
metric=SqEuclidean()) | ||
|
||
Cluster `points` using Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN) algorithm. | ||
Refer to [HDBSCAN algorithm](@ref hdbscan_algorithm) description for the details on how the algorithm works. | ||
# Parameters | ||
- `points`: the *d*×*n* matrix, where each column is a *d*-dimensional point | ||
- `ncore::Integer`: number of *core* neighbors of point, see [HDBSCAN algorithm](@ref hdbscan_algorithm) for a description | ||
- `min_cluster_size::Integer`: minimum number of points in the cluster | ||
- `metric`(defaults to Euclidean): the points distance metric to use. | ||
""" | ||
function hdbscan(points::AbstractMatrix, ncore::Integer, min_cluster_size::Int; metric=Euclidean()) | ||
if min_cluster_size < 1 | ||
throw(DomainError(min_cluster_size, "The `min_cluster_size` must be greater than or equal to 1")) | ||
end | ||
n = size(points, 2) | ||
dists = pairwise(metric, points; dims=2) | ||
# calculate core (ncore-th nearest) distance for each point | ||
core_dists = [partialsort!(dists[:, i], ncore) for i in axes(dists, 2)] | ||
|
||
#calculate mutual reachability distance between any two points | ||
mrd = hdbscan_graph(core_dists, dists) | ||
#compute a minimum spanning tree by prim method | ||
mst = hdbscan_minspantree(mrd) | ||
#build a HDBSCAN hierarchy | ||
tree = hdbscan_build_tree(mst, min_cluster_size) | ||
#extract the target cluster | ||
extract_clusters!(tree) | ||
#generate the list of cluster assignment for each point | ||
clusters = HdbscanCluster[] | ||
counts = Int[] | ||
assignments = fill(0, n) # cluster index of each point | ||
for (i, j) in enumerate(tree[end].children) | ||
clu = tree[j] | ||
push!(clusters, HdbscanCluster(clu.points, clu.stability)) | ||
push!(counts, length(clu.points)) | ||
assignments[clu.points] .= i | ||
end | ||
# add the cluster of all unassigned (noise) points | ||
noise_points = findall(==(0), assignments) | ||
isempty(noise_points) || push!(clusters, HdbscanCluster(noise_points, -1)) | ||
return HdbscanResult(clusters, assignments, counts) | ||
end | ||
|
||
function hdbscan_graph(core_dists::AbstractVector, dists::AbstractMatrix) | ||
n = size(dists, 1) | ||
graph = HdbscanGraph(n) | ||
for i in axes(dists, 2) | ||
i_dists = view(dists, :, i) | ||
i_core = core_dists[i] | ||
for j in i+1:n | ||
c = max(i_core, core_dists[j], i_dists[j]) | ||
# add reciprocal edges | ||
push!(graph.adj_edges[i], (j, c)) | ||
push!(graph.adj_edges[j], (i, c)) | ||
end | ||
end | ||
return graph | ||
end | ||
|
||
function hdbscan_minspantree(graph::HdbscanGraph) | ||
# put the edge to the heap (sorted from largest to smallest distances) | ||
function heapput!(h, v::HdbscanMSTEdge) | ||
idx = searchsortedlast(h, v, by=e -> e.dist, rev=true) | ||
insert!(h, idx + 1, v) | ||
end | ||
|
||
# initialize the edges heap by putting all edges of the first node (the root) | ||
heap = HdbscanMSTEdge[] | ||
for (i, c) in graph.adj_edges[1] | ||
heapput!(heap, (v1=1, v2=i, dist=c)) | ||
end | ||
@assert issorted(heap, by=e -> e.dist, rev=true) | ||
|
||
# build the tree | ||
n = length(graph.adj_edges) | ||
minspantree = Vector{HdbscanMSTEdge}() | ||
inmst = falses(n) # if the graph node is in MST | ||
inmst[1] = true # root | ||
|
||
while length(minspantree) < n-1 | ||
# get the edge with the smallest distance from the heap | ||
(i, j, c) = pop!(heap) | ||
|
||
# add j-th node to MST if not there | ||
inmst[j] && continue | ||
push!(minspantree, (v1=i, v2=j, dist=c)) | ||
inmst[j] = true | ||
|
||
# add adjacent edges of j to the heap | ||
for (k, c) in graph.adj_edges[j] | ||
inmst[k] || heapput!(heap, (v1=j, v2=k, dist=c)) | ||
end | ||
end | ||
return sort!(minspantree, by=e -> e.dist) | ||
end | ||
|
||
function hdbscan_build_tree(minspantree::AbstractVector{HdbscanMSTEdge}, min_size::Integer) | ||
n = length(minspantree) + 1 | ||
cost = 0.0 | ||
uf = UnionFind(n) | ||
clusters = [HdbscanNode(Int[i], λp=(min_size==1) ? Float64[Inf] : Float64[]) for i in 1:n] | ||
|
||
for (i, (j, k, c)) in enumerate(minspantree) | ||
cost += c | ||
λ = 1 / cost | ||
#child clusters | ||
c1 = set_id(uf, j) | ||
c2 = set_id(uf, k) | ||
#reference to the parent cluster | ||
clusters[c1].parent = clusters[c2].parent = n+i | ||
#unite the clusters | ||
unite!(uf, j, k) | ||
points = items(uf, set_id(uf, j)) | ||
if length(points) < min_size | ||
push!(clusters, HdbscanNode(points)) | ||
else | ||
nc1, nc2 = isnoise(clusters[c1]), isnoise(clusters[c2]) | ||
if nc1 && nc2 | ||
push!(clusters, HdbscanNode(points, λp=fill(λ, length(points)))) | ||
elseif nc1 || nc2 | ||
#ensure that c1 is the cluster | ||
if nc1 == true | ||
c1, c2 = c2, c1 | ||
end | ||
push!(clusters, HdbscanNode(points, λp=[clusters[c1].λp..., fill(λ, cluster_len(clusters[c2]))...])) | ||
else | ||
#compute stabilities for children | ||
c1_stability = compute_stability!(clusters[c1], λ) | ||
c2_stability = compute_stability!(clusters[c2], λ) | ||
#unite the two clusters | ||
push!(clusters, HdbscanNode(points, λp=fill(λ, length(points)), children=[c1, c2], children_stability=c1_stability+c2_stability)) | ||
end | ||
end | ||
end | ||
compute_stability!(clusters[end], 1/cost) | ||
@assert length(clusters) == 2n - 1 | ||
return clusters | ||
end | ||
|
||
function extract_clusters!(tree::Vector{HdbscanNode}) | ||
for i in eachindex(tree) | ||
i == length(tree) && continue | ||
c = tree[i] | ||
isempty(c.children) && continue | ||
parent = tree[c.parent] | ||
children_stability = sum([tree[c.children[i]].stability for i in eachindex(c.children)]) | ||
if children_stability > c.stability | ||
filter!(x->x!=i, parent.children) | ||
append!(parent.children, c.children) | ||
end | ||
end | ||
c = tree[end] | ||
children_stability = sum([tree[c.children[i]].stability for i in eachindex(c.children)]) | ||
if children_stability <= c.stability | ||
c.children = Int[2 * length(tree) - 1] | ||
end | ||
end |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Union-Find | ||
# structure for managing disjoint sets | ||
# This structure tracks which sets the elements of a set belong to, | ||
# and allows us to efficiently determine whether two elements belong to the same set. | ||
mutable struct UnionFind | ||
parent::Vector{Int} # parent[root] is the negative of the size | ||
label::Dict{Int, Int} | ||
next_id::Int | ||
|
||
function UnionFind(nodes::Int) | ||
if nodes <= 0 | ||
throw(ArgumentError("invalid argument for nodes: $nodes")) | ||
end | ||
|
||
parent = -ones(nodes) | ||
label = Dict([i=>i for i in 1:nodes]) | ||
new(parent, label, nodes) | ||
end | ||
end | ||
|
||
# label of the set which element `x` belong to | ||
set_id(uf::UnionFind, x::Int) = uf.label[root(uf, x)] | ||
# all elements that have the specified label | ||
items(uf::UnionFind, x::Int) = [k for (k, v) in pairs(uf.label) if v == x] | ||
|
||
# root of element `x` | ||
# The root is the representative element of the set | ||
function root(uf::UnionFind, x::Int) | ||
if uf.parent[x] < 0 | ||
return x | ||
else | ||
return uf.parent[x] = root(uf, uf.parent[x]) | ||
end | ||
end | ||
|
||
function setsize(uf::UnionFind, x::Int) | ||
return -uf.parent[root(uf, x)] | ||
end | ||
|
||
function unite!(uf::UnionFind, x::Int, y::Int) | ||
x = root(uf, x) | ||
y = root(uf, y) | ||
if x == y | ||
return false | ||
end | ||
if uf.parent[x] > uf.parent[y] | ||
x, y = y, x | ||
end | ||
# unite smaller tree(y) to bigger one(x) | ||
uf.parent[x] += uf.parent[y] | ||
uf.parent[y] = x | ||
uf.next_id += 1 | ||
uf.label[y] = uf.next_id | ||
for i in items(uf, set_id(uf, x)) | ||
uf.label[i] = uf.next_id | ||
end | ||
return true | ||
end |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.