Allow setting random state for reproducibility #59

Open
psads-git opened this issue Jan 21, 2023 · 4 comments

Comments

@psads-git
Contributor

Dear Miles,

I have run gap-stat several times on the same dataset. However, the optimal number of clusters that gap-stat returns is not always the same. I guess this happens because the reference distribution is randomly generated (the code uses numpy for that). So, for reproducibility, it seems reasonable to give the optimalK function a random_state argument.

If you agree, maybe I would be able to change the code accordingly, with your directions and help.

Thanks!
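
To illustrate the request, here is a minimal sketch of a gap statistic whose only source of randomness is a single seeded numpy generator. It is not gap-stat's actual code: the names `kmeans_dispersion` and `gap_statistic`, the `n_refs` parameter, and the tiny k-means are all assumptions made for this example.

```python
import numpy as np


def kmeans_dispersion(X, k, rng, n_iter=20):
    """Run a tiny Lloyd's k-means; return the within-cluster sum of squares."""
    # Initialise the centers from k distinct points picked by the seeded generator.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()


def gap_statistic(X, k, n_refs=10, random_state=None):
    """Gap value for k clusters; random_state seeds all random draws."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_disp = np.empty(n_refs)
    for i in range(n_refs):
        # Reference set: uniform samples inside the bounding box of the data.
        ref = rng.uniform(lo, hi, size=X.shape)
        ref_log_disp[i] = np.log(kmeans_dispersion(ref, k, rng))
    return ref_log_disp.mean() - np.log(kmeans_dispersion(X, k, rng))


# Two well-separated blobs as toy data (generated with its own fixed seed).
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, (50, 2)), data_rng.normal(6, 1, (50, 2))])

gap_a = gap_statistic(X, k=2, random_state=42)
gap_b = gap_statistic(X, k=2, random_state=42)
print(gap_a == gap_b)  # the same seed reproduces the same gap value
```

Passing the same `random_state` reproduces the reference sets and the k-means initialisation, so repeated calls agree; leaving it as `None` draws fresh entropy on every call.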

@psads-git
Contributor Author

Or, even better, let the user select the number of Monte Carlo (“bootstrap”) samples. The reason is given in the documentation of the R function clusGap:

The main result $Tab[,"gap"] of course is from bootstrapping aka Monte Carlo simulation and hence random, or equivalently, depending on the initial random seed (see set.seed()). On the other hand, in our experience, using B = 500 gives quite precise results such that the gap plot is basically unchanged after an another run.
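
The effect described in that quote can be sketched in a few lines: the more reference sets are averaged, the less the estimate moves from one run to the next. This is only an illustration under simplifying assumptions (k = 1, uniform data in the unit square, hypothetical function names), not the code of clusGap or gap-stat.

```python
import numpy as np


def mean_ref_log_dispersion(n_points, n_refs, rng):
    """Average log dispersion (k = 1) over n_refs uniform reference sets."""
    values = []
    for _ in range(n_refs):
        ref = rng.uniform(0.0, 1.0, size=(n_points, 2))
        # For k = 1 the dispersion is the sum of squared distances to the mean.
        values.append(np.log(((ref - ref.mean(axis=0)) ** 2).sum()))
    return float(np.mean(values))


def spread_across_runs(n_refs, n_runs=30, seed=0):
    """Standard deviation of the averaged estimate over repeated runs."""
    rng = np.random.default_rng(seed)
    estimates = [mean_ref_log_dispersion(100, n_refs, rng) for _ in range(n_runs)]
    return float(np.std(estimates))


spread_small = spread_across_runs(n_refs=5)
spread_large = spread_across_runs(n_refs=500)
print(spread_small, spread_large)  # the second value is noticeably smaller
```

The Monte Carlo error of an average shrinks roughly with the square root of the number of samples, which is why a large B makes the result nearly independent of the seed.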

@milesgranger
Owner

Hi!

I suppose one could use the clusterer param to pass their own callable that takes a random state? But anyway, I'm open to this addition and have no strong opinions on how it ought to be done. So please feel free to open another PR and we'll see how it goes. 👍
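
A rough sketch of that idea: a clustering callable that accepts its own seed. The function name `seeded_kmeans`, its `seed` keyword, and the `(centers, labels)` return convention are assumptions for illustration, and the commented usage lines are likewise unverified here. Note that such a callable would only make the clustering step deterministic; it would not by itself seed the randomly generated reference distribution.

```python
import numpy as np


def seeded_kmeans(X, k, seed=0, n_iter=20):
    """Tiny Lloyd's k-means with a seeded initialisation (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Pick k distinct data points as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Recompute the labels so they match the final centers.
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centers, labels


X = np.random.default_rng(1).normal(size=(60, 2))
centers_a, labels_a = seeded_kmeans(X, 3, seed=0)
centers_b, labels_b = seeded_kmeans(X, 3, seed=0)
print(np.array_equal(labels_a, labels_b))  # identical seed, identical clustering

# Assumed usage with the clusterer param mentioned above (not run here):
# from gap_statistic import OptimalK
# optimal_k = OptimalK(clusterer=seeded_kmeans, clusterer_kwargs={"seed": 0})
```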

@psads-git
Contributor Author

Dear Miles,

One can run R from inside Python via the rpy2 package. Using the same dataset, the R package NbClust consistently returns the same optimal number of clusters and the same value of the gap statistic:

# R code
library(NbClust)

res <- NbClust(data_normalized, distance = "euclidean", 
              min.nc = 2, max.nc = 10, method = "kmeans", index="gap")

print(res$Best.nc)

So, I have to study the way they do that.

Have a nice Sunday!

Paulo

@lebedov
Contributor

lebedov commented Jul 5, 2023

> Dear Miles,
>
> I have used gap-stat on a same dataset. However, the optimal number of clusters that gap-stat returns is not always the same. I guess this happens because the reference distribution is randomly generated (actually, you use numpy for that in the code). So, for reproducibility reasons, it appears reasonable to have optimalK function with an argument random_state.
>
> If you agree, maybe I would be able to change the code accordingly, with your directions and help.
>
> Thanks!

Added this functionality in #61.

milesgranger pushed a commit that referenced this issue Jul 31, 2023
* Add support for seeding RNG used for random sampling (#59).

* Tweak docstring.

* Don't use 'int | None' for random_state type hint because it only works on Py 3.10+.

* Revert change to Cargo.lock.
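
For context on the type-hint note in the commit message: the `int | None` union syntax is evaluated when the function is defined and raises a `TypeError` on Python versions before 3.10 (unless annotations are postponed), whereas `typing.Optional` works on older versions too. A minimal sketch of the compatible spelling follows; `make_rng` is a hypothetical helper, not the function changed in the commit.

```python
import random
from typing import Optional


def make_rng(random_state: Optional[int] = None) -> random.Random:
    """Return a generator seeded with random_state (unseeded when None)."""
    return random.Random(random_state)


first = make_rng(7).random()
second = make_rng(7).random()
print(first == second)  # the same seed yields the same first draw
```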