
model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True #1317

Closed
slice-pranay opened this issue Jun 2, 2023 · 13 comments · Fixed by #1960

Comments

slice-pranay commented Jun 2, 2023

Hi Maarten,

Firstly, thank you for this amazing library. I'm generating topics on the 20 newsgroups data for testing, and I am using cuML for UMAP and HDBSCAN. I set calculate_probabilities=True and ran fit_transform() on the data; it worked fine and gave good results. However, when I run transform() on new data, it raises AttributeError: 'tuple' object has no attribute 'shape'. With calculate_probabilities=False, the function works fine.

The libraries I am using are:
bertopic==0.15.0
cuml-cu11==23.4.1
cudf-cu11==23.4.1
cuda toolkit 11.8

I am running on a virtual ubuntu machine with Tesla T4 GPU.

The code to reproduce this error:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)

The error that comes when I run this:

[Screenshot of the traceback, ending in AttributeError: 'tuple' object has no attribute 'shape']

Can you please guide me in solving this error?

beckernick (Contributor) commented Jun 2, 2023

Perhaps this if block could use cuML's membership_vector function to align with the CPU hdbscan behavior:

if func == "membership_vector":
    if isinstance(model, hdbscan.HDBSCAN):
        probabilities = hdbscan.membership_vector(model, embeddings)
        return probabilities
    str_type_model = str(type(model)).lower()
    if "cuml" in str_type_model and "hdbscan" in str_type_model:
        from cuml.cluster.hdbscan.prediction import approximate_predict
        probabilities = approximate_predict(model, embeddings)
        return probabilities
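As a side note, that string-based type check can be exercised without a GPU. A small sketch (DummyCumlHdbscanModel and is_cuml_hdbscan are hypothetical stand-ins; the type name just happens to contain the same substrings the real check looks for):

```python
class DummyCumlHdbscanModel:
    """Hypothetical stand-in whose type string mimics a cuML HDBSCAN model."""
    pass

def is_cuml_hdbscan(model) -> bool:
    # Mirrors the string-based dispatch above: cuML models are detected by
    # inspecting the lowercased string form of their type.
    str_type_model = str(type(model)).lower()
    return "cuml" in str_type_model and "hdbscan" in str_type_model

print(is_cuml_hdbscan(DummyCumlHdbscanModel()))  # True
print(is_cuml_hdbscan(object()))                 # False
```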

Or, it could perhaps be updated to reflect that approximate_predict returns a tuple of (labels, probabilities) (even if only the probabilities will be returned by the function).

predictions, probabilities = hdbscan.approximate_predict(model, embeddings)
return predictions, probabilities
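The failure mode itself can be reproduced without cuML: the caller invokes .shape on the return value, which only works once the tuple is unpacked. A minimal sketch (fake_approximate_predict is a hypothetical stand-in for cuML's function; numpy stands in for the GPU arrays):

```python
import numpy as np

def fake_approximate_predict(model, embeddings):
    # Stand-in mimicking cuML's approximate_predict, which returns a
    # (labels, probabilities) tuple rather than a bare array.
    labels = np.zeros(len(embeddings), dtype=int)
    probabilities = np.full((len(embeddings), 3), 1 / 3)
    return labels, probabilities

embeddings = np.random.rand(10, 5)
result = fake_approximate_predict(None, embeddings)
# result.shape would raise AttributeError: 'tuple' object has no attribute 'shape'
labels, probabilities = result  # unpacking yields arrays that do have .shape
print(probabilities.shape)  # (10, 3)
```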

MaartenGr (Owner)

Ah, it seems indeed that the incorrect function is used there. I believe simply replacing:

from cuml.cluster.hdbscan.prediction import approximate_predict 
probabilities = approximate_predict(model, embeddings) 

with this should solve the issue:

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings) 

I can fix this in an upcoming release. PRs are also greatly appreciated!

slice-pranay (Author)

Thank you @MaartenGr, this change along with one more change solved the problem. Just replacing approximate_predict with membership_vector gave another error:

ValueError: batch_size should be in integer that is >= 0 and <= the number of prediction points

Looking into the membership_vector function in cuml/cluster/hdbscan/prediction.pyx, there is another parameter, batch_size, which defaults to 4096. The function is missing a check that lowers this value to the number of embeddings when there are fewer than 4096 of them. Adding this check in the function call itself solved the issue.

The final code that works for me is:

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings))) 
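The essential part of this workaround is simply clamping the batch size to the number of prediction points. As a plain-Python illustration (safe_batch_size is a hypothetical helper name, not part of cuML):

```python
DEFAULT_BATCH_SIZE = 4096

def safe_batch_size(n_points: int, default: int = DEFAULT_BATCH_SIZE) -> int:
    # membership_vector rejects batch_size values larger than the number of
    # prediction points, so clamp the default down for small inputs.
    return min(default, n_points)

print(safe_batch_size(100))    # 100  -> fewer points than the default batch
print(safe_batch_size(15000))  # 4096 -> default batch size is kept
```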

MaartenGr (Owner)

@slice-pranay Awesome, thanks for diving into this! If you want, it would be great if you created a PR for this. Otherwise, I can also add it in the coming weeks when I find some time. Either way, thanks for this!

beckernick (Contributor) commented Jun 5, 2023

> (quoting @slice-pranay's comment above: the membership_vector workaround with batch_size=min(4096, len(embeddings)))

Thanks for surfacing this issue. When used like this, the batch_size parameter shouldn't be necessary (and shouldn't have any effect). It is designed for scenarios with large amounts of data, where users may want to trade a small amount of performance for lower peak memory requirements (the default batch size of 4096 is likely the right choice, as it significantly reduces peak memory with a very minor impact on performance). membership_vector should handle this under the hood, as all_points_membership_vectors already does.

import cuml

X, y = cuml.make_blobs(n_samples=100, n_features=3)

clf = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True).fit(X)
cuml.cluster.hdbscan.all_points_membership_vectors(clf)[:5]
array([[1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
       [4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
       [4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
       [7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
       [4.5487583e-02, 1.0042124e-01, 5.8100939e-01]], dtype=float32)

I've filed a cuML issue to track this bug. In the meantime, your suggested workaround makes sense!
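The memory trade-off described above comes from processing predictions in fixed-size chunks; a CPU-only sketch of the idea (batched_apply is hypothetical, not cuML API):

```python
import numpy as np

def batched_apply(points, fn, batch_size=4096):
    # Apply fn chunk by chunk so peak memory scales with batch_size
    # rather than with the size of the whole dataset.
    results = []
    for start in range(0, len(points), batch_size):
        results.append(fn(points[start:start + batch_size]))
    return np.concatenate(results)

x = np.arange(10)
print(batched_apply(x, lambda batch: batch * 2, batch_size=3))
# [ 0  2  4  6  8 10 12 14 16 18]
```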

beckernick (Contributor) commented Jun 6, 2023

For completeness, this membership_vector bug has now been fixed in cuML. It won't be available in the 23.06 stable release that is about to happen, but it is now available in the 23.08 nightly packages.

HeadCase commented Sep 8, 2023

Is this actually fixed in cuML 23.08? I installed cuML using the instructions at https://docs.rapids.ai/install, and from cuml import __version__ reports 23.08.00. Running the original poster's code example exactly as-is still produces AttributeError: 'tuple' object has no attribute 'shape'. Is there something I am missing?

nilsblessing
I'm facing the same issue with cuML 23.10.0 and BERTopic 0.16.0. Is there a workaround or fix available?

beckernick (Contributor) commented Apr 15, 2024

As of last week, cuML 24.04 is now available. I think it's probably fair to say that almost everyone using cuML with BERTopic is using a version that supports the membership_vector function.

If there's interest and bandwidth from the maintainers to provide reviews, I'm happy to open a PR that resolves this issue and the effectively equivalent #1764 (essentially, an updated version of this PR).

cc @MaartenGr
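For codebases that must tolerate a range of cuML versions (or machines without cuML at all), a guarded import is one hedged option; this sketch degrades gracefully when cuML is missing or too old:

```python
try:
    # Present in cuML releases that ship HDBSCAN soft-cluster prediction.
    from cuml.cluster.hdbscan.prediction import membership_vector
    HAS_MEMBERSHIP_VECTOR = True
except ImportError:
    # cuML missing or too old: callers can fall back to CPU hdbscan here.
    membership_vector = None
    HAS_MEMBERSHIP_VECTOR = False

print(HAS_MEMBERSHIP_VECTOR)
```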

MaartenGr (Owner)

@beckernick Thanks, that would be great! This has been open for way too long (which is definitely my fault!), so a PR that updates this to use membership_vector sounds good. I also intend to release a minor version of BERTopic soon with many fixes, so that would be nice timing to have this included.

beckernick (Contributor)

Sounds good!

beckernick (Contributor)

Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

The original example works with this PR:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)
pd.Series(topics_test).value_counts()

2024-04-30 23:29:26,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████| 469/469 [00:14<00:00, 31.43it/s]
2024-04-30 23:29:42,841 - BERTopic - Embedding - Completed
2024-04-30 23:29:42,842 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-30 23:29:43,006 - BERTopic - Dimensionality - Completed
2024-04-30 23:29:43,008 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-30 23:29:43,170 - BERTopic - Cluster - Completed
2024-04-30 23:29:43,175 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-30 23:29:46,567 - BERTopic - Representation - Completed
Batches: 100%|█████████████████████████████████████████████████████████| 121/121 [00:03<00:00, 30.64it/s]
2024-04-30 23:29:51,410 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-04-30 23:29:51,431 - BERTopic - Dimensionality - Completed
2024-04-30 23:29:51,432 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-04-30 23:29:51,439 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-04-30 23:29:51,446 - BERTopic - Probabilities - Completed
2024-04-30 23:29:51,447 - BERTopic - Cluster - Completed
 0     1176
-1      551
 1      390
 2      362
 4      221
 3      190
 5      157
 6      155
 7      131
 8      122
 9       95
 10      66
 11      57
 12      42
 13      42
 14      40
 15      20
 17      17
 16      12
Name: count, dtype: int64

MaartenGr (Owner)

@beckernick

> Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

That is all too familiar these days! So thanks for taking the time to create the PR. When it passes, I'll go ahead and merge it in preparation for a minor release.
