
c_tf_idf_.indptr is None when attempting to save merged model #2020

Closed
ddicato opened this issue May 29, 2024 · 6 comments · Fixed by #2112
Labels
good first issue Good for newcomers

Comments

@ddicato
Contributor

ddicato commented May 29, 2024

After using BERTopic.merge_models, I am unable to save the resulting model. Here is some repro code:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA

print(f'Using bertopic version {bertopic.__version__}')

embedding_model = 'all-mpnet-base-v2'

def new_topic_model():
    return BERTopic(
        embedding_model=embedding_model,
        # Using PCA because UMAP segfaults on my machine
        umap_model=PCA(n_components=10, random_state=40),
    )

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
docs_a = docs[:400]
docs_b = docs[400:500]

topic_model_a = new_topic_model()
topic_model_a.fit(docs_a)
# No problem saving topic_model_a
topic_model_a.save(
    './topic_model_a',
    serialization='safetensors',
    save_embedding_model=embedding_model,
    save_ctfidf=True,
)

topic_model_b = new_topic_model()
topic_model_b.fit(docs_b)
# No problem saving topic_model_b
topic_model_b.save(
    './topic_model_b',
    serialization='safetensors',
    save_embedding_model=embedding_model,
    save_ctfidf=True,
)

topic_model = BERTopic.merge_models([topic_model_a, topic_model_b])
# Problem here: AttributeError: 'NoneType' object has no attribute 'indptr'
topic_model.save(
    './topic_model',
    serialization='safetensors',
    save_embedding_model=embedding_model,
    save_ctfidf=True,
)

I get the following output:

% python3 repro.py
Using bertopic version 0.16.2
Traceback (most recent call last):
  File "[...]/BERTopic/repro.py", line 43, in <module>
    topic_model.save(
  File "[...]/BERTopic/bertopic/_bertopic.py", line 3112, in save
    save_utils.save_ctfidf(model=self, save_directory=save_directory, serialization=serialization)
  File "[...]/BERTopic/bertopic/_save_utils.py", line 318, in save_ctfidf
    indptr = torch.from_numpy(model.c_tf_idf_.indptr)
                              ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'indptr'
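For illustration, the failure mode reduces to accessing `.indptr` on an attribute that was never populated. This is a minimal stand-in that does not require BERTopic; `MergedModel` is a hypothetical stub, not a real class in the library:

```python
# Minimal stand-in for the failure: merge_models leaves c_tf_idf_ unset,
# and the save path then dereferences it unconditionally.
class MergedModel:
    c_tf_idf_ = None  # no c-TF-IDF matrix after merging

model = MergedModel()
try:
    model.c_tf_idf_.indptr  # same access pattern as in _save_utils.save_ctfidf
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'indptr'
```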
@ddicato
Contributor Author

ddicato commented May 29, 2024

I notice that this is only a problem when I run with save_ctfidf=True. With save_ctfidf=False, this code executes normally. Is saving c-TF-IDF state supported for a merged model?

@MaartenGr
Owner

Ah, I think that's because .merge_models does not support c-TF-IDF at the moment since both models have different vocabularies and representations that might be tricky to merge. So unfortunately this is not supported at the moment.

@ddicato
Contributor Author

ddicato commented Jun 5, 2024

In that case, the question is: how do we handle saving models when save_ctfidf=True and there is no c-TF-IDF state?

I would propose that the fix is one of:

  1. Silently ignore the missing c-TF-IDF data and save the rest of the model
  2. Raise an exception with a better error message, e.g. "Unable to save model with missing c-TF-IDF data, perhaps because the model was merged. Re-try with save_ctfidf=False."

@MaartenGr Any preference between (1) and (2)?

@MaartenGr
Owner

@ddicato I would actually suggest a third solution. Instead of raising an exception, log a warning that mentions that the c-TF-IDF representation cannot be saved since .merge_models was used (or something that does not relate to .merge_models specifically). And then still save the rest of the model. That way, users do not have to change the parameter, which saves them a step while still informing them about what happened.
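The third option could be sketched roughly as follows. This is a hypothetical guard, not the actual code merged into bertopic/_save_utils.py; the function name `save_ctfidf_guarded` and the return values are illustrative:

```python
import logging

logger = logging.getLogger("bertopic")

def save_ctfidf_guarded(model, save_directory, serialization):
    """Skip saving c-TF-IDF when it is missing, instead of crashing.

    Hypothetical sketch of the warning-based fix discussed above; the
    real save path serializes model.c_tf_idf_ (a sparse matrix).
    """
    if getattr(model, "c_tf_idf_", None) is None:
        logger.warning(
            "The c-TF-IDF representation is missing (for example after "
            ".merge_models) and will not be saved; the rest of the model "
            "is saved as usual."
        )
        return False  # nothing written
    # ... original c-TF-IDF serialization logic would run here ...
    return True
```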

As a side note, I have thought about trying to merge the c-TF-IDF representations from both models since they generally have the same underlying c-TF-IDF parameters. It seldom happens that users use different CountVectorizers for the models they merge. It would, however, generally require access to the Bag-of-Words representation (which is not saved) rather than the c-TF-IDF representation (which is saved). Hmm, something to think about...
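As an illustration of that side note, aligning two Bag-of-Words matrices with different vocabularies onto a shared vocabulary is straightforward with sparse matrices. This is a hedged sketch only; `merge_bow` is a hypothetical helper, not part of BERTopic, and assumes each model kept its BoW matrix and vocabulary list:

```python
import numpy as np
from scipy import sparse

def merge_bow(bow_a, vocab_a, bow_b, vocab_b):
    """Project two Bag-of-Words matrices onto their union vocabulary
    and stack them, after which c-TF-IDF could be recomputed once for
    the merged model. Hypothetical sketch, not BERTopic code."""
    merged_vocab = sorted(set(vocab_a) | set(vocab_b))
    index = {word: j for j, word in enumerate(merged_vocab)}

    def project(bow, vocab):
        # Remap each nonzero's column index into the union vocabulary.
        bow = sparse.coo_matrix(bow)
        cols = np.array([index[vocab[j]] for j in bow.col])
        return sparse.csr_matrix(
            (bow.data, (bow.row, cols)),
            shape=(bow.shape[0], len(merged_vocab)),
        )

    merged = sparse.vstack([project(bow_a, vocab_a), project(bow_b, vocab_b)])
    return merged.tocsr(), merged_vocab
```

With one topic per row, the stacked matrix keeps each model's counts in the shared column order, so a single CountVectorizer-compatible vocabulary survives the merge.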

@ddicato
Contributor Author

ddicato commented Jun 6, 2024

I like your third solution better. Users get a more useful error message with minimal disruption.

That's an interesting side note. Would this issue be the right place to continue the discussion? #1878

@MaartenGr MaartenGr added the good first issue Good for newcomers label Jun 8, 2024
@MaartenGr
Owner

> I like your third solution better. Users get a more useful error message with minimal disruption.

Good to hear! I will put it on the backlog for me to do, but if you or anyone else wants to work on this then that would be highly appreciated.

> That's an interesting side note. Would this issue be the right place to continue the discussion? #1878

Ah right, that might be a nice place to indeed continue that discussion.

@sete39 mentioned this issue Aug 4, 2024