Adding Clustering to Flair #2573

Merged
merged 38 commits into from
Feb 4, 2022

@OatsProduction
Contributor

OatsProduction commented Dec 28, 2021

This PR is the result of the study project for @alanakbik. It adds three clustering algorithms to flair:

  • k-Means
  • BIRCH
  • Expectation Maximization

flair/test.py (outdated, resolved)
@alanakbik
Collaborator

@OatsProduction thanks a lot for adding this! On first look-through a few points:

  • method names and variable names should always be "snake case" (class names are correct to be camel case)
  • the filenames should be lowercased, i.e. "clustering.py" instead of "Clustering.py"
  • we now use black formatting: on the newest flair master branch, go to the root folder and run black --config pyproject.toml flair/ && isort flair/ before pushing to the repo. Check here for more info.
  • some files should be removed since they only pertain to your own experiments
  • you can put example scripts as instructions into this issue

Member

@whoisjones whoisjones left a comment

Thanks a lot for adding this feature @OatsProduction. Some points regarding this PR:

  • Can you remove all files not related to this PR? It helps with reviewing.
  • I have only reviewed KMeans so far. I will try to help in the next few days with our conventions regarding naming, required trainers and so on.

flair/__init__.py (outdated, resolved)
flair/data.py (outdated, resolved)
flair/data_fetcher.py (outdated, resolved)
flair/datasets/__init__.py (outdated, resolved)
flair/datasets/base.py (outdated, resolved)
flair/models/clustering/kmeans/k_Means.py (outdated, resolved)
flair/models/clustering/kmeans/k_Means.py (outdated, resolved)
flair/models/clustering/kmeans/k_Means.py (outdated, resolved)
flair/models/clustering/kmeans/k_Means.py (outdated, resolved)
flair/models/clustering/kmeans/k_Means.py (outdated, resolved)
@OatsProduction
Contributor Author

Thanks for reviewing this PR.

The current status of this PR:

  • k-Means works and can be reviewed
  • BIRCH needs some fixes, which I want to make before the review
  • EM Clustering also needs fixing

I opened this PR mainly to get it started and off my to-do list, so this is a WIP branch.
I hope this doesn't cause any confusion.

@OatsProduction
Contributor Author

EM Clustering is now done and functional. It can be reviewed.

BIRCH is almost done and can also be reviewed soon.

@OatsProduction
Contributor Author

OatsProduction commented Jan 12, 2022

How do you think I should add the evaluation datasets needed? Like the StackOverflow dataset?

Also, my idea for saving/loading the model:

  • Saving: I will save the parameters of the clustering algorithms to a file.
  • Loading: I will load the parameters of the clustering algorithm from a file.

The integration of the sklearn clustering algorithms with flair is done.
Next will be the evaluation.

@whoisjones
Member

How do you think I should add the evaluation datasets needed? Like the StackOverflow dataset?

Also, my idea for saving/loading the model:

  • Saving: I will save the parameters of the clustering algorithms to a file.
  • Loading: I will load the parameters of the clustering algorithm from a file.

The integration of the sklearn clustering algorithms with flair is done. Next will be the evaluation.

https://scikit-learn.org/stable/modules/model_persistence.html
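
A minimal sketch of the persistence approach described on the linked scikit-learn page, assuming the fitted estimator is pickled as a whole instead of having its parameters written out by hand. The toy data and the file name `kmeans.pkl` are illustrative and not taken from this PR.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two well-separated groups (illustrative data only).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Fit a k-means estimator; random_state makes the run repeatable.
fitted = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Serialize the whole fitted estimator, then read it back.
model_path = Path(tempfile.mkdtemp()) / "kmeans.pkl"  # hypothetical file name
with model_path.open("wb") as handle:
    pickle.dump(fitted, handle)
with model_path.open("rb") as handle:
    restored = pickle.load(handle)

# The restored estimator assigns the same clusters as the original.
same_assignments = bool((restored.predict(points) == fitted.predict(points)).all())
```

Pickle files should only be loaded from trusted sources, and a pickled estimator may not load correctly under a different scikit-learn version.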

@OatsProduction
Contributor Author

The stackoverflow dataset comes from https://github.com/jacoxu/StackOverflow
I will build a corpus for this dataset.

@OatsProduction
Contributor Author

I added the StackOverflow corpus, but my current implementation fails.
I think the reason is that I need a training set, not only a test set.

@whoisjones can you look at this one?

Member

@whoisjones whoisjones left a comment

@OatsProduction regarding your question: see my review comments. There are also some other remarks :)

flair/datasets/document_classification.py (resolved)
flair/datasets/document_classification.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
resources/docs/TUTORIAL_12_CLUSTERING.md (outdated, resolved)
added evaluation method
improved the tutorial
Collaborator

@alanakbik alanakbik left a comment

Thanks @OatsProduction - looks good but some things:

  1. The signatures are a bit counterintuitive. Why is the ClusteringModel instantiated with the corpus? The corpus is only needed for the fit() and evaluate() methods, so it would be better to pass the corpus in there. So instead of
corpus = TREC_6(memory_mode='full').downsample(0.05)

model = KMeans(n_clusters=6)

clustering_model = ClusteringModel(
    model=model,
    corpus=corpus,
    label_type="question_class",
    embeddings=embeddings
)

# fit the model
clustering_model.fit()

# evaluate the model
clustering_model.evaluate()

it should be:

corpus = TREC_6(memory_mode='full').downsample(0.05)

model = KMeans(n_clusters=6)

clustering_model = ClusteringModel(
    model=model,
    label_type="question_class",
    embeddings=embeddings
)

# fit the model on a corpus
clustering_model.fit(corpus)

# evaluate the model on a corpus
clustering_model.evaluate(corpus)
  2. The loading is counterintuitive as it requires a different clustering method to already be initialized. Better would be a static method:
    # load saved clustering model
    model = ClusteringModel.load(model_file="clustering_model.pt")
    
    # make example sentence
    sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
    
    # predict for sentence
    model.predict(sentence)
    
    # print sentence with prediction
    print(sentence)
  3. Small thing, but it would be nice to use label names instead of numbers in the STACKOVERFLOW corpus
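
The call pattern requested in the first two points (data passed to fit(), loading through a static method) can be sketched as follows. This is not flair's ClusteringModel: `SketchClusteringModel`, its crude one-dimensional clustering, and the JSON file format are all hypothetical and serve only to show the shape of the interface.

```python
import json
import tempfile
from pathlib import Path


class SketchClusteringModel:
    """Hypothetical 1-D clusterer; it illustrates the call pattern only."""

    def __init__(self, n_clusters):
        # The constructor takes configuration only, no data.
        self.n_clusters = n_clusters
        self.centers = []

    def fit(self, values):
        # Data arrives here. Centers are the means of equal-sized chunks of
        # the sorted values (assumes len(values) >= n_clusters).
        ordered = sorted(values)
        size = len(ordered)
        self.centers = []
        for i in range(self.n_clusters):
            chunk = ordered[i * size // self.n_clusters:(i + 1) * size // self.n_clusters]
            self.centers.append(sum(chunk) / len(chunk))
        return self

    def predict(self, values):
        # Each value gets the index of its nearest center.
        return [
            min(range(len(self.centers)), key=lambda i: abs(value - self.centers[i]))
            for value in values
        ]

    def save(self, model_file):
        state = {"n_clusters": self.n_clusters, "centers": self.centers}
        with open(model_file, "w") as handle:
            json.dump(state, handle)

    @staticmethod
    def load(model_file):
        # No pre-built instance is needed: the file alone restores the model.
        with open(model_file) as handle:
            state = json.load(handle)
        model = SketchClusteringModel(n_clusters=state["n_clusters"])
        model.centers = state["centers"]
        return model


model = SketchClusteringModel(n_clusters=2)
model.fit([1.0, 1.2, 0.8, 9.0, 9.4, 8.6])

model_file = Path(tempfile.mkdtemp()) / "sketch_model.json"  # hypothetical file name
model.save(model_file)

restored = SketchClusteringModel.load(model_file=model_file)
labels = restored.predict([1.1, 8.9])
```

Keeping the data out of the constructor means the same configured model can be fitted or evaluated on different corpora, and the static load() removes the need to build a throwaway instance first.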

flair/datasets/document_classification.py (outdated, resolved)
flair/models/clustering.py (outdated, resolved)
@OatsProduction
Contributor Author

Added every remark to the code. Need another review on this PR.

@alanakbik
Collaborator

@OatsProduction the code from the tutorial throws an error during the predict method:

from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)

Can you fix it so that the tutorial code works?

Two other things:

  • it does not work unless the memory_mode in the corpus is set to full.
  • training and evaluation are performed over the full corpus. Is this standard in clustering evaluation using text classification datasets? In text classification, you train and evaluate over different splits.

@OatsProduction
Contributor Author

Some comments on the issues raised above:

  • The code in the tutorial.md works now
  • The memory_mode needs to be full on the STACKOVERFLOW dataset. Maybe I did something wrong in the loading? I will investigate this further.
  • Clustering only makes sense on a whole dataset. Training is not something we really need, so I think using the whole corpus makes the most sense.
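
As an illustration of scoring a clustering against the class labels of a whole dataset, without a train/test split, here is a sketch of cluster purity. Purity is chosen only because it is simple to state; the thread does not say which metric the PR's evaluate() method uses, and the function name and toy labels below are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Fraction of items whose gold label is the majority label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each gold label occurs inside each cluster.
    labels_per_cluster = defaultdict(Counter)
    for cluster_id, gold in zip(cluster_ids, gold_labels):
        labels_per_cluster[cluster_id][gold] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in labels_per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example: six items, two clusters, two gold classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
gold_labels = ["LOC", "LOC", "NUM", "NUM", "NUM", "LOC"]
score = purity(cluster_ids, gold_labels)  # 4 of 6 items match their cluster majority
```

Because the gold labels are used only for scoring and never for fitting, computing such a score over the same data the clustering was fitted on does not leak label information into the model.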

@alanakbik
Collaborator

Thanks @OatsProduction - I'll merge this and take care of the flake errors.

@alanakbik alanakbik merged commit ef2763d into flairNLP:master Feb 4, 2022
3 participants