Adding Clustering to Flair #2573

OatsProduction · 2021-12-28T11:49:34Z

This PR is the result of my study project for @alanakbik. It adds three clustering algorithms to flair:

k-Means
BIRCH
Expectation Maximization

flair/models/clustering/Clustering.py

flair/models/clustering/Evaluation.py

flair/models/clustering/readme.md

flair/models/clustering/run_BIRCH.py

flair/test.py

alanakbik · 2021-12-28T12:11:31Z

@OatsProduction thanks a lot for adding this! On first look-through a few points:

method names and variable names should always be "snake case" (class names are correct to be camel case)
the filenames should be lowercased, i.e. "clustering.py" instead of "Clustering.py"
we are now using black formatting: in the newest flair master branch, go to the root folder and run black --config pyproject.toml flair/ && isort flair/ before pushing to the repo. Check here for more info.
some files should be removed since they only pertain to your own experiments
you can put example scripts as instructions into this issue
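
To illustrate the naming convention from the first point, here is a minimal sketch. The class and method names are hypothetical and not taken from the PR:

```python
class ExampleClusterer:  # class names: CamelCase
    """Hypothetical class used only to illustrate the naming convention."""

    def __init__(self, n_clusters: int) -> None:
        self.n_clusters = n_clusters  # attribute names: snake_case

    def get_cluster_count(self) -> int:  # method names: snake_case, not getClusterCount
        return self.n_clusters


example_clusterer = ExampleClusterer(n_clusters=3)
print(example_clusterer.get_cluster_count())
```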

all in snake_case

whoisjones

Thanks a lot for adding this feature @OatsProduction. Some points regarding this PR:

Can you remove all files not related to this PR? That makes it easier to review.
I have only reviewed KMeans so far. I will try to help over the next few days with our conventions regarding naming, required trainers, and so on.

flair/__init__.py

flair/data.py

flair/data_fetcher.py

flair/datasets/__init__.py

flair/datasets/base.py

flair/models/clustering/kmeans/k_Means.py

OatsProduction · 2021-12-31T01:08:32Z

Thanks for reviewing this PR.

The current status of this PR is:

k-Means works and can be reviewed
BIRCH needs some fixes, which I want to make before the review
EM Clustering also needs fixing

I opened this PR early mainly to get it started and off my to-do list, so this is a WIP branch.
I hope this doesn't cause any confusion.

better formatting

added some improvements

OatsProduction · 2022-01-08T12:47:34Z

EM Clustering is now done and functional, and can be reviewed.

BIRCH is almost done and can also be reviewed soon.

Clustering refactorings

improved the TUTORIAL_12_CLUSTERING.md

OatsProduction · 2022-01-12T12:09:20Z

How do you think I should add the evaluation datasets that are needed, such as the StackOverflow dataset?

Also, my idea for saving/loading the model:

Saving: I will save the parameters of the clustering algorithms to a file.
Loading: I will load the parameters of the clustering algorithm from a file.

The integration of the sklearn clustering algorithms with flair is done.
Next will be the evaluation.
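
For context, a minimal sketch of the three algorithms as they exist in scikit-learn. It assumes that Expectation Maximization corresponds to `GaussianMixture` (the thread does not name the class) and uses random vectors in place of real document embeddings:

```python
import numpy as np
from sklearn.cluster import Birch, KMeans
from sklearn.mixture import GaussianMixture

# Random vectors stand in for document embeddings (illustrative only).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(90, 4))

estimators = {
    "k-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "BIRCH": Birch(n_clusters=3),
    # Assumption: EM clustering is represented by a Gaussian mixture model.
    "EM": GaussianMixture(n_components=3, random_state=0),
}

cluster_ids = {}
for name, estimator in estimators.items():
    # All three expose fit() and predict() on a 2-D array of vectors.
    estimator.fit(vectors)
    cluster_ids[name] = estimator.predict(vectors)
    print(name, cluster_ids[name][:10])
```

Because the three estimators share `fit()` and `predict()`, a single wrapper class can accept any of them.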

whoisjones · 2022-01-13T13:16:52Z

https://scikit-learn.org/stable/modules/model_persistence.html
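
The linked page covers persisting fitted estimators. Here is a minimal sketch using Python's `pickle` module, which is one of the options described there. The file name and toy data are placeholders:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups (illustrative only).
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "kmeans.pkl"  # placeholder file name

    # Pickling stores hyperparameters and fitted state (e.g. cluster_centers_).
    with model_path.open("wb") as handle:
        pickle.dump(kmeans, handle)

    with model_path.open("rb") as handle:
        restored = pickle.load(handle)

# The restored estimator predicts without being refit.
print(restored.predict(points))
```

Pickle files should only be loaded from trusted sources, and loading with a different scikit-learn version than the one used for saving is not guaranteed to work.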

OatsProduction · 2022-01-13T16:02:41Z

The StackOverflow dataset comes from https://github.com/jacoxu/StackOverflow
I will build a corpus for this dataset.

OatsProduction · 2022-01-15T12:37:28Z

I added a StackOverflow corpus, but my current implementation fails.
I think the reason is that I need a training set, not only a test set.

@whoisjones can you look at this one?

whoisjones

@OatsProduction regarding your question: see my review comments. There are also some other remarks :)

flair/datasets/document_classification.py

flair/models/clustering.py

resources/docs/TUTORIAL_12_CLUSTERING.md

added evaluation method improved the tutorial

alanakbik

Thanks @OatsProduction - looks good but some things:

The signatures are a bit counterintuitive. Why is the ClusteringModel instantiated with the corpus? The corpus is only needed for the fit() and evaluate() methods, so it would be better to pass the corpus in there. So instead of

corpus = TREC_6(memory_mode='full').downsample(0.05)

model = KMeans(n_clusters=6)

clustering_model = ClusteringModel(
    model=model,
    corpus=corpus,
    label_type="question_class",
    embeddings=embeddings
)

# fit the model
clustering_model.fit()

# evaluate the model
clustering_model.evaluate()

it should be:

corpus = TREC_6(memory_mode='full').downsample(0.05)

model = KMeans(n_clusters=6)

clustering_model = ClusteringModel(
    model=model,
    label_type="question_class",
    embeddings=embeddings
)

# fit the model on a corpus
clustering_model.fit(corpus)

# evaluate the model on a corpus
clustering_model.evaluate(corpus)
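
One way to read the proposed signature is that the wrapper holds only the estimator and the embedding step, while the data arrives as an argument. The sketch below is a toy stand-in, not flair's ClusteringModel. `ToyClusteringModel`, `NearestSeedEstimator`, and the length-based embedding function are hypothetical names invented for illustration.

```python
from typing import Callable, List, Sequence


class NearestSeedEstimator:
    """Toy 1-D estimator: the seeds are the smallest and largest fitted values."""

    def fit(self, values: Sequence[float]) -> "NearestSeedEstimator":
        self.seeds = [min(values), max(values)]
        return self

    def predict(self, values: Sequence[float]) -> List[int]:
        # Assign each value to the index of its nearest seed.
        return [
            min(range(len(self.seeds)), key=lambda i: abs(v - self.seeds[i]))
            for v in values
        ]


class ToyClusteringModel:
    """Holds the estimator and the embedding step; the corpus is passed per call."""

    def __init__(self, model: NearestSeedEstimator, embed: Callable[[str], float]) -> None:
        self.model = model
        self.embed = embed

    def fit(self, corpus: Sequence[str]) -> None:
        self.model.fit([self.embed(text) for text in corpus])

    def predict(self, corpus: Sequence[str]) -> List[int]:
        return self.model.predict([self.embed(text) for text in corpus])


# Text length serves as a stand-in "embedding" (illustrative only).
clustering_model = ToyClusteringModel(model=NearestSeedEstimator(), embed=len)

# The same instance can be fit on one corpus and applied to another.
train_corpus = ["hi", "yo", "a much longer sentence", "another long sentence"]
other_corpus = ["ok", "this one is long as well"]
clustering_model.fit(train_corpus)
print(clustering_model.predict(other_corpus))
```

With this shape, one configured model can be reused across corpora without being rebuilt.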

The loading is counterintuitive as it requires a different clustering method to already be initialized. Better would be a static method:

    # load saved clustering model
    model = ClusteringModel.load(model_file="clustering_model.pt")
    
    # make example sentence
    sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
    
    # predict for sentence
    model.predict(sentence)
    
    # print sentence with prediction
    print(sentence)
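
A static loader of this kind could be sketched as follows. This is a toy illustration, not the PR's implementation: `ToyModel`, its `centroids` attribute, and the JSON file format are all hypothetical choices.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class ToyModel:
    """Hypothetical model whose only state is a list of 1-D centroids."""

    def __init__(self, centroids: List[float]) -> None:
        self.centroids = centroids

    def save(self, model_file: str) -> None:
        with open(model_file, "w", encoding="utf-8") as handle:
            json.dump({"centroids": self.centroids}, handle)

    @staticmethod
    def load(model_file: str) -> "ToyModel":
        # Called on the class itself, so no instance has to exist beforehand.
        with open(model_file, encoding="utf-8") as handle:
            state = json.load(handle)
        return ToyModel(centroids=state["centroids"])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = str(Path(tmp_dir) / "toy_model.json")  # placeholder file name
    ToyModel(centroids=[0.5, 4.0]).save(path)
    loaded = ToyModel.load(path)

print(loaded.centroids)
```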

Small thing but it would be nice to use label names instead of numbers in the STACKOVERFLOW corpus

flair/datasets/document_classification.py

flair/models/clustering.py

better labels for corpus STACKOVERFLOW

OatsProduction · 2022-01-20T14:17:34Z

Added every remark to the code. Need another review on this PR.

alanakbik · 2022-01-26T15:05:38Z

@OatsProduction the code from the tutorial throws an error during the predict method:

from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)

Can you fix it so that the tutorial code works?

Two other things:

It does not work unless the memory_mode of the corpus is set to full.
Training and evaluation are performed over the full corpus. Is this standard in clustering evaluation using text classification datasets? In text classification, you train and evaluate over different splits.

OatsProduction · 2022-02-02T10:47:06Z

Some comments on the issues raised above:

The code in the tutorial.md works now.
The memory_mode needs to be full on the STACKOVERFLOW dataset. Maybe I did something wrong in the loading? I will investigate this further.
Clustering only makes sense on a whole dataset. Training is something we don't really need, so I think that using the whole corpus makes the most sense.
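
When a labeled classification dataset is used to score a clustering, the cluster ids are compared with the gold labels after the fact. One simple measure is purity, sketched below. This is an illustrative choice and not necessarily the metric this PR uses; the labels and cluster ids are made-up examples.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def purity(cluster_ids: Sequence[int], gold_labels: Sequence[Hashable]) -> float:
    """Fraction of items whose gold label is the majority label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if len(cluster_ids) == 0:
        raise ValueError("at least one item is required")

    labels_per_cluster: Dict[int, List[Hashable]] = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, gold_labels):
        labels_per_cluster[cluster_id].append(label)

    # For each cluster, count the items carrying its most frequent gold label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in labels_per_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Made-up example: cluster ids are arbitrary, so they need not match label names.
example_cluster_ids = [0, 0, 0, 1, 1, 1]
example_gold_labels = ["LOC", "LOC", "NUM", "NUM", "NUM", "HUM"]
print(purity(example_cluster_ids, example_gold_labels))
```

Purity rises trivially as the number of clusters grows, so it is only meaningful when comparing runs with the same number of clusters.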

alanakbik · 2022-02-04T10:43:57Z

Thanks @OatsProduction - I'll merge this and take care of the flake errors.

init

aa0f52e