Seralization of embeddings #3011

Merged: 22 commits, Jan 25, 2023

@helpmefindaname helpmefindaname commented Dec 5, 2022

To be safer with regard to pickle, I propose using a dict format to store all properties required to recreate the embeddings (the weights are stored with the model itself anyway).
This allows opening Flair models with incompatible parameters via torch.load(...) and therefore makes it possible to debug version conflicts.
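
To illustrate the idea, here is a minimal sketch of dict-based serialization. All names in it (ToyEmbeddings, ToyHashEmbeddings, EMBEDDING_REGISTRY, the "__cls__" key, load_embeddings) are hypothetical and chosen for illustration; they are not necessarily what this PR implements.

```python
from typing import Any, Dict, Type

# Hypothetical registry: maps a class name to the class that can rebuild it.
EMBEDDING_REGISTRY: Dict[str, Type["ToyEmbeddings"]] = {}


def register(cls):
    """Class decorator that makes an embedding class loadable by name."""
    EMBEDDING_REGISTRY[cls.__name__] = cls
    return cls


class ToyEmbeddings:
    def to_params(self) -> Dict[str, Any]:
        """Return only plain data needed to call the constructor again."""
        raise NotImplementedError

    @classmethod
    def from_params(cls, params: Dict[str, Any]) -> "ToyEmbeddings":
        return cls(**params)

    def save(self) -> Dict[str, Any]:
        # Store the class name next to the parameters instead of pickling the object.
        return {"__cls__": type(self).__name__, **self.to_params()}


@register
class ToyHashEmbeddings(ToyEmbeddings):
    def __init__(self, num_buckets: int = 1000, embedding_length: int = 16):
        self.num_buckets = num_buckets
        self.embedding_length = embedding_length

    def to_params(self) -> Dict[str, Any]:
        return {
            "num_buckets": self.num_buckets,
            "embedding_length": self.embedding_length,
        }


def load_embeddings(state: Dict[str, Any]) -> ToyEmbeddings:
    params = dict(state)  # copy, so the saved state is not mutated
    cls = EMBEDDING_REGISTRY[params.pop("__cls__")]
    return cls.from_params(params)


state = ToyHashEmbeddings(num_buckets=50, embedding_length=8).save()
restored = load_embeddings(state)
```

Because the saved state is a plain dict, it can still be opened and inspected when a constructor signature has changed between versions; only the final from_params call would fail.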

During development I also found & fixed the following issues:

  • DocumentLMEmbeddings did not provide the right names for their embeddings, so the correct usage doc_lm_embedding.embed(sentence); sentence.get_embedding(doc_lm_embedding.get_names()) would return an empty tensor.
  • Frozen FlairEmbeddings always used dropout: since the .train() method didn't call its super method, the .eval() call in __init__ had no effect, so dropout stayed enabled, as that is the default.
  • Some embeddings were not in .eval() mode after creation.
  • Added tests for embeddings that had no tests yet.
  • HashEmbeddings raised an index error unless each sentence had exactly one token (an unflattened array was indexed).
  • ElmoEmbeddings are deprecated, as support for allennlp will end soon.
  • The TextRegression model is now properly importable as from flair.models import TextRegressor
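
The dropout issue in the frozen FlairEmbeddings can be reproduced without any dependencies. The sketch below uses a tiny stand-in Module class that mimics only the train/eval bookkeeping of torch.nn.Module (where eval() is implemented as train(False)); the Buggy/Fixed classes are hypothetical and are not the actual Flair code.

```python
class Module:
    """Stand-in for torch.nn.Module: only the training-flag logic is modelled."""

    def __init__(self):
        self.training = True  # modules start in training mode by default
        self._children = []

    def add_child(self, child):
        self._children.append(child)
        return child

    def train(self, mode: bool = True):
        self.training = mode
        for child in self._children:
            child.train(mode)
        return self

    def eval(self):
        # Same delegation as in torch: eval() goes through train(False).
        return self.train(False)


class Dropout(Module):
    def is_active(self) -> bool:
        return self.training


class BuggyFrozenEmbeddings(Module):
    def __init__(self):
        super().__init__()
        self.dropout = self.add_child(Dropout())
        self.eval()  # silently does nothing, see train() below

    def train(self, mode: bool = True):
        # Bug: never calls super().train(...), so no flag is ever updated.
        return self


class FixedFrozenEmbeddings(Module):
    def __init__(self):
        super().__init__()
        self.dropout = self.add_child(Dropout())
        self.eval()

    def train(self, mode: bool = True):
        # Frozen: ignore the requested mode, but let the base class set the flags.
        return super().train(False)


buggy = BuggyFrozenEmbeddings()
fixed = FixedFrozenEmbeddings()
fixed.train(True)  # a later model.train() must not re-enable dropout
```

In this sketch buggy.dropout stays active after construction, while fixed.dropout is disabled and remains disabled even after train(True) is called.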

This also implements two classes, AutoFlairModel and AutoFlairClassifier, which can be used to load any model, given that its type is clear.
Example usages are here:

from flair import AutoFlairClassifier
from flair.models import MultitaskModel
tagger = AutoFlairClassifier.load("ner-large")
tars = AutoFlairClassifier.load("tars-tagger")
tars.save("model.pt")
tars2 = AutoFlairClassifier.load("model.pt")
relation = AutoFlairClassifier.load("relation")
offensive = AutoFlairClassifier.load("de-offensive-language")
multi = MultitaskModel([offensive, tagger, relation])
multi.save("multi.pt")
multi2 = AutoFlairClassifier.load("multi.pt")
...

The difference between AutoFlairModel and AutoFlairClassifier is that AutoFlairClassifier is limited to classifiers (no text regressor), while it provides stronger type hints (all the extra methods a Classifier provides, e.g. predict).
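
One way such a pair of loaders could be structured is a shared class registry plus a per-loader base-class check. This is only a sketch under assumptions: the class names, the "__cls__" key and the dict-shaped state are hypothetical stand-ins, not the code in this PR.

```python
from typing import Any, Dict, Type


class Model:
    def __init__(self, name: str):
        self.name = name

    def state(self) -> Dict[str, Any]:
        # The class name is stored with the state so a loader can dispatch on it.
        return {"__cls__": type(self).__name__, "name": self.name}


class Classifier(Model):
    def predict(self, text: str) -> str:
        return f"{self.name}:{text}"


class AutoModel:
    _registry: Dict[str, Type[Model]] = {}
    base: Type[Model] = Model  # the most general type this loader may return

    @classmethod
    def register(cls, model_cls: Type[Model]) -> Type[Model]:
        AutoModel._registry[model_cls.__name__] = model_cls
        return model_cls

    @classmethod
    def load(cls, state: Dict[str, Any]) -> Model:
        model_cls = AutoModel._registry[state["__cls__"]]
        if not issubclass(model_cls, cls.base):
            raise TypeError(f"{model_cls.__name__} is not a {cls.base.__name__}")
        return model_cls(state["name"])


class AutoClassifier(AutoModel):
    base = Classifier  # narrower loader: results always offer predict()


@AutoModel.register
class ToyTagger(Classifier):
    pass


@AutoModel.register
class ToyRegressor(Model):
    pass


tagger = AutoClassifier.load(ToyTagger("ner").state())
regressor = AutoModel.load(ToyRegressor("score").state())
```

Here AutoModel.load accepts both toy classes, whereas AutoClassifier.load raises a TypeError for the regressor state, which is what lets the narrower loader promise the Classifier interface.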

Potential issues are:

  • Current models do not contain class information. I added a simple method that tries to parse the content into the right class, but that might fail. So before changing the code from model = SequenceTagger.load("my-model.pt") to model = AutoFlairClassifier.load("my-model.pt"), I would recommend loading the model once and saving it again on the newest version.
  • MultitaskModel got a rework of its internal state, so older models cannot be loaded. I don't see this as a big issue, as those were never released before, but it is something to be aware of.

helpmefindaname commented Dec 12, 2022

Status of Embeddings:

  • token.py
    • TransformerWordEmbeddings
    • StackedEmbeddings
    • WordEmbeddings
    • CharacterEmbeddings
    • FlairEmbeddings
    • PooledFlairEmbeddings
    • FastTextEmbeddings
    • OneHotEmbeddings
    • HashEmbeddings
    • MuseCrosslingualEmbeddings
    • BytePairEmbeddings
    • NILCEmbeddings
  • document.py
    • TransformerDocumentEmbeddings
    • DocumentPoolEmbeddings
    • DocumentTFIDFEmbeddings
    • DocumentRNNEmbeddings
    • DocumentLMEmbeddings
    • SentenceTransformerDocumentEmbeddings
    • DocumentCNNEmbeddings
  • transformer.py
    • TransformerOnnxEmbeddings
    • TransformerJitEmbeddings
    • TransformerJitWordEmbeddings
    • TransformerJitDocumentEmbeddings
    • TransformerOnnxWordEmbeddings
    • TransformerOnnxDocumentEmbeddings
    • TransformerEmbeddings
  • image.py
    • IdentityImageEmbeddings
    • PrecomputedImageEmbeddings
    • NetworkImageEmbeddings
    • ConvTransformNetworkImageEmbeddings

@helpmefindaname helpmefindaname changed the title from "WIP: Seralization of embeddings" to "Seralization of embeddings" on Dec 19, 2022
@alanakbik

Hello @helpmefindaname this is really cool, thanks for creating this!

Some initial thoughts for discussion:

  • I wonder if AutoFlairClassifier and Classifier can/should be merged into a single class: for instance Classifier could be renamed to FlairClassifier and the auto load logic added directly here. It would make the logic less distributed and the syntax (slightly) more succinct for end-users. i.e. load any flair model with:
model = FlairClassifier.load("ner")
  • I also wonder if a convenience method for loading "pipelines" could be added. For instance, if users do
model = FlairClassifier.load("ner", "pos", "relations")

it would load a whole pipeline that when calling model.predict() would annotate ner, pos and relation information on a sentence.
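
As a rough sketch of what such a pipeline could look like: the loader takes several model names and predict() runs each model in turn on the same sentence. Everything here (ToySentence, ToyTagger, ToyPipeline, the placeholder annotations) is hypothetical and only illustrates the proposal; it is not an existing Flair API.

```python
from typing import Dict, Iterable, List


class ToySentence:
    def __init__(self, text: str):
        self.text = text
        self.labels: Dict[str, str] = {}  # label type -> annotation


class ToyTagger:
    """Stand-in for a loaded model that annotates one label type."""

    def __init__(self, label_type: str):
        self.label_type = label_type

    def predict(self, sentence: ToySentence) -> None:
        sentence.labels[self.label_type] = f"<{self.label_type} annotations>"


class ToyPipeline:
    def __init__(self, models: Iterable[ToyTagger]):
        self.models: List[ToyTagger] = list(models)

    @classmethod
    def load(cls, *names: str) -> "ToyPipeline":
        # A real loader would resolve each name to a stored model here.
        return cls(ToyTagger(name) for name in names)

    def predict(self, sentence: ToySentence) -> ToySentence:
        # Each model adds its own annotation layer to the same sentence.
        for model in self.models:
            model.predict(sentence)
        return sentence


pipeline = ToyPipeline.load("ner", "pos", "relations")
sentence = pipeline.predict(ToySentence("George was born in Washington."))
```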

from flair.data import Dictionary
from flair.nn.recurrent import create_recurrent_layer


@AutoFlairModel.register

Why is the LanguageModel registered as AutoFlairModel?

@alanakbik

Thanks again for improving this @helpmefindaname! Regarding our discussion on whether/how to merge ModelRegisterMixin into the Model abstract base class, I'll check if I can find a good way to do this.

@alanakbik alanakbik merged commit dbc1569 into flairNLP:master Jan 25, 2023
@helpmefindaname helpmefindaname deleted the seralize_embeddings branch January 26, 2023 22:09