Test-Time Adaptation (TTA) explores improving a model's performance at test time instead of fine-tuning it in the "traditional" way. This can be a really effective and helpful practice, mainly for two reasons:
- 💥 Fine-tuning itself might not be straightforward. It really depends on the architecture, but it can be challenging.
- 💸 Big models require non-negligible computational capacity and data to work with (lots of money).
Our objective is to implement a TTA solution to improve an existing image classifier.
contributors : @LuCazzola @lorenzialessandro
The backbone model of choice is Contrastive Language–Image Pre-training (CLIP), a well-known model by OpenAI trained with the contrastive learning paradigm and capable of zero-shot classification.
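To make the zero-shot mechanism concrete, here is a minimal numpy sketch of CLIP's scoring step: the image embedding is compared against one text embedding per class via cosine similarity, and a softmax turns the scaled similarities into class probabilities. The random vectors stand in for real CLIP features, and the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_feat, text_feats, temperature=0.01):
    """Score an image against one text embedding per class, CLIP-style.

    Both inputs are L2-normalised; the logits are scaled cosine
    similarities, turned into probabilities with a softmax.
    """
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = image_feat @ text_feats.T / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    return probs / probs.sum()

# Toy example: 3 classes, 4-dim embeddings standing in for CLIP features.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 4))
image_feat = text_feats[1] + 0.1 * rng.normal(size=4)  # close to class 1
probs = zero_shot_classify(image_feat, text_feats)
pred = probs.argmax()
```

In the real model the text embeddings come from encoding prompts such as "a photo of a {label}" with CLIP's text encoder.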
A possible TTA solution for CLIP is Test-Time Prompt Tuning (TPT).
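The core of a TPT step can be sketched as follows: generate many augmented views of the test image, keep only the most confident (lowest-entropy) predictions, and minimise the entropy of their averaged (marginal) distribution with respect to the prompt embeddings. This numpy sketch shows only the confidence selection and loss computation; the backpropagation into the prompt is omitted, and the function name and `rho` default are assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def tpt_confidence_selection(view_probs, rho=0.1):
    """One TPT scoring step: keep the rho fraction of augmented views
    with the lowest predictive entropy, average them into a marginal
    distribution, and return that marginal together with its entropy
    (the loss that TPT minimises w.r.t. the learnable prompt)."""
    ents = entropy(view_probs)
    k = max(1, int(len(view_probs) * rho))
    keep = np.argsort(ents)[:k]           # most confident views
    marginal = view_probs[keep].mean(axis=0)
    return marginal, entropy(marginal)

# 64 augmented views over 10 classes (each row is a probability vector).
rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 10))
view_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
marginal, loss = tpt_confidence_selection(view_probs)
```

Because the loss only depends on the prompt (through the text embeddings), a single gradient step on it adapts the prompt per test sample without touching the model weights.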
For the most part we focused on finding better alternatives to the image augmentation methods proposed in TPT:
- PreAugment
- AugMix
- AutoAugment
- DiffusionAugment
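As a baseline reference for the list above, this is a minimal numpy sketch of PreAugment-style views: random crops plus horizontal flips, i.e. the simple random-crop augmentation TPT starts from. The helper name, output size, and nearest-neighbour resize are illustrative assumptions (a real pipeline would use proper interpolation).

```python
import random
import numpy as np

def preaugment(image, n_views=8, out_size=64, crop_frac=(0.5, 1.0)):
    """Generate n_views random crop + flip views of an (H, W, C) image.

    Each view takes a random sub-window covering 50-100% of the image,
    optionally flips it horizontally, and resizes it to out_size x
    out_size with nearest-neighbour sampling for brevity."""
    h, w = image.shape[:2]
    views = []
    for _ in range(n_views):
        frac = random.uniform(*crop_frac)
        ch, cw = int(h * frac), int(w * frac)
        top = random.randint(0, h - ch)
        left = random.randint(0, w - cw)
        crop = image[top:top + ch, left:left + cw]
        if random.random() < 0.5:
            crop = crop[:, ::-1]  # horizontal flip
        # nearest-neighbour resize to a fixed output size
        ys = np.linspace(0, ch - 1, out_size).astype(int)
        xs = np.linspace(0, cw - 1, out_size).astype(int)
        views.append(crop[np.ix_(ys, xs)])
    return views

views = preaugment(np.zeros((128, 128, 3), dtype=np.uint8))
```

AugMix, AutoAugment, and DiffusionAugment replace this step with progressively stronger (learned or generative) transformations while the rest of the TPT loop stays the same.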
N.B. Testing on ImageNet-A, we scored:
| Augmentation Technique | Avg Accuracy (%) |
|---|---|
| PreAugment | 27.51 |
| AugMix | 28.80 |
| AutoAugment | 30.36 |
| DiffusionAugment | notebook |
We introduce our approach for augmenting prompts using an image captioning system.
This method aims to create more context-aware prompts than standard, generic descriptions like "a photo of a {label}". Our hypothesis is that captions specifically tailored to the content of the image will enhance the alignment between the image and the class labels, leading to improved model performance.
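The prompt-construction step described above can be sketched in a few lines: a captioning model describes the test image, and that caption is appended to each class template before encoding with CLIP's text encoder. The function name and the exact template are illustrative assumptions, not the project's actual format.

```python
def caption_prompts(caption, labels, template="a photo of a {label}, {caption}"):
    """Build one context-aware prompt per class label by combining the
    generic class template with a caption produced by an image
    captioning model (hypothetical template format)."""
    return [template.format(label=label, caption=caption) for label in labels]

prompts = caption_prompts("a dog running on the beach",
                          ["dog", "cat", "bird"])
```

Each resulting prompt is then encoded to a text embedding and scored against the image embedding exactly as in standard zero-shot CLIP classification.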
Accuracy on CLIP (CLIP-RN50):
| Method | Avg Loss | Avg Accuracy (%) |
|---|---|---|
| Our Method | 3.0781 | 19.41 |
| Baseline | - | 21.83 |
Accuracy on CLIP (CLIP-ViT-B/16):
| Method | Avg Loss | Avg Accuracy (%) |
|---|---|---|
| Our Method | 2.5711 | 42.13 |
| Baseline | - | 47.87 |
Results are a bit underwhelming, but there's much room for improvement! Read the notebook for better insight into our methodology.