AfricaNLP resources

List of all the resources we developed in collaboration with LSV and Masakhane during my doctoral studies and beyond

Labelled Datasets for AfricaNLP

| Dataset Name | NLP Task | Link to Publication | Languages covered |
|---|---|---|---|
| MasakhaNER | named entity recognition | MasakhaNER: Named Entity Recognition for African Languages | amh, hau, ibo, kin, lug, luo, pcm, swa, wol, yor |
| MAFAND-MT | machine translation | A Few Thousand Translations Go a Long Way | amh, bam, bbj, ewe, fon, hau, ibo, kin, lug, luo, mos, nya, pcm, sna, swa, tsn, twi, wol, xho, yor, zul |
| ANTC | news-topic classification | multilingual adaptive fine-tuning (MAFT) | lin, pcm, mlg, som, zul |
| MENYO-20K | machine translation | MENYO-20k: A Multi-domain English–Yoruba Corpus for Machine Translation | yor |
| NaijaSenti | sentiment classification | NaijaSenti: A Nigerian Twitter Sentiment Corpus | hau, ibo, pcm, yor |
| Hausa and Yoruba News Topic | news-topic classification | Transfer Learning and Distant Supervision for Multilingual Transformer Models | hau, yor |
| Hausa VOA NER | named entity recognition | Transfer Learning and Distant Supervision for Multilingual Transformer Models | hau |
| Yoruba GV NER | named entity recognition | Massive vs. Curated Word Embeddings for Low-Resourced Languages | yor |
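
As a quick-start sketch, the NER data can be loaded with the Hugging Face `datasets` library. This assumes the corpus is published on the Hub under the `masakhaner` dataset ID with per-language configurations (here `yor` for Yoruba); check the Hub for the exact ID of each dataset.

```python
# Minimal sketch: loading one MasakhaNER language split with Hugging Face `datasets`.
# The dataset ID "masakhaner" and the "yor" configuration are assumptions; verify
# the exact identifiers on the Hugging Face Hub.
from datasets import load_dataset

ner_data = load_dataset("masakhaner", "yor")

# Each example is a tokenised sentence with integer NER tags.
example = ner_data["train"][0]
tag_names = ner_data["train"].features["ner_tags"].feature.names
print(example["tokens"])
print([tag_names[t] for t in example["ner_tags"]])
```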

Unlabelled Corpus for AfricaNLP

Multilingual Pre-trained Language Models

The models below were created by applying multilingual adaptive fine-tuning (MAFT) to a distilled XLM-R model, XLM-R, mT5, ByT5, and mBART. We list each model's name, size (in millions of parameters), and architecture; a usage sketch follows the table. The models cover the following 20 languages: afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, run, sna, som, sot, swa, xho, yor, zul.

| Model | Size | Architecture |
|---|---|---|
| AfroXLMR-mini | 117M | Masked LM |
| AfroXLMR-small | 140M | Masked LM |
| AfroXLMR-base | 270M | Masked LM |
| AfroXLMR-large | 550M | Masked LM |
| AfriMT5 | 580M | Seq-to-Seq |
| AfriByT5 | 580M | Seq-to-Seq |
| AfriMBART | 610M | Seq-to-Seq |
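
As a usage sketch, the masked-LM models can be loaded with Hugging Face `transformers`. The checkpoint ID `Davlan/afro-xlmr-base` below is an assumption (the table does not list Hub IDs); substitute the checkpoint you actually want.

```python
# Minimal sketch: running one of the MAFT models as a masked language model.
# "Davlan/afro-xlmr-base" is an assumed Hub ID; replace it with the checkpoint you need.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "Davlan/afro-xlmr-base"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# XLM-R-style models use the <mask> token; a simple Yoruba test sentence.
print(fill_mask("Abuja ni olu-ilu <mask>."))
```

The Seq-to-Seq models (AfriMT5, AfriByT5, AfriMBART) load the same way via `AutoModelForSeq2SeqLM`.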

Language Adaptive Fine-tuning (LAFT) Models

The following PLMs were created by language adaptive fine-tuning (LAFT), i.e. adapting an existing pre-trained model to a single language using a monolingual corpus in that language. The monolingual corpora used to create them are described in the MasakhaNER and MAFT papers; a loading example follows the table.

| Language | mBERT | XLM-R-base | XLM-R-large |
|---|---|---|---|
| amh | Davlan/bert-base-multilingual-cased-finetuned-amharic | Davlan/xlm-roberta-base-finetuned-amharic | |
| hau | Davlan/bert-base-multilingual-cased-finetuned-hausa | Davlan/xlm-roberta-base-finetuned-hausa | |
| ibo | Davlan/bert-base-multilingual-cased-finetuned-igbo | Davlan/xlm-roberta-base-finetuned-igbo | |
| kin | Davlan/bert-base-multilingual-cased-finetuned-kinyarwanda | Davlan/xlm-roberta-base-finetuned-kinyarwanda | |
| lin | | Davlan/xlm-roberta-base-finetuned-lingala | |
| lug | Davlan/bert-base-multilingual-cased-finetuned-luganda | Davlan/xlm-roberta-base-finetuned-luganda | |
| luo | Davlan/bert-base-multilingual-cased-finetuned-luo | Davlan/xlm-roberta-base-finetuned-luo | |
| mlg | | | |
| nya | | Davlan/xlm-roberta-base-finetuned-chichewa | |
| pcm | Davlan/bert-base-multilingual-cased-finetuned-naija | Davlan/xlm-roberta-base-finetuned-naija | |
| sna | | Davlan/xlm-roberta-base-finetuned-shona | |
| som | | Davlan/xlm-roberta-base-finetuned-somali | |
| swa | Davlan/bert-base-multilingual-cased-finetuned-swahili | Davlan/xlm-roberta-base-finetuned-swahili | |
| wol | Davlan/bert-base-multilingual-cased-finetuned-wolof | Davlan/xlm-roberta-base-finetuned-wolof | |
| xho | | Davlan/xlm-roberta-base-finetuned-xhosa | |
| yor | Davlan/bert-base-multilingual-cased-finetuned-yoruba | Davlan/xlm-roberta-base-finetuned-yoruba | |
| zul | | Davlan/xlm-roberta-base-finetuned-zulu | |
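
A minimal loading sketch for the LAFT checkpoints above, using the Hausa XLM-R model; any other ID from the table works the same way.

```python
# Minimal sketch: querying one of the LAFT checkpoints listed above with the
# fill-mask pipeline. The input is a short Hausa phrase with one masked word.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Davlan/xlm-roberta-base-finetuned-hausa")

for prediction in fill_mask("Shugaban <mask> Muhammadu Buhari"):
    print(prediction["token_str"], round(prediction["score"], 3))
```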

FastText Embeddings for African languages

We provide word embeddings of better quality than the pre-trained FastText embeddings trained on Common Crawl and Wikipedia. While we did not evaluate the quality for all the languages, our evaluation on Yoruba and Twi shows that they give better performance on word-similarity tasks. The FastText embeddings are trained on curated data from JW300, the Bible, VOA, BBC, and other news websites. Details of the data sources are in my PhD dissertation.

We trained the FastText embeddings with Gensim 3.8.1. All embedding models can be downloaded from Zenodo; the links are listed in the table below.
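
A loading sketch, assuming the Zenodo archives contain Gensim-native FastText models (they were trained with Gensim 3.8.1); the file name below is a placeholder.

```python
# Minimal sketch: querying one of the downloaded embeddings with Gensim.
# "yoruba_fasttext.model" is a placeholder path, and the Gensim-native format is an
# assumption; if the file is in Facebook's .bin format, use
# gensim.models.fasttext.load_facebook_model instead.
from gensim.models import FastText

model = FastText.load("yoruba_fasttext.model")  # placeholder path

# Nearest-neighbour and similarity queries on the trained vectors (Yoruba words).
print(model.wv.most_similar("ilu", topn=5))
print(model.wv.similarity("ilu", "orile-ede"))
```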

| Language | Link to Model |
|---|---|
| amh | Amharic FastText |
| bam | Bambara FastText |
| bbj | Ghomala FastText |
| ewe | Ewe FastText |
| fon | Fon FastText |
| hau | Hausa FastText |
| ibo | Igbo FastText |
| kin | Kinyarwanda FastText |
| lug | Luganda FastText |
| luo | Luo FastText |
| mos | Mossi FastText |
| nya | Chichewa FastText |
| pcm | Nigerian-Pidgin FastText |
| sna | Shona FastText |
| swa | Swahili FastText |
| tsn | Setswana FastText |
| twi | Twi FastText |
| wol | Wolof FastText |
| xho | Xhosa FastText |
| yor | Yoruba FastText |
| zul | Zulu FastText |
