Official implementation of
- Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map
by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
This code was written mainly with reference to the VoxCeleb_trainer from the paper 'In defence of metric learning for speaker recognition'.
For effective extraction of speaker information from various utterances, we propose decomposed temporal dynamic convolution, which applies matrix-decomposed adaptive convolution kernels that vary over time bins.
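As an illustration of the idea, a minimal sketch of a time-adaptive convolution in PyTorch is shown below. All class and variable names here are hypothetical, and this is not the authors' exact implementation: each time bin mixes K kernel bases with attention weights predicted from the input, which (by linearity of convolution) is equivalent to applying a per-time-bin adaptive kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDynamicConv1d(nn.Module):
    """Sketch of a time-adaptive convolution (illustrative, not the paper's
    exact decomposition): per-time-bin attention mixes K kernel bases."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_bases=4):
        super().__init__()
        # K kernel bases, shape (K, C_out, C_in, kernel_size)
        self.bases = nn.Parameter(
            torch.randn(num_bases, out_ch, in_ch, kernel_size) * 0.02)
        # lightweight 1x1 conv predicts mixing weights per time bin
        self.attn = nn.Conv1d(in_ch, num_bases, kernel_size=1)

    def forward(self, x):                        # x: (B, C_in, T)
        pi = F.softmax(self.attn(x), dim=1)      # (B, K, T) mixing weights
        K, Co, Ci, ks = self.bases.shape
        # convolve with every basis at once, then mix outputs per time bin;
        # equivalent to convolving with the mixed kernel at each time bin
        y = F.conv1d(x, self.bases.view(K * Co, Ci, ks), padding=ks // 2)
        B, _, T = y.shape
        y = y.view(B, K, Co, T)
        return (pi.unsqueeze(2) * y).sum(dim=1)  # (B, C_out, T)
```

The actual model additionally decomposes the adaptive kernel via matrix decomposition to keep the parameter count low; see the paper for the exact formulation.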
Python 3.7.10 is used with the following libraries:
- pytorch == 1.8.1
- torchaudio == 0.8.1
- numpy == 1.19.2
- scipy == 1.5.3
- scikit-learn == 0.23.2
We used the VoxCeleb1 & 2 datasets in this paper. You can download the datasets by referring to VoxCeleb1 and VoxCeleb2.
You can train the model and save it in the `exps` folder by running:

```
python trainSpeakerNet.py --model DTDY_ResNet34_half --encoder_type ASP --save_path exps/DTDY_CNN_ResNet34
```
This implementation also supports accelerated training via distributed training and mixed precision training.
- Use the `--distributed` flag to enable distributed training and the `--mixedprec` flag to enable mixed precision training.
- GPU indices should be set before training: `os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'` in `trainSpeakerNet.py`.
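For reference, mixed precision training in the PyTorch 1.8 series is typically wired up with `torch.cuda.amp`; the sketch below shows the general pattern (the model, optimizer, and data here are stand-ins, not the repository's actual training loop).

```python
import os
import torch
import torch.nn.functional as F

# set GPU indices before any CUDA initialization
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

use_cuda = torch.cuda.is_available()
model = torch.nn.Linear(40, 2)          # stand-in for the speaker network
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for x, y in [(torch.randn(8, 40), torch.randint(0, 2, (8,)))]:
    opt.zero_grad()
    # forward pass runs in fp16 where safe when CUDA is available
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()       # scale loss to avoid fp16 underflow
    scaler.step(opt)                    # unscale gradients, then step
    scaler.update()
```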
| Network | # Params | EER (%) | C_det |
|---|---|---|---|
| DTDY-ResNet-34(×0.25) | 3.29M | 1.59 | 0.130 |
| DTDY-ResNet-34(×0.50) | 12.0M | 1.37 | 0.103 |
| DTDY-ResNet-34(×0.50)+ASP | 13.6M | 0.96 | 0.086 |
```
@article{kim2022dtdycnn,
  title={Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2203.15277},
  year={2022}
}
```
Please contact Seong-Hu Kim at seonghu.kim@kaist.ac.kr for any query.