
DTDY-CNN for Text-Independent Speaker Verification

Official implementation of

  • Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map
    by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
    arXiv

This code was written mainly with reference to the VoxCeleb_trainer of the paper 'In Defence of Metric Learning for Speaker Recognition'.

Decomposed Temporal Dynamic Convolution

To extract speaker information effectively from diverse utterances, we propose decomposed temporal dynamic convolution, which applies a matrix-decomposed adaptive convolution that varies across time bins:

$$y(f,t) = W(t) * x(f,t)$$

$$W(t) = W_{0} + P\,\Phi(t)\,Q^{T}$$

where $x$ and $y$ are the input and output of the DTDY-CNN module, which depend on the frequency bin $f$ and time bin $t$ of time-frequency domain data. $W(t)$ is composed of a static kernel $W_{0}$ and a dynamic residual $P \Phi(t) Q^{T}$. The temporal dynamic matrix $\Phi(t) \in \mathbb{R}^{L \times L}$ is a linear transformation in the $L$-dimensional latent space, with a different linear transformation for each time bin $t$.
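The following PyTorch sketch illustrates the decomposition for a 1×1 kernel. The module name, the latent_dim argument, and the phi_net branch that predicts $\Phi(t)$ from the frequency-averaged input are hypothetical stand-ins for illustration, not code taken from this repository, which applies the same idea inside its ResNet convolution blocks.

import torch
import torch.nn as nn

class DTDYPointwiseConv(nn.Module):
    """Minimal sketch of decomposed temporal dynamic convolution (1x1 kernel).

    W(t) = W0 + P @ Phi(t) @ Q^T is built for every time bin and applied to
    all frequency bins of that time bin.
    """
    def __init__(self, in_ch, out_ch, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim
        self.W0 = nn.Parameter(torch.empty(out_ch, in_ch))   # static kernel
        self.P = nn.Parameter(torch.empty(out_ch, latent_dim))
        self.Q = nn.Parameter(torch.empty(in_ch, latent_dim))
        for w in (self.W0, self.P, self.Q):
            nn.init.kaiming_uniform_(w)
        # Hypothetical branch predicting the L x L matrix Phi(t) per time bin
        # from the frequency-averaged input (a stand-in for the attention
        # mechanism used in the paper).
        self.phi_net = nn.Conv1d(in_ch, latent_dim * latent_dim, kernel_size=1)

    def forward(self, x):                               # x: (B, C_in, F, T)
        B, _, _, T = x.shape
        L = self.latent_dim
        phi = self.phi_net(x.mean(dim=2))               # (B, L*L, T)
        phi = phi.permute(0, 2, 1).reshape(B, T, L, L)  # Phi(t): (B, T, L, L)
        # Dynamic residual P Phi(t) Q^T: (B, T, C_out, C_in)
        residual = torch.einsum('ol,btlm,cm->btoc', self.P, phi, self.Q)
        W = self.W0 + residual                          # time-adaptive kernels
        # Apply the kernel of each time bin to every frequency bin
        return torch.einsum('btoc,bcft->boft', W, x)    # (B, C_out, F, T)

For example, DTDYPointwiseConv(64, 128, latent_dim=4) maps a (B, 64, F, T) tensor to (B, 128, F, T) while learning only $L^{2}$ time-dependent values per time bin, which is where the efficiency of the decomposition comes from.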

Requirements and versions used

Python 3.7.10 is used with the following libraries:

  • pytorch == 1.8.1
  • torchaudio == 0.8.1
  • numpy == 1.19.2
  • scipy == 1.5.3
  • scikit-learn == 0.23.2
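
For a pip-based environment, the pins above correspond to an install command like the following (note that the PyPI package names are torch and torchaudio; the exact command is a suggestion, not taken from the repository):

pip install torch==1.8.1 torchaudio==0.8.1 numpy==1.19.2 scipy==1.5.3 scikit-learn==0.23.2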

Dataset

We used the VoxCeleb1 & 2 datasets in this paper. You can download the datasets by referring to VoxCeleb1 and VoxCeleb2.

Training

You can train the model and save it in the exps folder by running:

python trainSpeakerNet.py --model DTDY_ResNet34_half --encoder_type ASP --save_path exps/DTDY_CNN_ResNet34

This implementation also supports accelerated training via distributed training and mixed precision training.

  • Use the --distributed flag to enable distributed training and the --mixedprec flag to enable mixed precision training; see the example command after this list.
    • GPU indices should be set before training: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' in trainSpeakerNet.py.
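
For example, a run enabling both options might look like this (the flags are the ones documented above; the combination itself is an illustration, not a command taken verbatim from the repository):

python trainSpeakerNet.py --model DTDY_ResNet34_half --encoder_type ASP --save_path exps/DTDY_CNN_ResNet34 --distributed --mixedprec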

Results

Network                        # Params   EER (%)   C_det
DTDY-ResNet-34 (×0.25)         3.29M      1.59      0.130
DTDY-ResNet-34 (×0.50)         12.0M      1.37      0.103
DTDY-ResNet-34 (×0.50) + ASP   13.6M      0.96      0.086

Citation

@article{kim2022dtdycnn,
  title={Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2203.15277},
  year={2022}
}

Please contact Seong-Hu Kim at seonghu.kim@kaist.ac.kr with any queries.
