Publication Detail

Analysis and Optimization of Bottleneck Features for Speaker Recognition

LOZANO DÍEZ, A. SILNOVA, A. MATĚJKA, P. GLEMBEK, O. PLCHOT, O. PEŠÁN, J. BURGET, L. GONZALEZ-RODRIGUEZ, J.

Original Title

Analysis and Optimization of Bottleneck Features for Speaker Recognition

English Title

Analysis and Optimization of Bottleneck Features for Speaker Recognition

Language

en

Original Abstract

Recently, Deep Neural Network (DNN) based bottleneck features proved to be very effective in i-vector based speaker recognition. However, bottleneck feature extraction is usually fully optimized for the speech rather than the speaker recognition task. In this paper, we explore whether DNNs that are suboptimal for speech recognition can provide better bottleneck features for speaker recognition. We experiment with different features, optimized for either speech or speaker recognition, as input to the DNN. We also experiment with under-trained DNNs, where training was interrupted before full convergence of the speech recognition objective. Moreover, we analyze the effect of normalizing the features at the input and/or at the output of the bottleneck feature extraction to see how it affects the performance of the final speaker recognition system. We evaluated the systems on the SRE10, condition 5, female task. Results show that the best configuration of the DNN in terms of phone accuracy does not necessarily imply better performance of the final speaker recognition system. Finally, we compare the performance of bottleneck features and standard MFCC features in an i-vector/PLDA speaker recognition system. The best bottleneck features yield up to 37% relative improvement in terms of EER.
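The extraction pipeline described above can be illustrated with a minimal sketch: a feed-forward DNN trained for phone classification contains one narrow "bottleneck" layer, and after training, the activations of that layer are taken as frame-level features for the speaker recognition front end. This is a simplified illustration, not the paper's exact architecture; the layer sizes, activation function, and random weights below are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Illustrative layer sizes (hypothetical): 39-dim input frames
# (e.g. MFCCs with deltas), wide hidden layers, an 80-dim bottleneck,
# and a phone-state output layer used only during training.
sizes = [39, 512, 80, 512, 120]  # the bottleneck is layer index 2
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]

def bottleneck_features(frames, bottleneck_idx=2):
    """Forward-propagate frames and return the bottleneck-layer activations."""
    h = frames
    for i, w in enumerate(weights, start=1):
        h = relu(h @ w)
        if i == bottleneck_idx:
            # Stop here: layers past the bottleneck serve only the
            # phone-classification objective, not feature extraction.
            return h
    return h

frames = rng.standard_normal((100, 39))  # 100 speech frames
feats = bottleneck_features(frames)
print(feats.shape)  # (100, 80): one 80-dim bottleneck vector per frame
```

In a real system these features would then feed i-vector extraction and PLDA scoring; the paper's point is that the network producing them need not be fully optimized for phone accuracy.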

English Abstract

Recently, Deep Neural Network (DNN) based bottleneck features proved to be very effective in i-vector based speaker recognition. However, bottleneck feature extraction is usually fully optimized for the speech rather than the speaker recognition task. In this paper, we explore whether DNNs that are suboptimal for speech recognition can provide better bottleneck features for speaker recognition. We experiment with different features, optimized for either speech or speaker recognition, as input to the DNN. We also experiment with under-trained DNNs, where training was interrupted before full convergence of the speech recognition objective. Moreover, we analyze the effect of normalizing the features at the input and/or at the output of the bottleneck feature extraction to see how it affects the performance of the final speaker recognition system. We evaluated the systems on the SRE10, condition 5, female task. Results show that the best configuration of the DNN in terms of phone accuracy does not necessarily imply better performance of the final speaker recognition system. Finally, we compare the performance of bottleneck features and standard MFCC features in an i-vector/PLDA speaker recognition system. The best bottleneck features yield up to 37% relative improvement in terms of EER.

Documents

BibTeX


@inproceedings{BUT131002,
  author="Alicia {Lozano Díez} and Anna {Silnova} and Pavel {Matějka} and Ondřej {Glembek} and Oldřich {Plchot} and Jan {Pešán} and Lukáš {Burget} and Joaquin {Gonzalez-Rodriguez}",
  title="Analysis and Optimization of Bottleneck Features for Speaker Recognition",
  annote="Recently, Deep Neural Network (DNN) based bottleneck features proved to be
very effective in i-vector based speaker recognition. However, bottleneck
feature extraction is usually fully optimized for the speech rather than the
speaker recognition task. In this paper, we explore whether DNNs that are
suboptimal for speech recognition can provide better bottleneck features for
speaker recognition. We experiment with different features, optimized for
either speech or speaker recognition, as input to the DNN. We also experiment
with under-trained DNNs, where training was interrupted before full convergence
of the speech recognition objective. Moreover, we analyze the effect of
normalizing the features at the input and/or at the output of the bottleneck
feature extraction to see how it affects the performance of the final speaker
recognition system. We evaluated the systems on the SRE10, condition 5, female
task. Results show that the best configuration of the DNN in terms of phone
accuracy does not necessarily imply better performance of the final speaker
recognition system. Finally, we compare the performance of bottleneck features
and standard MFCC features in an i-vector/PLDA speaker recognition system. The
best bottleneck features yield up to 37% relative improvement in terms of EER.",
  address="International Speech Communication Association",
  booktitle="Proceedings of Odyssey 2016",
  chapter="131002",
  doi="10.21437/Odyssey.2016-51",
  howpublished="online",
  institution="International Speech Communication Association",
  number="06",
  year="2016",
  month="June",
  pages="352--357",
  publisher="International Speech Communication Association",
  type="conference paper"
}