Abstract
Voice biometric authentication is gaining adoption in organisations that perform high-volume identity verification and control access to physical and virtual spaces. In this form of authentication, a user's identity is verified with their voice. However, these systems are susceptible to voice spoofing: malicious actors employ attacks such as speech synthesis, voice conversion or imitation, and recorded replays to fool the Automatic Speaker Verification (ASV) system or to conduct spam communications. In this work, we frame voice spoofing countermeasures both as a binary classification problem that distinguishes real from fake audio, and as a multiclass classification problem that detects voice conversion, synthesis, and replay attacks. We investigated numerous audio features and examined each feature's capability alongside state-of-the-art deep learning algorithms, including convolutional neural networks (CNN), WaveNet, and recurrent neural network variants — Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) models. Using a large dataset of 419,426 audio files, we evaluated the deep learning models for their effectiveness against voice spoofing attacks. The binary-class CNN achieved a false positive rate (FPR) of 0.0216, while the multiclass solutions using CNN, WaveNet, LSTMs, and GRUs achieved FPRs of 0.003, 0.0260, 0.0302, and 0.0358, respectively. We extended the evaluation by including real-time classification of microphone-captured and user-uploaded audio to demonstrate the models' practical implications and deployability.
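The abstract describes classifying audio features with deep learning models; the paper's exact feature set is not specified here, but a typical first stage in such a pipeline is converting the waveform into a log-magnitude spectrogram that a CNN can consume. A minimal numpy-only sketch (the framing, windowing, and hop parameters below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def log_spectrogram(wave, frame_len=512, hop=256):
    """Frame the waveform, apply a Hann window, and take the
    log-magnitude of the real FFT of each frame.
    Illustrative CNN input feature; parameters are assumed."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # log1p keeps values non-negative and compresses dynamic range
    return np.log1p(mag)  # shape: (n_frames, frame_len // 2 + 1)

# One second of synthetic audio at 16 kHz stands in for a real utterance.
rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)
feat = log_spectrogram(wave)
print(feat.shape)  # (61, 257)
```

A real/fake (binary) or attack-type (multiclass) classifier would then be trained on batches of such time-frequency matrices, with the output layer sized to the number of classes.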
Original language | English |
---|---|
Article number | 100503 |
Number of pages | 16 |
Journal | Machine Learning with Applications |
Volume | 14 |
Early online date | 13 Oct 2023 |
DOIs | |
Publication status | Published - 15 Dec 2023 |