Speech Enhancement Based on Full-Sentence Correlation and Clean Speech Recognition

Ming Ji, Daniel Crookes

Research output: Contribution to journal, Article, peer-review

33 Citations (Scopus)
630 Downloads (Pure)

Abstract

Conventional speech enhancement methods, based on frame, multi-frame or segment estimation, require knowledge about the noise. This paper presents a new method which aims to reduce or effectively remove this requirement. It is shown that, by using the Zero-mean Normalized Correlation Coefficient (ZNCC) as the comparison measure, and by extending the effective length of speech segment matching to sentence-long speech utterances, it is possible to obtain an accurate speech estimate from noise without requiring specific knowledge about the noise. The new method, thus, could be used to deal with unpredictable noise or noise without proper training data. This paper is focused on realizing and evaluating this potential. We propose a novel realization that integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem. Then we propose an efficient implementation algorithm to solve this constrained maximization problem, to produce speech sentence estimates. For evaluation, we build the new system on one training data set and test it on two different test data sets across two databases, for a range of different noises including highly nonstationary ones. It is shown that the new approach, without any estimation of the noise, is able to significantly outperform conventional methods which use optimized noise tracking, in terms of various objective measures including automatic speech recognition.
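
The comparison measure named in the abstract, ZNCC, can be illustrated with a minimal sketch. This shows only the standard textbook definition of the coefficient on two equal-length sequences; it is not the paper's full-sentence matching or its constrained maximization. The function name `zncc` and the toy values are assumptions made for illustration.

```python
import math


def zncc(x, y):
    """Zero-mean normalized correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; the standard ZNCC definition,
    not the paper's implementation.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Remove the mean from each sequence (the "zero-mean" step).
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    # Normalize by the product of the two energies.
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant sequence has zero energy after mean removal; return 0.0 by convention.
    return numerator / denominator if denominator > 0 else 0.0


# Toy values (made up): a sequence, and a scaled and shifted copy of it.
reference = [0.1, 0.4, -0.3, 0.8, -0.5, 0.2]
scaled_shifted = [3.0 * v + 5.0 for v in reference]
score = zncc(reference, scaled_shifted)
```

Because the mean is removed and the result is normalized by energy, the coefficient is unchanged by a positive gain or a constant offset applied to either input, so `score` here is 1 up to floating-point rounding.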
Original language: English
Pages (from-to): 531-543
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 25
Issue number: 3
Early online date: 11 Jan 2017
Publication status: Published - Mar 2017
