Abstract
Segmental exemplar-based speech enhancement algorithms are promising for real-world applications because they function well without noise training and therefore generalize easily. They are, however, typically slow, because they must search a corpus large enough to be representative. A second key issue is the storage size of such a corpus. This thesis explores two adaptations that address these issues in order to meet the requirements of real-time enhancement on lower-powered devices.
Firstly, in Chapter 3, the nature of speech is exploited to impose a hierarchical structure on the clean speech corpus, facilitating a tree-based search of the corpus. The concept is demonstrated with a baseline test system that matches fixed-length test segments with clean speech corpus segments, which are then used to synthesize the output speech. The approach is evaluated against a linear search of the corpus, and the algorithm is found to function 20x faster than real time. This is because the search cost for a corpus of n segments is dramatically reduced, from O(n) to O(log(n)).
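The reduction from O(n) to O(log(n)) can be illustrated with a minimal sketch. It is not the hierarchy used in the thesis: it assumes segments are plain fixed-length feature vectors, builds a binary tree by median splits along the widest dimension, and descends greedily towards the nearest child centroid. The names `build_tree`, `tree_search` and `linear_search` are hypothetical, and the greedy descent may return an approximate rather than exact nearest segment.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_tree(segments, indices=None, leaf_size=1):
    """Recursively halve the corpus (assumed splitting rule: median of widest dimension)."""
    if indices is None:
        indices = list(range(len(segments)))
    node = {
        "centroid": centroid([segments[i] for i in indices]),
        "indices": indices,
        "children": None,
    }
    if len(indices) > leaf_size:
        dims = range(len(segments[0]))
        widest = max(
            dims,
            key=lambda d: max(segments[i][d] for i in indices)
            - min(segments[i][d] for i in indices),
        )
        ordered = sorted(indices, key=lambda i: segments[i][widest])
        mid = len(ordered) // 2
        node["children"] = [
            build_tree(segments, ordered[:mid], leaf_size),
            build_tree(segments, ordered[mid:], leaf_size),
        ]
    return node


def tree_search(tree, segments, query):
    """Greedy descent: returns (segment index, number of distance computations)."""
    comparisons = 0
    node = tree
    while node["children"] is not None:
        comparisons += len(node["children"])
        node = min(node["children"], key=lambda c: dist2(c["centroid"], query))
    comparisons += len(node["indices"])
    best = min(node["indices"], key=lambda i: dist2(segments[i], query))
    return best, comparisons


def linear_search(segments, query):
    """Exhaustive baseline: one distance computation per corpus segment."""
    best = min(range(len(segments)), key=lambda i: dist2(segments[i], query))
    return best, len(segments)


if __name__ == "__main__":
    random.seed(0)
    corpus = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1024)]
    query = [random.gauss(0, 1) for _ in range(8)]
    print("tree comparisons:", tree_search(build_tree(corpus), corpus, query)[1])
    print("linear comparisons:", linear_search(corpus, query)[1])
```

For a corpus of 1024 synthetic segments, the tree search performs two distance computations at each of ten levels plus one at the leaf, against 1024 for the linear search. The price of the greedy descent is that the match can be slightly worse than the exhaustive one.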
Secondly, Chapter 4 considers the second key issue: the corpus is too large for many low-powered devices. Clustering can be used to obtain a lossy compression of the speech corpus by replacing the original segments with codewords. Several different means of clustering are evaluated for this application. It is shown that this yields a corpus a tenth of the size of the original while maintaining the representativeness of the corpus. How compression and tree-based search function together is then explored, along with a technique aimed at reducing the loss in quality when both are used together.
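As a rough illustration of replacing segments with codewords, the sketch below uses plain k-means (Lloyd's algorithm). This is an assumption made for the example: the abstract does not say which clustering methods were evaluated, and `kmeans_codebook` and `quantize` are hypothetical names. Segments are again assumed to be fixed-length feature vectors.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_codebook(segments, k, iterations=10, seed=0):
    """Return k codewords summarising the corpus (lossy compression)."""
    rng = random.Random(seed)
    # Initialise the codebook with k distinct corpus segments.
    codebook = [list(s) for s in rng.sample(segments, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each segment joins its nearest codeword.
        for seg in segments:
            nearest = min(range(k), key=lambda j: dist2(codebook[j], seg))
            clusters[nearest].append(seg)
        # Update step: each codeword becomes the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous codeword
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def quantize(segment, codebook):
    """Index of the codeword that replaces the given segment."""
    return min(range(len(codebook)), key=lambda j: dist2(codebook[j], segment))


if __name__ == "__main__":
    random.seed(0)
    corpus = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
    codebook = kmeans_codebook(corpus, k=len(corpus) // 10)
    print("corpus size:", len(corpus), "codebook size:", len(codebook))
```

Choosing `k` as a tenth of the number of segments mirrors the tenfold size reduction reported above. Whether representativeness is preserved at that ratio depends on the data and the clustering method, which this sketch does not attempt to reproduce.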
A third study, in Chapter 5, attempts to make use of the lingua-acoustic data present in a sentence to improve the accuracy of automatic speech recognition (ASR) rather than the perceived speech quality. The basic approach leads to an improvement of up to 40% in speech recognition accuracy. Applying the long-term lingua-acoustic data yielded a small but noticeable further improvement of around 2% overall. Some of the lingua-acoustic constraints used to exploit this data contributed to the accuracy improvement, while others did not contribute at all. These results are obtained with processing time below real time on a non-streaming basis, which suggests that the system could be useful for an ASR application that tolerates a small amount of latency.
Finally, future directions for the work are identified.
|Date of Award||Jul 2020|
|Sponsors||Northern Ireland Department for the Economy|
|Supervisor||Ming Ji (Supervisor), Daniel Crookes (Supervisor) & Niall McLaughlin (Supervisor)|
- speech enhancement