TY - GEN
T1 - Inferring Emphasis for Real Voice Data: An Attentive Multimodal Neural Network Approach
AU - Zhou, Suping
AU - Jia, Jia
AU - Zhang, Long
AU - Wang, Yanfeng
AU - Chen, Wei
AU - Meng, Fanbo
AU - Yu, Fei
AU - Shen, Jialie
PY - 2020
Y1 - 2020
N2 - To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries may play an important role. However, VDAs must serve a tremendous number of unknown speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, besides acoustic features, extensive textual features are first applied in modelling. Then, considering the independence of these features, we model the multimodal features with a multi-path convolutional neural network (MCNN). Furthermore, combining the high-level multimodal features, we train an emphasis classifier that attends to the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), so as to comprehensively learn discriminative features from diverse users. An experimental study on a real-world dataset collected from Sogou Voice Assistant (https://yy.sogou.com/) shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1 measure.
AB - To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries may play an important role. However, VDAs must serve a tremendous number of unknown speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, besides acoustic features, extensive textual features are first applied in modelling. Then, considering the independence of these features, we model the multimodal features with a multi-path convolutional neural network (MCNN). Furthermore, combining the high-level multimodal features, we train an emphasis classifier that attends to the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), so as to comprehensively learn discriminative features from diverse users. An experimental study on a real-world dataset collected from Sogou Voice Assistant (https://yy.sogou.com/) shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1 measure.
KW - Attention
KW - Emphasis detection
KW - Voice dialogue applications
U2 - 10.1007/978-3-030-37734-2_5
DO - 10.1007/978-3-030-37734-2_5
M3 - Conference contribution
AN - SCOPUS:85080950585
SN - 9783030377335
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 52
EP - 62
BT - MultiMedia Modeling - 26th International Conference, MMM 2020, Proceedings
A2 - Ro, Yong Man
A2 - Kim, Junmo
A2 - Choi, Jung-Woo
A2 - Cheng, Wen-Huang
A2 - Chu, Wei-Ta
A2 - Cui, Peng
A2 - Hu, Min-Chun
A2 - De Neve, Wesley
PB - Springer
T2 - 26th International Conference on MultiMedia Modeling, MMM 2020
Y2 - 5 January 2020 through 8 January 2020
ER -