Inferring Emphasis for Real Voice Data: An Attentive Multimodal Neural Network Approach

Suping Zhou, Jia Jia*, Long Zhang, Yanfeng Wang, Wei Chen, Fanbo Meng, Fei Yu, Jialie Shen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)


To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries may play an important role. However, VDAs serve a tremendous number of uncertain speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, first, besides the acoustic features, extensive textual features are applied in modelling. Then, considering the independence of the features, we model the multi-modal features using a multi-path convolutional neural network (MCNN). Furthermore, combining high-level multi-modal features, we train an emphasis classifier by attending on the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), to comprehensively learn discriminative features from diverse users. Our experimental study, based on a real-world dataset collected from Sogou Voice Assistant, shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1 measure.
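The attention step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact attention formula, so the dot-product scoring below, the function names `softmax` and `attention_pool`, and the toy numbers are all assumptions made for illustration; this is not the authors' implementation. The idea shown is only the general one: each time step of a sequence encoder's output is weighted by its similarity to a text-derived query, and the weighted sum serves as the input to a classifier.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Weight each time step by its dot product with `query` (an assumed
    scoring function) and return (weights, weighted sum of hidden states)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

# Toy input: three time steps of 2-dimensional encoder outputs (made-up numbers).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical query vector standing in for a summary of the textual features.
text_query = [0.0, 2.0]

weights, context = attention_pool(hidden_states, text_query)
print(weights)   # attention weight per time step, summing to 1
print(context)   # attention-weighted summary vector
```

In this toy input the second and third time steps align equally well with the query, so they receive equal weight and dominate the pooled vector, while the first step contributes little.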

Original language: English
Title of host publication: MultiMedia Modeling - 26th International Conference, MMM 2020, Proceedings
Editors: Yong Man Ro, Junmo Kim, Jung-Woo Choi, Wen-Huang Cheng, Wei-Ta Chu, Peng Cui, Min-Chun Hu, Wesley De Neve
Number of pages: 11
ISBN (Print): 9783030377335
Publication status: Published - 2020
Event: 26th International Conference on MultiMedia Modeling, MMM 2020 - Daejeon, Korea, Republic of
Duration: 05 Jan 2020 to 08 Jan 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11962 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 26th International Conference on MultiMedia Modeling, MMM 2020
Country/Territory: Korea, Republic of


Keywords

  • Attention
  • Emphasis detection
  • Voice dialogue applications

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

