To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries may play an important role. However, VDAs serve a tremendous number of unknown speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis from real voice data, we propose an attentive multimodal neural network. Specifically, we first introduce extensive textual features into the modelling, in addition to acoustic features. Then, considering the independence among the multimodal features, we encode them with a multi-path convolutional neural network (MCNN). Furthermore, combining the resulting high-level multimodal features, we train an emphasis classifier with an attention-based bidirectional long short-term memory network (ABLSTM) that attends to the textual features, so as to comprehensively learn discriminative representations across diverse users. Our experimental study on a real-world dataset collected from Sogou Voice Assistant (https://yy.sogou.com/) shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1 measure.
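To make the pipeline concrete, the following is a minimal PyTorch sketch of the described architecture: one convolutional path per modality (the MCNN), followed by a bidirectional LSTM whose attention weights are derived from the textual features. The feature dimensions, kernel size, hidden size, and the exact attention formulation here are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPathCNN(nn.Module):
    """One 1-D convolutional path per modality, so acoustic and textual
    features are encoded independently before fusion (sketch of the MCNN)."""

    def __init__(self, acoustic_dim, textual_dim, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.acoustic_path = nn.Conv1d(acoustic_dim, channels, kernel_size, padding=pad)
        self.textual_path = nn.Conv1d(textual_dim, channels, kernel_size, padding=pad)

    def forward(self, acoustic, textual):
        # inputs: (batch, time, dim); Conv1d expects (batch, dim, time)
        a = F.relu(self.acoustic_path(acoustic.transpose(1, 2))).transpose(1, 2)
        t = F.relu(self.textual_path(textual.transpose(1, 2))).transpose(1, 2)
        return torch.cat([a, t], dim=-1)  # fused high-level multimodal features


class ABLSTM(nn.Module):
    """Bidirectional LSTM whose attention weights are computed from the
    textual features; emits per-token emphasis logits."""

    def __init__(self, input_dim, textual_dim, hidden=128, num_classes=2):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(textual_dim, 1)  # attention scored from textual features
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, fused, textual):
        h, _ = self.blstm(fused)                        # (batch, time, 2*hidden)
        weights = torch.softmax(self.attn(textual), 1)  # (batch, time, 1), over time
        h = h * weights                                 # reweight hidden states
        return self.classifier(h)                       # (batch, time, num_classes)


# Toy usage with assumed sizes: batch of 2 utterances, 20 frames/tokens,
# 40-d acoustic and 300-d textual features per step.
acoustic = torch.randn(2, 20, 40)
textual = torch.randn(2, 20, 300)
mcnn = MultiPathCNN(acoustic_dim=40, textual_dim=300)
fused = mcnn(acoustic, textual)               # (2, 20, 128)
head = ABLSTM(input_dim=128, textual_dim=300)
logits = head(fused, textual)                 # (2, 20, 2) emphasis logits per token
```

In this sketch, keeping a separate convolutional path per modality reflects the paper's assumption of feature independence, while the textual-feature attention lets the classifier focus on lexically informative tokens before the per-token emphasis decision.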