Deep Head Pose: Gaze-Direction Estimation in Multimodal Video

Sankha S. Mukherjee, Neil Martin Robertson

Research output: Contribution to journal › Article › peer-review

149 Citations (Scopus)


In this paper we present a convolutional neural network (CNN)-based model for human head pose estimation in low-resolution multi-modal RGB-D data. We pose the problem as one of classification of human gazing direction. We further fine-tune a regressor based on the learned deep classifier. Next we combine the two models (classification and regression) to estimate approximate regression confidence. We present state-of-the-art results on datasets that span the range from high-resolution human-robot interaction data (close-up faces plus depth information) to challenging low-resolution outdoor surveillance data. We build upon our robust head-pose estimation and further introduce a new visual attention model to recover interaction with the environment. Using this probabilistic model, we show that many higher-level scene understanding tasks, such as human-human/scene interaction detection, can be achieved. Our solution runs in real-time on commercial hardware.
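The abstract does not say how the classifier and regressor are combined, so the following is only a minimal sketch of one plausible reading: the confidence of a regressed angle is taken to be the probability the classifier assigns to the direction bin containing that angle. The eight 45-degree bins centred on 0, 45, ..., 315 degrees, the softmax over raw logits, and the function names `softmax` and `regression_confidence` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regression_confidence(class_logits, regressed_deg):
    """Return (bin_index, confidence) for a regressed gaze angle in degrees.

    Assumption: the classifier has one output per equal-width direction bin,
    with bin 0 centred on 0 degrees. The confidence is the classifier's
    probability for the bin into which the regressed angle falls.
    """
    probs = softmax(class_logits)
    num_bins = len(probs)
    bin_width = 360.0 / num_bins
    # Wrap the angle into [0, 360) and shift by half a bin so that bins are
    # centred on multiples of bin_width rather than starting at them.
    wrapped = regressed_deg % 360.0
    index = int((wrapped + bin_width / 2.0) // bin_width) % num_bins
    return index, probs[index]


if __name__ == "__main__":
    # Hypothetical classifier scores that strongly favour the bin centred on 90 degrees.
    logits = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(regression_confidence(logits, 92.0))   # regressor agrees with the classifier
    print(regression_confidence(logits, 200.0))  # regressor disagrees, low confidence
```

Under this reading, a regressed angle that lands in a bin the classifier also favours receives a high confidence, while a disagreement between the two models yields a low one.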
Original language: English
Pages (from-to): 2094-2107
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Issue number: 11
Early online date: 28 Sept 2015
Publication status: Published - Nov 2015

Bibliographical note

"This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) Grant number EP/K014277/1, the MOD University Defence Research Collaboration in Signal Processing."


  • Convolutional neural networks (CNNs), deep learning, gaze direction, head-pose, RGB-D

