This paper proposes a discriminative human pose estimation system based on deep learning for monocular video sequences. Our approach combines a simple but efficient Convolutional Neural Network (CNN) that directly regresses the 3D pose with a recurrent denoising autoencoder that refines the pose using the temporal information contained in the sequence of previous frames. The architecture also supports integrated training of both parts to better model the space of activities: noisy but realistic poses produced by the partially trained CNN are used to enhance the training of the autoencoder. The system has been evaluated on two standard datasets, HumanEva-I and Human3.6M, comprising more than 15 different activities. We show that our simple architecture can provide state-of-the-art results.
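As a rough illustration of the integrated training idea, the sketch below shows one way the autoencoder's training pairs could be assembled from the output of a partially trained CNN. The abstract does not specify this procedure, so everything here is an assumption made for illustration: the function name `make_denoising_pairs`, the `cnn_predict` callable, the flattened pose representation, and the window length are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a pose is a flat list of 3D joint coordinates (x, y, z per joint).
Pose = List[float]


def make_denoising_pairs(
    frames: Sequence[object],
    ground_truth: Sequence[Pose],
    cnn_predict: Callable[[object], Pose],
    window: int = 5,
) -> List[Tuple[List[Pose], Pose]]:
    """Pair each window of CNN-predicted (noisy) poses with the clean
    ground-truth pose of the window's last frame.

    `cnn_predict` stands in for the partially trained CNN; `window` is a
    hypothetical number of consecutive frames given to the autoencoder.
    """
    if len(frames) != len(ground_truth):
        raise ValueError("frames and ground_truth must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")

    # Noisy but realistic poses: the CNN's own per-frame predictions.
    noisy = [cnn_predict(frame) for frame in frames]

    pairs = []
    for end in range(window, len(noisy) + 1):
        noisy_window = noisy[end - window:end]  # current frame plus previous ones
        clean_target = ground_truth[end - 1]    # clean pose for the current frame
        pairs.append((noisy_window, clean_target))
    return pairs
```

Each pair could then serve as one input and target example for a recurrent denoising model. Using the CNN's own predictions, rather than synthetic noise, is what keeps the corrupted inputs realistic in this sketch.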
|Title of host publication||Proceeding of the X Conference on Articulated Motion and Deformable Objects AMDO 2018|
|Number of pages||11|
|Publication status||Published - 01 Sep 2018|
|Name||Image Processing, Computer Vision, Pattern Recognition, and Graphics|