Recently, deep learning techniques have substantially boosted the performance of salient object detection in still images. However, the salient object detection in videos by using traditional handcrafted features or deep learning features is not fully investigated, probably due to the lack of sufficient manually labeled video data for saliency modeling, especially for the data-driven deep learning. This paper proposes a novel weakly supervised approach to salient object detection in a video, which can learn a robust saliency prediction model by using very limited manually labeled data and a large amount of weakly labeled data that could be easily generated in a supervised approach. Furthermore, we propose a spatiotemporal cascade neural network (SCNN) architecture for saliency modeling, in which two fully convolutional networks are cascaded to evaluate visual saliency from both spatial and temporal cues to lead the optimal video saliency prediction. The proposed approach is extensively evaluated on the widely used challenging datasets, and the experiments demonstrate that our proposed approach substantially outperforms the state-of-the-art salient object detection models.
|Number of pages||12|
|Journal||IEEE Transactions on Circuits and Systems for Video Technology|
|Early online date||25 Jul 2018|
|Publication status||Early online date - 25 Jul 2018|