TY - JOUR
T1 - Video salient object detection via spatiotemporal attention neural networks
AU - Tang, Yi
AU - Zou, Wenbin
AU - Hua, Yang
AU - Jin, Zhi
AU - Li, Xia
PY - 2019/9/25
Y1 - 2019/9/25
AB - Recently, deep convolutional neural networks have been widely introduced into image salient object detection and have achieved good performance in this community. However, owing to the complexity of video scenes, video salient object detection with deep learning models remains a challenge. The specific difficulties come from two aspects. First, deep networks designed for image saliency detection cannot capture robust motion cues in video sequences. Second, when fusing spatiotemporal features, existing methods simply apply element-wise addition or concatenation, which does not fully explore contextual information and complementary correlations, so they cannot produce robust spatiotemporal features. To address these issues, we propose a two-stream spatiotemporal attention neural network (STAN) for video salient object detection. We extract rich motion information from optical-flow-based priors and video sequences by means of a long short-term memory (LSTM) network and 3D convolutions. Moreover, an attention module is designed to integrate the different types of spatiotemporal feature maps by learning their corresponding weights. Meanwhile, to obtain sufficient pixel-wise annotated video frames, we manually generate a large number of coarse labels, which are used to train a robust saliency prediction network. Experiments on widely used challenging datasets (e.g., FBMS and DAVIS) show that the proposed STAN achieves competitive performance among salient object detection methods.
U2 - 10.1016/j.neucom.2019.09.064
DO - 10.1016/j.neucom.2019.09.064
M3 - Article
SN - 0925-2312
JO - Neurocomputing
JF - Neurocomputing
ER -