TY - GEN
T1 - SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward
AU - Yan, Shiyang
AU - Hua, Yang
AU - Robertson, Neil
PY - 2019/7/1
Y1 - 2019/7/1
N2 - Image captioning usually employs a Recurrent Neural Network (RNN) to decode the image features from a Convolutional Neural Network (CNN) into a sentence. This RNN model is trained under Maximum Likelihood Estimation (MLE). However, this approach suffers from inherent issues such as the complex memorising mechanism of RNNs and the exposure bias introduced by MLE. Recently, the convolutional captioning model has shown advantages, with a simpler architecture and the capability of parallel training. Nevertheless, MLE training still introduces exposure bias, which prevents the model from achieving better performance. In this paper, we show that the self-critical algorithm can optimise the CNN-based model to alleviate this problem. A ranking metric-based reward, denoted SC-RANK, is proposed, which uses sentence embeddings from a pre-trained language model to generate more diversified captions. Applying SC-RANK avoids tedious tuning of a specially designed language model, and the knowledge transferred from the pre-trained language model proves helpful for image captioning. State-of-the-art results are obtained on the MSCOCO dataset with the proposed SC-RANK.
M3 - Conference contribution
T3 - Communications in Computer and Information Science
BT - Proceedings of the British Machine Vision Conference 2019
PB - Springer
ER -