Image captioning usually employs a Recurrent Neural Network (RNN) to decode the image features from a Convolutional Neural Network (CNN) into a sentence. This RNN model is trained under Maximum Likelihood Estimation (MLE). However, inherent issues like the complex memorising mechanism of the RNNs and the exposure bias introduced by MLE exist in this approach. Recently, the convolutional captioning model shows advantages with a simpler architecture and a parallel training capability. Nevertheless, the MLE training brings the exposure bias which still prevents the model from achieving better performance. In this paper, we prove that the self-critical algorithm can optimise the CNN-based model to alleviate this problem. A ranking metric-based reward, denoted as SC-RANK, is proposed with the sentence embeddings from a pre-trained language model to generate more diversified captions. Applying SC-RANK can avoid the tedious tuning of the specially-designed language model and the knowledge transferred from a pre-trained language model proves to be helpful for image captioning tasks. State-of-the-art results have been obtained in the MSCOCO dataset by proposed SC-RANK.
|Title of host publication||Proceedings of the British Machine Vision Conference 2019|
|Number of pages||14|
|Publication status||Accepted - 01 Jul 2019|
|Name||Communications in Computer and Information Science |