Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons learned

In this blog post, I will share some insights and lessons learned from our recent work on an approach that should work in theory (i.e. BERT+GloVe for semantic similarity) but, in practice, does not work in our scenario. Recent state-of-the-art progress in pre-trained vision-and-language models (e.g. captioning models) relies heavily on long training on abundant data. These accuracy improvements depend on many training iterations and on the availability of computational resources (i.e. GPU, TPU, etc.), which costs time and energy. In some cases, the improvement after re-training is less than 1 point on the benchmark dataset. In this work, we introduce a post-processing method that can be applied to any caption system and only needs to be trained once. In particular, we propose to improve caption generation systems by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model revises the beam search output of the language generation from a visual context perspective.


Image Captioning System. Automatic image captioning is a fundamental task that incorporates vision and language. The task can be tackled in two stages: first, visual information extraction from the image, and then linguistic description generation. Most models couple the relations between visual and linguistic information via a Convolutional Neural Network (CNN) that encodes the input image and a Long Short-Term Memory (LSTM) network for language generation. Recently, self-attention has been used to learn these relations via Transformers or Transformer-based encoder models like Vision and Language BERT. These systems show promising results on benchmark datasets such as Flickr and COCO. However, the lexical diversity of the generated captions remains a relatively unexplored research problem. Lexical diversity refers to how accurate the generated description is for a given image. An accurate caption should provide details about specific and relevant aspects of the image. Caption lexical diversity can be divided into three levels: word level (different words), syntactic level (word order), and semantic level (relevant concepts). In this work, we approach word-level diversity by learning the semantic correlation between the caption and its visual context, as shown in Figure 1 (below), where the visual information from the image is used to learn the semantic relation with the caption at both the word and sentence level.

Visual Context Image Captioning System. Modern sophisticated image captioning systems focus heavily on visual grounding to capture real-world scenarios. Early works built a visual detector to guide and re-rank image captions with a global similarity. Other work investigates the informativeness of object information (e.g. object frequency) in end-to-end caption generation. Cornia et al. propose controlled caption language grounding through visual regions from the image. Chen et al. rely on abstract scene concepts (objects, relationships, and attributes) grounded in the image to learn accurate semantics without labels for image captioning. More recently, Zhang et al. incorporate different concepts such as scene graphs, objects, and attributes to learn correct linguistic and visual relevance for better caption language grounding.

Our approach is inspired by these works: by re-ranking via visual information, by Wang et al. and Cornia et al., who explored the benefit of object information in image captioning, by Gupta et al., who used language modeling to extract contextualized word representations, and by the exploitation of semantic coherency in caption language grounding. We propose a visual grounding-based object scorer that re-ranks the candidates so that the caption most closely related to the image comes first, using both static and contextualized semantic similarity.

Beam Search Caption Extraction. We employ the three most common architectures for caption generation to extract the top beam search candidates. The first baseline is based on the standard shallow CNN-LSTM model. The second, VilBERT, is fine-tuned on a total of 12 different vision and language datasets, such as caption image retrieval. Finally, the third baseline is a specialized Transformer-based caption generator.

Learning Word to Sentence Visual Semantic

Problem Formulation. Beam search is the dominant method for approximate decoding in structured prediction tasks such as machine translation, speech recognition, and image captioning. A larger beam size allows the model to explore the search space better than greedy decoding. Our goal is to leverage the visual context information of the image to re-rank the candidate sequences obtained through beam search, thereby moving the most visually relevant candidates up in the list and moving incorrect candidates down.

Figure 1 Architecture. Detail of the proposed architecture to estimate the semantic score between the candidate caption (provided by an off-the-shelf image caption system) and the context in the image. We employ the visual context from the image at the word and sentence level to re-rank the caption most related to its visual context. An example from the Transformer caption generator shows how the visual re-ranker (Visual Beam) uses the semantic relation to re-rank the most descriptive caption.

Word level similarity. To learn the semantic relation between a caption and its visual context at the word level, we first employ a bidirectional LSTM-based CopyRNN keyphrase extractor to extract keyphrases from the sentence as context. (Note that pre-trained out-of-the-box models, such as KeyBERT, are also available.) The model is trained on two combined pre-processed datasets: (1) wikidump (i.e. keyword, short sentence) and (2) SemEval 2017 Task 10 (keyphrases from scientific publications). Secondly, GloVe is used to compute the cosine similarity between the context extracted from the caption and its related visual context. Lastly, following prior work, we use the object confidence score (i.e. visual context) in the image to convert the similarity to a probability. (A conditional probability model converts the similarity to a probability by revising a given hypothesis if two conditions are satisfied. In our case, the hypothesis is the keyphrases extracted from the caption, and the conditions for revising it are (1) the informativeness of the visual context from the image (confidence score) and (2) the degree of semantic relatedness between them, i.e. the similarity score sim(keyphrase, visual context). Note that we employ a language model (e.g. a unigram language model) to initialize the hypothesis, as recommended by the authors.) For example, from “a woman in a red dress and a black skirt walks down a sidewalk” the model will extract dress and walks, which are the highlight keywords of the caption.
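The word-level expert can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this work: the 4-dimensional vectors are made-up stand-ins for GloVe embeddings, and `revise_probability` uses one assumed form of similarity-to-probability revision, since the exact formula is not given in this post. The function names and all numbers are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def revise_probability(p_hypothesis, similarity, p_context):
    """Assumed similarity-to-probability revision (illustrative form only).

    p_hypothesis: prior probability of the keyphrase (e.g. from a unigram LM)
    similarity:   cosine similarity between keyphrase and visual context, in [0, 1)
    p_context:    classifier confidence for the visual context, in (0, 1]
    """
    # A higher similarity shrinks alpha, which pushes the prior towards 1.
    alpha = ((1.0 - similarity) / (1.0 + similarity)) ** (1.0 - p_context)
    return p_hypothesis ** alpha


# Toy vectors standing in for GloVe embeddings (values are made up).
toy_embeddings = {
    "dress": np.array([0.9, 0.1, 0.3, 0.0]),
    "gown": np.array([0.8, 0.2, 0.4, 0.1]),
    "umbrella": np.array([0.0, 0.9, 0.1, 0.7]),
}

sim_related = cosine_similarity(toy_embeddings["dress"], toy_embeddings["gown"])
sim_unrelated = cosine_similarity(toy_embeddings["dress"], toy_embeddings["umbrella"])

# Same prior and same visual confidence, different semantic relatedness.
p_related = revise_probability(0.01, sim_related, p_context=0.6)
p_unrelated = revise_probability(0.01, sim_unrelated, p_context=0.6)

print(sim_related, sim_unrelated)
print(p_related, p_unrelated)
```

With this assumed form, a similarity of zero leaves the prior unchanged, and a keyphrase that is closer to the visual context ends up with a higher revised probability.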

Sentence level similarity. We fine-tune the BERT base model to learn the visual context information. The model learns a dictionary-like word-to-sentence relation. We use the visual data as context for the sentence via cosine distance.

BERT. BERT achieves remarkable results on many sentence-level tasks, especially on the semantic textual similarity task (STS-B). Therefore, we fine-tuned BERT-base on the training dataset (textual information, 460k captions: 373k for training and 87k for validation), where each example is a triple of visual context, caption, and label (semantically related or not related). We use a binary classification cross-entropy loss function with targets in [0,1], where the target is the semantic similarity between the visual context and the candidate caption.
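To make the training signal concrete, here is a small sketch of the (visual, caption, label) format and the binary cross-entropy loss. The triples and the predicted probabilities are made up for illustration; in the real setup the predictions would come from the fine-tuned model, which is not reproduced here.

```python
import math


def binary_cross_entropy(p_related, label):
    """Cross-entropy for one example; label is 1 (related) or 0 (not related)."""
    eps = 1e-12  # guards against log(0)
    p = min(max(p_related, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical (visual context, caption, label) triples.
examples = [
    ("umbrella", "a woman holding an umbrella in the rain", 1),
    ("umbrella", "a plate of pasta on a wooden table", 0),
]

# Stand-in model outputs: predicted probability that the pair is related.
predicted = [0.9, 0.2]

losses = [binary_cross_entropy(p, label) for p, (_, _, label) in zip(predicted, examples)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

A confident and correct prediction gives a small loss, while a confident and wrong prediction is penalized heavily.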

Sentence RoBERTa. RoBERTa is an improved version of BERT, and since RoBERTa-Large is more robust, we rely on pre-trained Sentence-RoBERTa-sts (fine-tuned on the STS task), as it yields a better cosine score.

Fusion Similarity Expert. Product of experts (PoE) combines the expertise of each expert (model) in a collaborative manner. It allows each expert to specialize in analyzing one particular aspect of the problem and to establish a judgment based on that aspect. Inspired by PoE, we combine the two experts, word level and sentence level, as a late fusion, as shown in Figure 1. PoE takes advantage of each expert and can produce much sharper distributions than a single model. The combined probability of a given candidate caption \(\mathbf{w}\), with the semantic relation with the visual context, can be written as:

\[ P\left(\mathbf{w} | \theta_{1} \ldots \theta_{n}\right)=\frac{\Pi_{m} p_{m}\left(\mathbf{w} | \theta_{m}\right)} {\sum_{\mathbf{c}} \Pi_{m} p_{m}\left(\mathbf{c} | \theta_{m}\right)} \]

where \(\theta_{m}\) are the parameters of each model \(m\), \(p_{m}\left(\mathbf{w} | \theta_{m}\right)\) is the probability of \(\mathbf{w}\) under the model \(m\), and \(\mathbf{c}\) indexes all possible vectors in the data space. Since we are only interested in retrieving the candidate caption with the highest probability after re-ranking, we do not need to normalize. Therefore, we compute:

\[\arg \max _{\mathbf{w}} {\Pi_{m} p_{m}\left(\mathbf{w} | \theta_{m}\right)}\]

where, \(p_{m}\left(\mathbf{w} | \theta_{m}\right)\) are the probabilities (i.e. semantic relatedness score) assigned by each expert to the candidate caption \(\mathbf{w}\) with the semantic relation with the visual context of the image.
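As a rough sketch of this fusion step, the snippet below re-ranks a few candidates by the unnormalized product of two expert scores. The captions and scores are invented for illustration, and the function name is hypothetical. Summing logs is used instead of multiplying directly, which gives the same ordering.

```python
import math


def rerank_by_product_of_experts(candidates, expert_scores):
    """Sort candidates by the unnormalized product of their expert scores.

    candidates:    list of caption strings
    expert_scores: one tuple per caption, with one probability per expert
    """
    scored = []
    for caption, scores in zip(candidates, expert_scores):
        # The sum of logs equals the log of the product and avoids underflow.
        log_product = sum(math.log(max(s, 1e-12)) for s in scores)
        scored.append((log_product, caption))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]


# Made-up beam candidates with (word-level, sentence-level) scores.
captions = [
    "a woman walking down a street",
    "a woman in a red dress walks down a sidewalk",
    "a person standing outside",
]
scores = [(0.30, 0.40), (0.60, 0.55), (0.20, 0.90)]

ranked = rerank_by_product_of_experts(captions, scores)
print(ranked[0])

# Dividing every product by the same normalizer does not change the winner.
products = [a * b for a, b in scores]
normalized = [p / sum(products) for p in products]
best_unnormalized = max(range(len(products)), key=products.__getitem__)
best_normalized = max(range(len(normalized)), key=normalized.__getitem__)
```

Because the normalizer is shared by all candidates, the candidate with the largest product is also the one with the largest normalized probability.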


We evaluate the proposed approach on two datasets of different sizes. The idea is to evaluate our method on the most common caption datasets in two scenarios: (1) a shallow CNN-LSTM model (i.e. less data), and (2) a system that is trained on a huge amount of data (e.g. Transformer).

♠ Flickr8k. The dataset contains 8k images, and each image has five human-annotated captions. We use this data to train the shallow CNN-LSTM model (6270 train/1730 test).

♣ COCO. It contains around 120k images, and each image is annotated with five different human-written captions. We use the most common split, provided by Karpathy et al., where 5k images are used for testing, 5k for validation, and the rest for training the Transformer baseline.

Visual Context Dataset. Although there are many public datasets for image captioning, they contain no textual visual information such as the objects in the image. We enrich the two datasets mentioned above with textual visual context information. In particular, to automate visual context generation without the need for human labeling, we use ResNet-152 to extract the top-k (k = 3) visual context information for each image in the caption dataset.
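The top-k selection can be sketched as below. The logits and class names are made up and simply stand in for the output of an image classifier such as ResNet-152; no model is loaded here, and the function name is hypothetical.

```python
import numpy as np


def top_k_visual_context(logits, class_names, k=3):
    """Return the k most confident (label, probability) pairs."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    top = np.argsort(probs)[::-1][:k]  # indices sorted by descending confidence
    return [(class_names[i], float(probs[i])) for i in top]


# Hypothetical classifier output for one image.
class_names = ["umbrella", "sidewalk", "dress", "pizza", "dog"]
logits = [2.0, 1.0, 3.0, -1.0, 0.5]

context = top_k_visual_context(logits, class_names, k=3)
for label, confidence in context:
    print(label, round(confidence, 3))
```

Each label is kept together with its confidence, since the confidence score is later used when converting similarity to probability.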

Evaluation Metric. We use the official COCO offline evaluation suite, producing several widely used caption quality metrics: BLEU, METEOR, ROUGE, CIDEr, and BERTScore (B-S).

Results and Analysis

We use visual semantic information to re-rank candidate captions produced by out-of-the-box state-of-the-art caption generators. We extract the top-20 beam search candidate captions from three different architectures: (1) a standard CNN+LSTM model, (2) a pre-trained language and vision model, VilBERT, fine-tuned on a total of 12 different vision and language datasets such as caption image retrieval, and (3) a specialized caption-based Transformer.

Model                            B-4    M      R      C      B-S
♠ Show and Tell-BeamS            0.035  0.093  0.270  0.035  0.8871
Tell+VR_V1-BERT-GloVe            0.035  0.095  0.273  0.036  0.8855
Tell+VR_V2-BERT-GloVe            0.037  0.099  0.277  0.041  0.8850
Tell+VR_V1-RoBERTa-GloVe (sts)   0.037  0.101  0.273  0.036  0.8839
Tell+VR_V2-RoBERTa-GloVe (sts)   0.035  0.095  0.273  0.036  0.8869
♣ VilBERT-BeamS                  0.336  0.271  0.543  1.027  0.9363
Vil+VR_V1-BERT-GloVe             0.334  0.273  0.544  1.034  0.9365
Vil+VR_V2-BERT-GloVe             0.334  0.273  0.545  1.034  0.9365
Vil+VR_V1-RoBERTa-GloVe (sts)    0.335  0.273  0.544  1.036  0.9365
Vil+VR_V2-RoBERTa-GloVe (sts)    0.338  0.272  0.545  1.040  0.9366
♣ Transformer-BeamS              0.374  0.278  0.569  1.153  0.9399
Trans+VR_V1-BERT-GloVe           0.371  0.278  0.567  1.149  0.9398
Trans+VR_V2-BERT-GloVe           0.371  0.278  0.568  1.150  0.9399
Trans+VR_V1-RoBERTa-GloVe (sts)  0.370  0.277  0.567  1.145  0.9395
Trans+VR_V2-RoBERTa-GloVe (sts)  0.370  0.277  0.567  1.145  0.9395
Table 1 Performance of compared baselines on the ♠ Flickr and ♣ COCO test splits. Columns: B-4 (BLEU-4), M (METEOR), R (ROUGE), C (CIDEr), B-S (BERTScore).
The ♠ refers to the Flickr 1730 test set from the Flickr8k dataset, and the ♣ refers to the COCO Caption Karpathy 5k test set.

Experiments applying different re-rankers to each base system are shown in Table 1 (above). The tested re-rankers are: (1) VR_BERT+GloVe, which uses BERT and GloVe similarity between the candidate caption and the visual context to obtain the re-ranked score, where the top-k visual contexts, visual 1 (V_1) and visual 2 (V_2), are used one at a time during inference; and (2) VR_RoBERTa+GloVe, which carries out the same procedure using the similarity produced by Sentence RoBERTa.

Our re-ranker produced mixed results as the model struggles when the beam search is less diverse. The model is therefore not able to select the most closely related caption to its environmental context as shown in Figure 2, which is a visualization of the final visual beam re-ranking.

Figure 2 Visualization of the top-15 beam search candidates after visual re-ranking. The colors babyblue (\(\leq 0.4\)) and darksalmon (\(\leq 0.8\)) represent the degree of change in probability after visual re-ranking. We can also observe that a less diverse beam negatively impacted the score, as in the case of the Transformer and Show and Tell baselines.

Evaluation of Lexical Diversity. As shown in Table 2 (below), we evaluate the model from a lexical diversity perspective. We can conclude that (1) the vocabulary is larger and (2) the number of unique words per caption is also improved, even with a lower Type-Token Ratio (TTR), i.e. the number of unique words or types divided by the total number of tokens in a text fragment.

Although this approach ranks more diverse captions higher, the improvement is not strong enough to impact the benchmark results positively, as shown in Table 1.

Model                  Voc  TTR   Uniq  WPC
♠ Show and Tell-BeamS  304  0.79  10.4  12.7
Tell+VR-RoBERTa        310  0.82  9.42  13.5
♣ VilBERT-BeamS        894  0.87  8.05  10.5
Vil+VR-RoBERTa         953  0.85  8.86  10.8
♣ Transformer-BeamS    935  0.86  7.44  9.62
Trans+VR-BERT          936  0.86  7.48  8.68
Table 2 Measuring the lexical diversity of captions without and with re-ranking.
The Uniq and WPC columns indicate the average number of unique words and total words per caption, respectively.
The ♠ refers to the Flickr 1730 test set, and the ♣ refers to the COCO Karpathy 5k test set.
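For reference, the four diversity measures can be computed along these lines. This is a sketch under assumptions: whitespace tokenization, lowercasing, and TTR averaged per caption are choices made here for illustration, and the two example captions are made up.

```python
def lexical_diversity(captions):
    """Compute vocabulary size, mean TTR, and mean unique/total words per caption."""
    tokenized = [caption.lower().split() for caption in captions]
    vocabulary = {token for tokens in tokenized for token in tokens}
    n = len(tokenized)
    return {
        "vocab_size": len(vocabulary),
        # Assumption: TTR is computed per caption and then averaged.
        "ttr": sum(len(set(t)) / len(t) for t in tokenized) / n,
        "unique_per_caption": sum(len(set(t)) for t in tokenized) / n,
        "words_per_caption": sum(len(t) for t in tokenized) / n,
    }


sample_captions = [
    "a dog runs on the grass",
    "a man and a woman walk a dog",
]
stats = lexical_diversity(sample_captions)
print(stats)
```

A caption that repeats words, like the second one, lowers the TTR even when the number of unique words stays the same.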

Ablation Study. We performed an ablation study to investigate the effectiveness of each model. In the proposed architecture, each expert tries to learn a different representation, at the word level and at the sentence level. In this experiment, we trained each model separately, as shown in Table 3 (Bottom ♣). GloVe as a stand-alone expert performed better than the combined model (and thus, the combined model hurts the accuracy). To investigate this further, we visualized each expert before the fusion layers, as shown in Figure 3.

Figure 3 (\(\mathbf{Top}\) ♠) 1k random samples from the Flickr test set with the Show and Tell model. Each expert contributes a different probability confidence, and therefore the model learns the semantic relation at the word level and the sentence level. (\(\mathbf{Bottom}\) ♣) Karpathy 5k test set from COCO Caption with the Transformer-based caption model. The GloVe score dominates the distribution and becomes the main expert.

Model              B-4    M      R      C      B-S
Transformer-BeamS  0.374  0.278  0.569  1.153  0.9399
+VR-RoBERTa-GloVe  0.370  0.277  0.567  1.145  0.9395
+VR-BERT-GloVe     0.370  0.371  0.278  1.149  0.9398
+VR-RoBERTa-BERT   0.369  0.278  0.567  1.144  0.9395
+VR_V1-GloVe       0.371  0.278  0.568  1.148  0.9398
+VR_V2-GloVe       0.371  0.371  0.278  1.149  0.9398
Table 3 Ablation study. We analyze the contribution of each expert to the final accuracy.
The colors correspond to Figure 3. Figure 3 (Bottom ♣) shows that BERT does not contribute as much as GloVe to the final score.

Limitation. In contrast to CNN-LSTM (♠ top of Figure 3), where each expert contributes to the final decision, we observed that a shorter caption (with less context) can influence the BERT similarity score negatively. Therefore, GloVe dominates as the main expert, as shown in Figure 3 (♣ Bottom). Another limitation is the fluctuation of the stand-alone word similarity score, i.e. sim(keyphrase, visual) between the keyphrases from the caption and the visual context. Also, the visual classifier struggles with complex backgrounds (i.e. misclassified and hallucinated objects), which results in an inaccurate semantic score.


In this work, we introduced an approach that overcomes the limitation of beam search and avoids re-training for better accuracy. We proposed a combined word and sentence visual beam search re-ranker. However, we discovered that word and sentence similarity disagree with each other when the beam search is less diverse. Our experiments also highlight the usefulness of the model by showing successful cases.
