## Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned

In this blog post, I share some insights and lessons learned from a recent research idea that should work in theory (i.e., BERT+GloVe) but did not work in practice in our scenario. Recent state-of-the-art progress in pre-trained vision-and-language and image captioning models relies heavily on long training on abundant data. These accuracy improvements depend on many training iterations and on the availability of computational resources (e.g., GPUs and TPUs), which costs time and energy. In some cases, the improvement after re-training is less than 1 point on the benchmark dataset. In this work, we introduce a post-processing approach that can be applied to any caption system and only needs to be trained once. In particular, we improve caption generation systems by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model revises the beam search output of the language generation model from a visual context perspective.

## Introduction

Image Captioning System. Automatic caption generation is a fundamental task that combines vision and language. The task can be tackled in two stages: first, visual information extraction from the image, and then linguistic description generation. Most models couple visual and linguistic information by using a Convolutional Neural Network (CNN) to encode the input image and a Long Short-Term Memory (LSTM) network for language generation. Recently, self-attention has been used to learn these relations via Transformers or Transformer-based models such as Vision and Language BERT. These systems show promising results on benchmark datasets such as COCO. However, the lexical diversity of the generated captions remains a relatively unexplored research problem. Lexical diversity refers to how accurate the generated description is for a given image. An accurate caption should provide details about specific and relevant aspects of the image. Caption lexical diversity can be divided into three levels: word level (different words), syntactic level (word order), and semantic level (relevant concepts). In this work, we approach word-level diversity by learning the semantic correlation between the caption and its visual context, as shown in Figure 1 (below), where the visual information from the image is used to learn the semantic relation to the caption at both the word and the sentence level.

Visual Context Image Captioning System. Modern image captioning systems focus heavily on visual grounding to capture real-world scenarios. Early works built a visual detector to guide and re-rank image captions with a global similarity. Other work investigates the informativeness of object information (e.g., object frequency) in end-to-end caption generation. Cornia et al. propose controlled caption language grounding through visual regions of the image. Chen et al. rely on abstract scene concepts (object, relationship, and attribute) grounded in the image to learn accurate semantics without labels for image captioning. More recently, Zhang et al. incorporate different concepts such as scene graphs, objects, and attributes to learn correct linguistic and visual relevance for better caption language grounding.

We are inspired by several of these works: re-ranking via visual information; Wang et al. and Cornia et al., who explored the benefit of object information in image captioning; Gupta et al., who benefit from language modeling to extract contextualized word representations; and the exploitation of semantic coherency in caption language grounding. Building on them, we propose a visual grounding-based object scorer that re-ranks the most closely related caption using both static and contextualized semantic similarity.

Beam search caption extraction: Baselines. We employ the three most common architectures for caption generation to extract the top beam search candidates. The first baseline is the standard shallow CNN-LSTM model. The second, VilBERT, is fine-tuned on a total of 12 different vision and language datasets, including caption image retrieval. The third baseline is a specialized Transformer-based caption generator.

## Learning Word to Sentence Visual Semantic

Problem Formulation. Beam search is the dominant method for approximate decoding in structured prediction tasks such as machine translation, speech recognition, and image captioning. A larger beam size allows the model to explore the search space better than greedy decoding. Our goal is to leverage the visual context information of the image to re-rank the candidate sequences obtained through beam search, thereby moving the most visually relevant candidate up in the list and moving incorrect candidates down.
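To make the re-ranking idea concrete, here is a minimal sketch, not our actual implementation. The function names `rerank_beam` and `toy_overlap_similarity` are hypothetical, and the similarity function is a deliberately trivial stand-in for the learned word and sentence similarity described in the rest of this post.

```python
from typing import Callable, List, Tuple


def toy_overlap_similarity(visual_context: str, caption: str) -> float:
    """Hypothetical stand-in for a learned similarity model:
    1.0 if the visual-context word occurs in the caption, else 0.0."""
    return 1.0 if visual_context.lower() in caption.lower().split() else 0.0


def rerank_beam(
    candidates: List[str],
    visual_context: str,
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every beam candidate against the visual context and
    return (caption, score) pairs sorted from most to least related.
    Ties keep their original beam order because sorted() is stable."""
    scored = [(cap, similarity(visual_context, cap)) for cap in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    beam = [
        "a man riding a wave",
        "a man on a surfboard in the ocean",
        "a person in the water",
    ]
    for caption, score in rerank_beam(beam, "surfboard", toy_overlap_similarity):
        print(f"{score:.1f}  {caption}")
```

Any scorer with the same signature can be plugged in, which is what makes the approach a post-processing step independent of the captioning model.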

Word level similarity. We learn the semantic relation between a caption and its visual context at the word level in two steps. First, we employ a bidirectional LSTM-based CopyRNN keyphrase extractor to extract keyphrases from the sentence as context. The model is trained on two combined pre-processed datasets: (1) wikidump (i.e., keywords and short sentences) and (2) SemEval 2017 Task 10 (keyphrases from scientific publications). Second, GloVe is used to compute the cosine similarity between the visual context and its related context. For example, from the caption “a woman in a red dress and a black skirt walks down a sidewalk”, the model extracts dress and walks, which are the highlight keywords of the caption.
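The cosine step can be illustrated with a small sketch. The three-dimensional vectors below are made-up toy values, not real GloVe embeddings, and taking the maximum over keyphrases is an assumption made for illustration, not necessarily how our model aggregates the scores.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only; real GloVe vectors have 50 to 300 dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "dress": [0.9, 0.1, 0.0],
    "gown": [0.8, 0.2, 0.1],
    "sidewalk": [0.0, 0.2, 0.9],
}


def word_level_similarity(
    visual_word: str, keyphrases: List[str], vectors: Dict[str, List[float]]
) -> float:
    """Best cosine match between the visual-context word and any extracted
    keyphrase (max aggregation is an assumption). Unknown words are skipped."""
    if visual_word not in vectors:
        return 0.0
    scores = [
        cosine_similarity(vectors[visual_word], vectors[k])
        for k in keyphrases
        if k in vectors
    ]
    return max(scores, default=0.0)


if __name__ == "__main__":
    print(word_level_similarity("gown", ["dress", "sidewalk"], TOY_VECTORS))
```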

Sentence level similarity. We fine-tune the BERT base model to learn the visual context information. The model learns a dictionary-like word-to-sentence relation. We use the visual data as context for the sentence via cosine distance.

BERT. BERT achieves remarkable results on many sentence-level tasks, especially the textual semantic similarity task (STS-B). Therefore, we fine-tuned BERT_base on the training dataset (textual information only, 460k captions: 373k for training and 87k for validation), i.e., triples of visual context, caption, and label (semantically related or not related), with a binary cross-entropy loss, where the target in [0,1] is the semantic similarity between the visual context and the candidate caption.
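For readers less familiar with the objective, the following is a small sketch of binary cross-entropy for a single (visual, caption, label) example. The function name and the clipping constant `eps` are illustrative assumptions; in practice the loss comes from the deep learning framework used for fine-tuning.

```python
import math


def binary_cross_entropy(predicted: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one example.

    predicted: model probability that the caption is related to the visual context.
    target:    label in [0, 1] (1 = semantically related, 0 = not related).
    eps:       clipping constant (an assumption) that keeps log() finite.
    """
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


if __name__ == "__main__":
    # A confident correct prediction is penalized less than a confident wrong one.
    print(binary_cross_entropy(0.9, 1.0))
    print(binary_cross_entropy(0.1, 1.0))
```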

Sentence RoBERTa. RoBERTa is an improved version of BERT, and since RoBERTa Large is more robust, we rely on the pre-trained SentenceRoBERTa-sts model, as it yields a better cosine score.

Fusion Similarity Expert. Product of experts (PoE) combines the expertise of each expert (model) in a collaborative manner. It allows each expert to specialize in analyzing one particular aspect of the problem and to establish a judgment based on that aspect. Inspired by PoE, we combine the two experts, word level and sentence level, in a final fusion step, as shown in Figure 1. PoE takes advantage of each expert and can produce much sharper distributions than a single model. The PoE is computed as follows:


$P\left(\mathbf{w} | \theta_{1} .. \theta_{n}\right)=\frac{\Pi_{m} p_{m}\left(\mathbf{w} | \theta_{m}\right)} {\sum_{\mathbf{c}} \Pi_{m} p_{m}\left(\mathbf{c} | \theta_{m}\right)}$

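where $$\mathbf{w}$$ is a data vector in the discrete space, $$\theta_{m}$$ are the parameters of each model $$m$$, $$p_{m}\left(\mathbf{w} | \theta_{m}\right)$$ is the probability of $$\mathbf{w}$$ under model $$m$$, and $$\mathbf{c}$$ indexes all possible vectors in the data space. Because the denominator is the same for every candidate, selecting the best candidate only requires the numerator: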

$\arg \max _{\mathbf{w}} {\Pi_{m} p_{m}\left(\mathbf{w} | \theta_{m}\right)}$

where $$p_{m}\left(\mathbf{w} | \theta_{m}\right)$$ are the probabilities assigned by each expert to the candidate word $$\mathbf{w}$$.
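The two formulas above can be checked with a short sketch. The candidate names and expert probabilities are made-up values, and `product_of_experts` is a hypothetical helper, not the code used in our experiments. It shows that normalizing by the sum over candidates does not change which candidate wins.

```python
import math
from typing import Dict, List, Tuple


def product_of_experts(
    expert_probs: Dict[str, List[float]]
) -> Tuple[str, Dict[str, float]]:
    """Combine per-expert probabilities for each candidate.

    expert_probs maps a candidate to the probabilities assigned by each expert
    (e.g. [word-level score, sentence-level score]).
    Returns the arg-max candidate and the normalized PoE distribution.
    """
    # Numerator of the PoE: product over experts for each candidate.
    products = {cand: math.prod(probs) for cand, probs in expert_probs.items()}
    # Denominator: sum of the products over all candidates.
    total = sum(products.values())
    normalized = {
        cand: (p / total if total > 0 else 0.0) for cand, p in products.items()
    }
    # The arg max only needs the unnormalized products.
    best = max(products, key=products.get)
    return best, normalized


if __name__ == "__main__":
    toy_scores = {
        "caption_a": [0.9, 0.2],
        "caption_b": [0.6, 0.6],
        "caption_c": [0.3, 0.9],
    }
    best, dist = product_of_experts(toy_scores)
    print(best, dist)
```

Note how a candidate that both experts rate moderately can beat one that a single expert rates very highly, which is the behaviour the product is meant to encourage.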

## Dataset

We evaluate the proposed approach on two datasets of different sizes. The idea is to evaluate our method on the most common caption datasets in two scenarios: (1) a shallow CNN-LSTM model (i.e., less data) and (2) a system that is trained on a huge amount of data (i.e., Transformer).

♠ Flickr 8K. The dataset contains 8K images, and each image has five human-annotated captions. We use this data to train the shallow model (6270 train/1730 test).

♣ COCO. It contains around 120K images, and each image is annotated with five different human-written captions. We use the widely used Karpathy split, where 5k images are used for testing, 5k for validation, and the rest for training the Transformer baseline.

Visual Context Dataset. Although many public caption datasets exist, they contain no textual visual information such as the objects in the image. We therefore enrich the two datasets mentioned above with textual visual context information. In particular, to automate visual context generation without the need for human labeling, we use ResNet152 to extract the top-k (k = 3) visual context labels for each image in the caption dataset.
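The top-k selection itself is simple, as the sketch below shows. The label names and probabilities are invented for illustration; in our setting they would come from the classifier's output for an image. The helper name `top_k_visual_context` is hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_visual_context(
    label_probs: Dict[str, float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable class labels as (label, probability) pairs,
    ordered from most to least probable."""
    ranked = sorted(label_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up classifier output for a single image.
    probs = {
        "surfboard": 0.62,
        "sandbar": 0.20,
        "seashore": 0.10,
        "paddle": 0.05,
        "wetsuit": 0.03,
    }
    for label, prob in top_k_visual_context(probs, k=3):
        print(f"{label}: {prob:.2f}")
```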

Evaluation Metric. We use the official COCO offline evaluation suite, which produces several widely used caption quality metrics: BLEU, METEOR, ROUGE, CIDEr, and BERTscore (B-S).

## Results and Analysis

We use visual semantic information to re-rank candidate captions produced by out-of-the-box state-of-the-art caption generators. We extract the top-20 beam search candidate captions from three different architectures: (1) a standard CNN+LSTM model, (2) a pre-trained language and vision model, VilBERT, fine-tuned on a total of 12 different vision and language datasets, including caption image retrieval, and (3) a specialized caption-based Transformer.

| Model | B-1 | B-2 | B-3 | B-4 | M | R | C | B-S |
|---|---|---|---|---|---|---|---|---|
| ♠ Tell-BeamS | 0.331 | 0.159 | 0.071 | 0.035 | 0.093 | 0.270 | 0.035 | 0.8871 |
| Tell+VR_V1-BERT-Glove | 0.330 | 0.158 | 0.069 | 0.035 | 0.095 | 0.273 | 0.036 | 0.8855 |
| Tell+VR_V2-BERT-Glove | 0.320 | 0.154 | 0.073 | 0.037 | 0.099 | 0.277 | 0.041 | 0.8850 |
| Tell+VR_V1-RoBERTa-Glove (sts) | 0.313 | 0.153 | 0.072 | 0.037 | 0.101 | 0.273 | 0.036 | 0.8839 |
| Tell+VR_V2-RoBERTa-Glove (sts) | 0.330 | 0.158 | 0.069 | 0.035 | 0.095 | 0.273 | 0.036 | 0.8869 |
| ♣ Vil-BeamS | 0.739 | 0.577 | 0.440 | 0.336 | 0.271 | 0.543 | 1.027 | 0.9363 |
| Vil+VR_V1-BERT-Glove | 0.739 | 0.576 | 0.438 | 0.334 | 0.273 | 0.544 | 1.034 | 0.9365 |
| Vil+VR_V2-BERT-Glove | 0.740 | 0.578 | 0.439 | 0.334 | 0.273 | 0.545 | 1.034 | 0.9365 |
| Vil+VR_V1-RoBERTa-Glove (sts) | 0.738 | 0.576 | 0.440 | 0.335 | 0.273 | 0.544 | 1.036 | 0.9365 |
| Vil+VR_V2-RoBERTa-Glove (sts) | 0.740 | 0.579 | 0.442 | 0.338 | 0.272 | 0.545 | 1.040 | 0.9366 |
| ♣ Trans-BeamS | 0.780 | 0.631 | 0.491 | 0.374 | 0.278 | 0.569 | 1.153 | 0.9399 |
| Trans+VR_V1-BERT-Glove | 0.780 | 0.629 | 0.487 | 0.371 | 0.278 | 0.567 | 1.149 | 0.9398 |
| Trans+VR_V2-BERT-Glove | 0.780 | 0.630 | 0.488 | 0.371 | 0.278 | 0.568 | 1.150 | 0.9399 |
| Trans+VR_V1-RoBERTa-Glove (sts) | 0.779 | 0.629 | 0.487 | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |
| Trans+VR_V2-RoBERTa-Glove (sts) | 0.779 | 0.629 | 0.487 | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |

Table 1. ♠ refers to the Flickr 1730 test set, and ♣ refers to the COCO Karpathy 5K test set.

The results of applying different re-rankers to each base system are shown in Table 1 (above). The tested re-rankers are: (1) VR_BERT+GloVe, which uses BERT and GloVe similarity between the candidate caption and the visual context (top-k V_1 and V_2 during inference) to obtain the re-ranked score, and (2) VR_RoBERTa+GloVe, which carries out the same procedure using the similarity produced by Sentence RoBERTa.

Our re-ranker produced mixed results, as the model struggles when the beam search is less diverse. In that case, the model is not able to select the caption most closely related to its environmental context, as shown in Figure 2, which visualizes the final visual beam re-ranking.

Evaluation of Lexical Diversity. As shown in Table 2 (below), we also evaluate the model from a lexical diversity perspective. We can conclude that re-ranking yields (1) a larger vocabulary and (2) more unique words per caption, even with a lower Type-Token Ratio (TTR), where TTR is the number of unique words (types) divided by the total number of tokens in a text fragment.
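As a rough illustration of these diversity measures, the sketch below computes them for a list of captions. Lowercased whitespace tokenization and averaging TTR per caption are assumptions made for this example and may differ from the exact protocol behind Table 2; the function names are hypothetical.

```python
from typing import Dict, List


def type_token_ratio(tokens: List[str]) -> float:
    """Unique tokens (types) divided by total tokens; 0.0 for an empty fragment."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def diversity_stats(captions: List[str]) -> Dict[str, float]:
    """Vocabulary size, mean per-caption TTR, mean unique words per caption (Uniq),
    and mean words per caption (WPC). Empty captions are ignored."""
    tokenized = [caption.lower().split() for caption in captions]
    tokenized = [tokens for tokens in tokenized if tokens]
    if not tokenized:
        return {"voc": 0, "ttr": 0.0, "uniq": 0.0, "wpc": 0.0}
    n = len(tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "voc": len(vocabulary),
        "ttr": sum(type_token_ratio(tokens) for tokens in tokenized) / n,
        "uniq": sum(len(set(tokens)) for tokens in tokenized) / n,
        "wpc": sum(len(tokens) for tokens in tokenized) / n,
    }


if __name__ == "__main__":
    print(diversity_stats(["a dog runs", "a dog chases a ball"]))
```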

Although this approach ranks more diverse captions higher, the improvement is not strong enough to have a positive impact on the benchmark results, as shown in Table 1.

| Model | Voc | TTR | Uniq | WPC |
|---|---|---|---|---|
| ♠ Tell-BeamS | 304 | 0.79 | 10.4 | 12.7 |
| Tell+VR-RoBERTa | 310 | 0.82 | 9.42 | 13.5 |
| ♣ Vil-BeamS | 894 | 0.87 | 8.05 | 10.5 |
| Vil+VR-RoBERTa | 953 | 0.85 | 8.86 | 10.8 |
| ♣ Trans-BeamS | 935 | 0.86 | 7.44 | 9.62 |
| Trans+VR-BERT | 936 | 0.86 | 7.48 | 8.68 |

Table 2. The Uniq and WPC columns indicate the average number of unique and total words per caption, respectively. ♠ refers to the Flickr 1730 test set, and ♣ refers to the COCO Karpathy 5K test set.

Ablation Study. We performed an ablation study to investigate the effectiveness of each model. In the proposed architecture, each expert tries to learn a different representation, at the word or the sentence level. In this experiment, we trained each model separately, as shown in Table 3 (below). GloVe as a stand-alone expert performed better than the combined model (that is, combining the experts hurts accuracy). To investigate this further, we visualized each expert before the fusion layers, as shown in Figure 3.

| Model | B-4 | M | R | C | B-S |
|---|---|---|---|---|---|
| Trans-BeamS | 0.374 | 0.278 | 0.569 | 1.153 | 0.9399 |
| +VR-RoBERT-GloVe | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |
| +VR-BERT-GloVe | 0.370 | 0.371 | 0.278 | 1.149 | 0.9398 |
| +VR-RoBERT-BERT | 0.369 | 0.278 | 0.567 | 1.144 | 0.9395 |
| +VR_V1-GloVe | 0.371 | 0.278 | 0.568 | 1.148 | 0.9398 |
| +VR_V2-GloVe | 0.371 | 0.371 | 0.278 | 1.149 | 0.9398 |

Table 3. The colors correspond to Figure 3. The bottom of Figure 3 (♣) shows that BERT does not contribute to the final score as much as GloVe does.

Limitation. In contrast to CNN-LSTM (♠, top of Figure 3), where each expert contributes to the final decision, we observed that shorter captions (with less context) can negatively influence the BERT similarity score. As a result, GloVe dominates as the main expert, as shown in Figure 3 (♣, bottom).

## Conclusion

In this work, we introduced an approach that aims to overcome the limitations of beam search and avoid re-training for better accuracy. We proposed a combined word and sentence visual beam search re-ranker. However, we found that word and sentence similarity disagree with each other when the beam search is less diverse. Our experiments also highlight the usefulness of the model by showing successful cases.
