Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned

Ahmed Sabir
Universitat Politècnica de Catalunya, TALP Research Center


This paper focuses on enhancing the captions generated by image captioning systems. We propose an approach that improves caption generation systems by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model re-ranks the beam search output of the language generator from a visual context perspective. We employ a visual semantic measure at both the word and sentence levels to match the proper caption to the related information in the image. This approach can be applied to any captioning system as a post-processing step.


We introduce semantic relations between the visual context in the image and the caption at the word and sentence levels. We propose combining BERT [9] with GloVe [28] to capture visual semantic similarity. The main components of the proposed architecture are:

Word Level Model

To enable word-level semantics with GloVe, we extract keyphrases [24] from the caption, and we employ the confidence of the image classifier to convert the similarity into a probability [30].
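The word-level expert can be sketched as follows: the cosine similarity between the GloVe vector of a caption keyphrase and that of the visual-context concept, weighted by the classifier's confidence in that concept. This is a minimal illustration, not the paper's exact formula [30]; the function names and toy 3-d vectors are hypothetical (real GloVe vectors are 50-300 dimensional).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_level_score(keyphrase_vec, visual_vec, classifier_conf):
    """Similarity between a caption keyphrase and the visual-context
    concept, weighted by the visual classifier's confidence so the
    result behaves like a probability-style score in [0, 1]."""
    sim = cosine(keyphrase_vec, visual_vec)
    # map cosine from [-1, 1] to [0, 1] before weighting
    return 0.5 * (sim + 1.0) * classifier_conf

# Toy 3-d "GloVe" vectors for illustration only
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

score_related = word_level_score(puppy, dog, classifier_conf=0.9)
score_unrelated = word_level_score(car, dog, classifier_conf=0.9)
```

A caption keyphrase semantically close to the detected visual concept ("puppy" vs. "dog") receives a higher score than an unrelated one ("car"), and a low classifier confidence shrinks both scores.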

Sentence Level Model

We fine-tuned BERT on caption datasets (Flickr8k [13] for the less-data scenario, and COCO [18] for the standard models, e.g. the Transformer [8]), incorporating the top-k (k = 3) visual context information extracted from each image [11], where the target is the semantic relatedness between the visual context and the candidate captions.
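A plausible way to prepare such fine-tuning data is to pair each candidate caption with the top-k visual concepts of its image as a two-segment input for a BERT-style relatedness model. The sketch below is an assumption about the data format, not the paper's actual preprocessing pipeline; the function name and dictionary keys are hypothetical.

```python
def build_relatedness_examples(captions, visual_contexts, labels, k=3):
    """Pair each candidate caption with the top-k visual concepts of
    its image, formatted as a two-segment input (caption / visual
    context) with a semantic-relatedness target."""
    examples = []
    for caption, concepts, label in zip(captions, visual_contexts, labels):
        context = " ".join(concepts[:k])  # keep only the top-k concepts
        examples.append({"text_a": caption, "text_b": context, "label": label})
    return examples

examples = build_relatedness_examples(
    ["a dog runs on the grass"],
    [["dog", "grass", "ball", "tree"]],  # classifier output, ranked
    [1.0],                               # related caption
)
```

Each example would then be tokenized as `[CLS] text_a [SEP] text_b [SEP]` and the model trained to regress (or classify) the relatedness label.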

Fusion Layers

Inspired by Product of Experts [12], we merge the two experts through a late fusion layer. Since this work only aims to retrieve the candidate caption (with its visual context) that has the highest probability, the normalization step is unnecessary.
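The fusion and re-ranking step can be sketched as an unnormalized product of the two experts' scores: both the word-level and sentence-level experts must agree for a candidate to score highly, and because only the argmax over the beam is needed, no normalization over candidates is required. The function names and toy scores below are illustrative assumptions.

```python
def fuse_scores(word_score, sent_score):
    """Late fusion as an (unnormalized) product of experts: a candidate
    scores highly only if both experts assign it a high score."""
    return word_score * sent_score

def rerank(candidates):
    """Re-rank beam-search candidates by the fused visual-semantic
    score. Each candidate is (caption, word_score, sent_score); since
    we only take the argmax, normalizing over the beam is unnecessary."""
    return max(candidates, key=lambda c: fuse_scores(c[1], c[2]))

# Toy beam: the captioner's most likely hypothesis is listed first
beam = [
    ("a man riding a horse", 0.80, 0.70),
    ("a person on a brown horse", 0.90, 0.85),
]
best = rerank(beam)
```

Here the second hypothesis wins (0.90 x 0.85 > 0.80 x 0.70) even though the captioner ranked it lower, which is exactly the post-processing behavior the approach targets.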


We evaluate the proposed approach on two datasets of different sizes. The idea is to evaluate our approach with (1) a shallow CNN-LSTM model [32] (i.e. the less-data scenario), and (2) systems trained on a huge amount of data (VilBERT [21] and the Transformer [8]). Our re-ranker yielded mixed results: (+) it improves model accuracy; (−) it struggles when dealing with less diverse captions, e.g. the Transformer baseline.

Ablation Study

We performed an ablation study with our worst model (i.e. the Transformer baseline) to investigate the effectiveness of each expert. In this experiment, we trained each model separately. GloVe as a stand-alone expert performed better than the combined model (and thus, combining the experts hurts accuracy). To investigate this further, we visualized each expert before the fusion layer, as shown in the figure below. The figure and the table show that BERT does not contribute to the final score as much as GloVe. GloVe dominates as the deciding expert for the final score for two reasons: (1) short captions, and (2) a less diverse beam.

Model                   B-4    M      R      C      B-S
Transformer-BeamS [8]   0.374  0.278  0.569  1.153  0.9399
+VR-RoBERT-GloVe        0.370  0.277  0.567  1.145  0.9395
+VR-BERT-GloVe          0.370  0.371  0.278  1.149  0.9398
+VR-RoBERT-BERT         0.369  0.278  0.567  1.144  0.9395
+VR_V1-GloVe            0.371  0.278  0.568  1.148  0.9398
+VR_V2-GloVe            0.371  0.371  0.278  1.149  0.9398

Table 2: Ablation study. We analyze the contribution of each expert to the final accuracy. The values are for BERT and GloVe on the 5k COCO Karpathy test set, as visualized in the figure above. (B-4: BLEU-4, M: METEOR, R: ROUGE-L, C: CIDEr, B-S: BERTScore.)


@article{sabir2022word,
  title={Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned},
  author={Sabir, Ahmed},
  journal={arXiv preprint arXiv:2209.12817},
  year={2022}
}

Contact: Ahmed Sabir