Belief Revision based Caption Re-ranker with Visual Semantic Information

Abstract

In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the caption that best captures the visual information in the image. Our re-ranker uses the Belief Revision framework (Blok et al. 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting the semantic relatedness between the depicted caption and the visual context. Our experiments demonstrate the utility of our approach: the re-ranker can enhance the performance of a typical image-captioning system without any additional training or fine-tuning.

Visual Re-ranking with Belief Revision

Belief Revision is a conditional probability model which assumes that a preliminary probability is revised to the extent warranted by the evidence supporting the hypothesis:

\(\text{P}(w \mid c)= \text{P}(w)^{\alpha}\)

where the main components of hypothesis revision, used here as a visual-semantic caption re-ranker, are:

1. Hypothesis \(\text{P}(w)\): the caption candidates from beam search, initialized by common observation (i.e. a language model).

2. Informativeness \(1-\text{P}(c)\) of the visual context from the image.

3. Similarity \(\alpha=\left[\frac{1 - \text{sim}(w, c)}{1+\text{sim}(w, c)}\right]^{1-\text{P}(c)}\): the relatedness between the two concepts, (1) the visual context \(c\) from the image and (2) the hypothesis, i.e. the candidate caption \(w\), conditioned on the degree of informativeness of the visual information.
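The revision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name `belief_revision` and its argument names are hypothetical, and it assumes that `sim` has already been computed by some similarity model and lies in [0, 1].

```python
def belief_revision(p_w: float, sim: float, p_c: float) -> float:
    """Return the revised probability P(w | c) = P(w) ** alpha.

    p_w : language-model probability of the caption (hypothesis), in (0, 1]
    sim : similarity between caption and visual context, assumed in [0, 1]
    p_c : classifier confidence of the visual context, in [0, 1]
    """
    if not 0.0 < p_w <= 1.0:
        raise ValueError("p_w must be in (0, 1]")
    if not 0.0 <= sim <= 1.0:
        raise ValueError("sim must be in [0, 1] (assumption of this sketch)")
    if not 0.0 <= p_c <= 1.0:
        raise ValueError("p_c must be in [0, 1]")

    # alpha = [(1 - sim) / (1 + sim)] ** (1 - P(c))
    alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - p_c)
    return p_w ** alpha
```

Under these assumptions, `sim = 0` gives `alpha = 1` and leaves the hypothesis unchanged, while a larger `sim` gives a smaller `alpha` and therefore a larger revised probability.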

Here is a Hugging Face demo that shows the visual hypothesis revision.

Example

In this example, we extract the top-20 beam search candidates from a SOTA caption transformer and re-rank them with Visual Belief Revision.

Beam Search_baseline

            a longhorn cow with horns standing in a field
            two bulls standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            two bulls with horns standing next to each other	 
            a couple of bulls standing next to each other	 
            a couple of bulls standing next to each other	 
            two long horn bulls standing next to each other	 
            two long horn bulls standing next to each other	 
            two long horn bulls standing next to each other	 
            two long horn bulls standing next to each other
            two long horn bulls standing next to each other	
            two long horn bulls standing next to each other
            two long horn bulls standing next to each other
            two long horn bulls standing next to each other

Visual Context_ResNet/CLIP

            COCO_val2014_000000235692.jpg [('ox', 0.49095494)]

Visual Belief Revision_re-ranking

            two bulls standing next to each other 0.31941289259462063
            a couple of bulls standing next to each other 0.2858426977047663
            two bulls with horns standing next to each other 0.26350009525262974
            two long horn bulls standing next to each other 0.24074783064577798
            a longhorn cow with horns standing in a field 0.03975113398536263
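The re-ranking step in this example could be sketched as follows. This is an illustrative sketch under stated assumptions, not the code that produced the scores above: the function name `rerank` is hypothetical, the per-caption language-model probabilities and similarity values in the usage example are made-up placeholders, and collapsing duplicate beams into one entry is an assumption based on the re-ranked list containing each distinct caption once.

```python
def rerank(candidates, p_c):
    """Re-rank caption candidates with the belief revision score.

    candidates : iterable of (caption, p_w, sim) tuples, where p_w is the
                 language-model probability and sim the caption/context
                 similarity (both assumed to be precomputed elsewhere)
    p_c        : confidence of the visual context, e.g. 0.49 for 'ox'
    Returns a list of (caption, score) sorted from best to worst.
    """
    best = {}
    for caption, p_w, sim in candidates:
        caption = caption.strip()  # beam output may carry trailing tabs
        alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - p_c)
        score = p_w ** alpha
        # Keep one entry per distinct caption (assumption of this sketch).
        if caption not in best or score > best[caption]:
            best[caption] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: the p_w and sim values are placeholders only.
beams = [
    ("a longhorn cow with horns standing in a field", 0.02, 0.10),
    ("two bulls standing next to each other\t", 0.05, 0.45),
    ("two bulls standing next to each other", 0.05, 0.45),
]
for caption, score in rerank(beams, p_c=0.49095494):
    print(f"{caption} {score:.4f}")
```

The sketch sorts in descending order of the revised score, so the caption most strongly supported by the visual context comes first.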

@article{sabir2022belief,
  title={Belief Revision based Caption Re-ranker with Visual Semantic Information},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Madhyastha, Pranava and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2209.08163},
  year={2022}
}