Belief Revision based Caption Re-ranker with Visual Semantic Information
Ahmed Sabir¹, Francesc Moreno-Noguer², Pranava Madhyastha³, Lluís Padró¹
¹ Universitat Politècnica de Catalunya, TALP Research Center
² Institut de Robòtica i Informàtica Industrial, CSIC-UPC
³ City, University of London

In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the caption that best captures the visual information in the image. Our re-ranker uses the Belief Revision framework (Blok et al., 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting the semantic relatedness between the candidate caption and the visual context. Our experiments demonstrate the utility of our approach: the re-ranker can enhance the performance of a typical image-captioning system without any additional training or fine-tuning.
Belief revision is a conditional probability model which assumes that a preliminary probability is revised to the extent warranted by the evidence for the hypothesis:

\[\text{P}(w \mid c)= \text{P}(w)^{\alpha}\]

The main components of hypothesis revision, used here as a visual-semantic caption re-ranker, are:
1. Hypothesis \(\text{P}(w)\): the caption candidates from beam search, with probability initialized by common observation (i.e. a language model).
2. Informativeness \(1-\text{P}(c)\): the informativeness of the visual context extracted from the image.
3. Similarity \(\alpha=\left[\frac{1 - \text{sim}(w, c)}{1+\text{sim}(w, c)}\right]^{1-\text{P}(c)}\): the relatedness between the two concepts, (1) the visual context \(c\) from the image and (2) the hypothesis \(w\), i.e. the candidate caption, conditioned on the degree of informativeness of the visual information.

A Hugging Face demo illustrates the visual hypothesis revision.
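The revision formula can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `belief_revision` and the input checks are hypothetical, and the language-model probability and similarity score in the usage example are made-up placeholder values. Only the visual-context confidence 0.49095494 is taken from this page.

```python
def belief_revision(p_w: float, p_c: float, sim: float) -> float:
    """Return the revised probability P(w|c) = P(w)^alpha.

    p_w : original caption probability P(w), e.g. from a language model
    p_c : probability P(c) of the visual context, e.g. a classifier confidence
    sim : semantic relatedness sim(w, c), assumed here to lie in [0, 1]
    """
    if not 0.0 < p_w <= 1.0:
        raise ValueError("p_w must be in (0, 1]")
    if not 0.0 <= p_c <= 1.0:
        raise ValueError("p_c must be in [0, 1]")
    if not 0.0 <= sim <= 1.0:
        raise ValueError("sim must be in [0, 1]")

    # alpha = [(1 - sim) / (1 + sim)] ^ (1 - P(c))
    alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - p_c)
    return p_w ** alpha


if __name__ == "__main__":
    # p_w and sim are illustrative placeholders; p_c is the 'ox' confidence
    # reported in the example on this page.
    revised = belief_revision(p_w=0.2, p_c=0.49095494, sim=0.5)
    print(f"revised probability: {revised:.4f}")
```

With `sim = 0` the exponent is 1 and the original probability is returned unchanged; as `sim` approaches 1 the exponent approaches 0 and the revised probability approaches 1, so a caption that is closely related to the visual context is promoted.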
Example
In this example, we extract the top-20 beam search candidates from a state-of-the-art Transformer-based captioning model and re-rank them with Visual Belief Revision.
a longhorn cow with horns standing in a field
two bulls standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
a couple of bulls standing next to each other
a couple of bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
two long horn bulls standing next to each other
Visual Context (ResNet/CLIP)
COCO_val2014_000000235692.jpg [('ox', 0.49095494)]
Visual Belief Revision re-ranking
two bulls standing next to each other 0.31941289259462063
a couple of bulls standing next to each other 0.2858426977047663
two bulls with horns standing next to each other 0.26350009525262974
two long horn bulls standing next to each other 0.24074783064577798
a longhorn cow with horns standing in a field 0.03975113398536263
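The re-ranking step itself can be sketched as follows. This is a hedged illustration, not the demo's code: the `rerank` helper is a hypothetical name, and collapsing repeated beam hypotheses before scoring is an assumption inferred from the five unique captions in the output above. The scores are the revised values listed above, with the last one read as 0.03975113398536263 (also an assumption, since it appears garbled on the page).

```python
from typing import Callable, Iterable, List, Tuple


def rerank(captions: Iterable[str],
           score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Deduplicate beam hypotheses and sort them by revised score, best first."""
    # dict.fromkeys drops repeated captions while keeping first-seen order.
    unique = list(dict.fromkeys(captions))
    scored = [(caption, score(caption)) for caption in unique]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The top-20 beam from the example, written compactly.
beam = (
    ["a longhorn cow with horns standing in a field"]
    + ["two bulls standing next to each other"]
    + ["two bulls with horns standing next to each other"] * 8
    + ["a couple of bulls standing next to each other"] * 2
    + ["two long horn bulls standing next to each other"] * 8
)

# Revised scores as reported in the example; in practice these would be
# computed from the language model, the similarity measure, and P(c).
reported = {
    "two bulls standing next to each other": 0.31941289259462063,
    "a couple of bulls standing next to each other": 0.2858426977047663,
    "two bulls with horns standing next to each other": 0.26350009525262974,
    "two long horn bulls standing next to each other": 0.24074783064577798,
    "a longhorn cow with horns standing in a field": 0.03975113398536263,
}

ranking = rerank(beam, reported.__getitem__)

if __name__ == "__main__":
    for caption, value in ranking:
        print(f"{value:.4f}  {caption}")
```

Passing the scorer as a callable keeps the sorting logic independent of how the revised probability is obtained.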
@article{sabir2022belief,
  title={Belief Revision based Caption Re-ranker with Visual Semantic Information},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Madhyastha, Pranava and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2209.08163},
  year={2022}
}
Contact: Ahmed Sabir