Visual Semantic Relatedness Dataset for Image Captioning

Ahmed Sabir, Francesc Moreno-Noguer, Lluís Padró


Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for captioning, where the publicly available dataset COCO Captions (Lin et al., 2014) has been extended with information about the scene (such as objects in the image). Since this information has textual form, it can be used to leverage any NLP task, such as text similarity or semantic relation methods, in captioning systems, either as an end-to-end training strategy or as a post-processing based approach.


We enrich COCO Captions with textual visual context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each image. We apply three filtering approaches to ensure the quality of the dataset: (1) a confidence threshold, to filter out predictions where the object classifier is not confident enough; (2) semantic alignment via semantic similarity, to remove duplicated objects; and (3) a semantic relatedness score used as a soft label, to guarantee that the visual context and the caption are strongly related. In particular, we use Sentence-RoBERTa with cosine similarity to produce a soft score, and then apply a threshold to annotate the final binary label (1 if the score ≥ th, with th ∈ {0.2, 0.3, 0.4}, and 0 otherwise). Finally, to take advantage of the lexical overlap between caption and visual context, and to extract global information, we use BERT followed by a shallow CNN (Kim, 2014) to estimate the visual relatedness score.
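The soft-label step above can be sketched as follows. This is a minimal illustration using plain cosine similarity over toy vectors; in the dataset itself the embeddings come from Sentence-RoBERTa, and `soft_label` and the toy vectors are hypothetical names used only for this sketch.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_label(context_emb, caption_emb, th=0.2):
    # Soft score from cosine similarity; the final annotation is 1
    # when the score reaches the threshold th (0.2, 0.3, or 0.4), else 0.
    score = cosine_similarity(context_emb, caption_emb)
    return score, int(score >= th)

# Toy vectors standing in for sentence embeddings of the visual
# context and the caption.
ctx = [0.9, 0.1, 0.3]
cap = [0.8, 0.2, 0.4]
score, label = soft_label(ctx, cap, th=0.4)
```

With real Sentence-RoBERTa embeddings, `ctx` and `cap` would be replaced by the encoded visual context string and caption string.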

For a quick start, please have a look at the demo.

Resulting Dataset and Proposed Approach

We also propose a data filtering strategy and a visual semantic model to estimate the degree of relatedness between a caption and its associated visual context in the image.

Training data (COCO)

            visual context                           | caption description
            umbrella, dress, human face              | a woman with an umbrella near the sea.
            bathtub, tub                             | this is a bathroom with a jacuzzi shower sink and toilet.
            snowplow, shovel                         | the fire hydrant is partially buried under the snow.
            desktop computer, monitor                | a computer with a flower as its background sits on a desk.
            pitcher, ballplayer                      | a baseball player preparing to throw the ball.
            groom, restaurant                        | a black and white picture of a centerpiece to a table at a wedding.

            visual context                           | caption description                                            | overlapping information
            pole, streetsign, flagpole               | a house that has a pole with a sign on it                      | {'pole'}
            stove, microwave, refrigerator           | an older stove sits in the kitchen next to a bottle of cleaner | {'stove'}
            racket, tennis ball, ballplayer          | a tennis player swinging a racket at a ball                    | {'tennis', 'racket', 'ball'}
            grocery store, dining table, restaurant  | a table is full of different kinds of food and drinks          | {'table'}
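The overlapping-information column above can be reproduced with a simple token intersection between the visual context and the caption. The function name `visual_overlap` below is hypothetical; this is a minimal sketch of the idea, not the exact extraction code:

```python
def visual_overlap(visual_context, caption):
    # Lowercase both sides, strip sentence punctuation, tokenize on
    # whitespace, and return the shared tokens (the third column above).
    strip = str.maketrans("", "", ".,")
    ctx_tokens = set(visual_context.lower().translate(strip).split())
    cap_tokens = set(caption.lower().translate(strip).split())
    return ctx_tokens & cap_tokens

# Example row from the table above.
overlap = visual_overlap(
    "racket tennis ball ballplayer",
    "a tennis player swinging a racket at a ball",
)
# overlap == {'tennis', 'racket', 'ball'}
```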
Gender-neutral COCO

            visual context | caption description
            pizza          | a person cutting a pizza with a fork and knife.
            suit           | a person in a suit and tie sitting with his hands between his legs.
            paddle         | a person riding a colorful surfboard in the water.
            ballplayer     | a young person in a batting stance in a baseball game.
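One way to obtain gender-neutral captions like those above is a simple word-level substitution. The mapping and function below are a hypothetical sketch for illustration, not the exact procedure used to build the released dataset:

```python
import re

# Hypothetical substitution map; a real pipeline would likely use a
# larger vocabulary and more careful rewriting.
GENDERED = {
    "man": "person", "woman": "person",
    "men": "people", "women": "people",
    "boy": "person", "girl": "person",
}

def neutralize(caption):
    # Replace gendered words token-by-token, preserving the rest of
    # the caption. \b ensures whole-word matches only.
    def repl(match):
        return GENDERED[match.group(0).lower()]
    pattern = r"\b(" + "|".join(GENDERED) + r")\b"
    return re.sub(pattern, repl, caption, flags=re.IGNORECASE)

neutralize("a man cutting a pizza with a fork and knife")
# → "a person cutting a pizza with a fork and knife"
```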


Citation:

@article{sabir2023visual,
  title={Visual Semantic Relatedness Dataset for Image Captioning},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2301.08784},
  year={2023}
}
Contact: Ahmed Sabir