Visual Semantic Relatedness Dataset for Image Captioning

Ahmed Sabir1, Francesc Moreno-Noguer2, Lluís Padró1
Universitat Politècnica de Catalunya, TALP Research Center1
Institut de Robòtica i Informàtica Industrial, CSIC-UPC2


Modern image captioning system relies heavily on extracting knowledge from images to capture the concept of a static story. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available dataset COCO Captions (Lin et al., 2014) has been extended with information about the scene (such as objects in the image). Since this information has a textual form, it can be used to leverage any NLP task, such as text similarity or semantic relation methods, into captioning systems, either as an end-to-end training strategy or a post-processing based approach.


We enrich COCO Captions with textual Visual Context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each image. We use three filter approaches to ensure the quality of the dataset (1) Threshold: to filter out predictions where the object classifier is not confident enough, and (2) semantic alignment with semantic similarity to remove duplicated objects. (3) semantic relatedness score as soft-label: to guarantee the visual context and caption have a strong relation. In particular, we use Sentence-RoBERTa via cosine similarity to give a soft score, and then we use a threshold to annotate the final label (if th ≥ 0.2, 0.3, 0.4 then 1,0).

For quick start please have a look at the demo

Proposed Approach

We also propose a strategy to estimate the most closely related/not-related visual concepts using the caption description via BERT-CNN.

Resulting Dataset

We rely on COCO-Captions dataset to extract the visual context, and we propose two visual context datasets:


            visual context, caption descriptions
            umbrella dress human face, a woman with an umbrella near the sea.
            bathtub tub, this is a bathroom with a jacuzzi shower sink and toilet.
            snowplow shovel, the fire hydrant is partially buried under the snow.
            desktop computer monitor, a computer with a flower as its background sits on a desk.
            pitcher ballplayer, a baseball player preparing to throw the ball.
            groom restaurant, a black and white picture of a centerpiece to a table at a wedding.
            visual context, caption descriptions, overlapping information
            pole streetsign flagpole, a house that has a pole with a sign on it,{'pole'}.
            stove microwave refrigerator, an older stove sits in the kitchen next to a bottle of cleaner,{'stove'}.
            racket tennis ball ballplayer, a tennis player swinging a racket at a ball,{'tennis', 'racket', 'ball'}.
            grocery store dining table restaurant, a table is full of different kinds of food and drinks,{'table'}.

Another task that can benefit from the proposed dataset is investigating the contribution of the visual context to gender bias. We also propose gender neutral dataset.

Gender NeutralCOCO
            visual context, caption descriptions
            pizza, a person cutting a pizza with a fork and knife.
            suit, a person in a suit and tie sitting with his hands between his legs.
            paddle, a person riding a colorful surfboard in the water.
            ballplayer, a young person in a batting stance in a baseball game.
Visual + person + man + woman m w to-m
Clothing 3950 3360 1490 .85 .37 .69
footwear 2810 1720 220 .61 .07 .88
racket 1360 440 150 .32 .11 .74
surfboard 820 80 10 .09 .01 .88
tennis 140 200 60 1.4 .42 .76
motorcycle 480 40 20 .08 .04 .66
Frequency count of object + gender in the proposed dataset.

The dataset, in most cases, has more gender-neutral person than gender bias toward men or women. However, the dataset is similar to COCO, a gender bias dataset to-ward men.


  title={Visual Semantic Relatedness Dataset for Image Captioning},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},

Contact: Ahmed Sabir