Women Wearing Lipstick:
Measuring the Bias Between an Object and Its Related Gender

Ahmed Sabir and Lluís Padró
Universitat Politècnica de Catalunya, TALP Research Center


In this paper, we investigate the impact of objects on gender bias in image captioning systems. Our results show that only gender-specific objects have a strong gender bias (e.g. woman-lipstick). In addition, we propose a visual semantic-based gender score that measures the degree of bias and can be used as a plug-in for any image captioning system. Our experiments demonstrate the utility of the gender score: it measures the bias relation between a caption and its related gender, and can therefore be used as an additional metric alongside the existing Object Gender Co-Occ approach.


In this work, we propose two object-to-gender bias scores: (1) a direct Gender Score, and (2) a [MASK]-based Gender Score Estimation. For the direct score, the model uses the visual context to predict the degree of gender-object bias. Additionally, inspired by the Masked Language Model, the model can estimate the masked gender using the relation between the caption and the object information from the image.

For a quick start, please have a look at the demo.

Proposed Approach

The direct Gender Score and the Gender Estimation Score rely on the visual information to predict the related gender or degree of bias as follows:

Gender Score:

Visual Classifier: CLIP
    visual_context_label = 'motor scooter'
    visual_context_prob = 0.2183
Gender Score: man/woman
    a ____ riding a motorcycle on a road - object bias ratio to-man -> 53.17%
    a ____ riding a motorcycle on a road - object bias ratio to-woman -> 46.82%
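The example above reports one bias ratio per gender. As a minimal sketch of how such a pair of ratios could be produced, the Python below fills the blank with each gender, revises a caption probability with the visual context, and normalizes the two scores so they sum to one. The revision formula is one common form of belief revision and is an assumption here, since this page does not spell out the exact computation. The function names, the caption probabilities, and the similarity values are all hypothetical; only visual_context_prob = 0.2183 is taken from the example.

```python
def revised_score(caption_prob, similarity, visual_prob):
    """Belief-revision style update (assumed form, not confirmed by this page).

    P(w | c) = P(w) ** alpha, with
    alpha = ((1 - sim) / (1 + sim)) ** (1 - P(c))

    caption_prob: language-model probability of the caption with a gender filled in
    similarity:   semantic similarity between that caption and the visual context, in [0, 1)
    visual_prob:  classifier confidence for the visual context label
    """
    alpha = ((1.0 - similarity) / (1.0 + similarity)) ** (1.0 - visual_prob)
    return caption_prob ** alpha


def bias_ratio(score_man, score_woman):
    """Normalize two per-gender scores into ratios that sum to 1."""
    total = score_man + score_woman
    return score_man / total, score_woman / total


# Hypothetical inputs for "a ____ riding a motorcycle on a road".
visual_context_prob = 0.2183  # from the example above
score_man = revised_score(caption_prob=0.10, similarity=0.50, visual_prob=visual_context_prob)
score_woman = revised_score(caption_prob=0.10, similarity=0.40, visual_prob=visual_context_prob)

to_man, to_woman = bias_ratio(score_man, score_woman)
print(f"object bias ratio to-man   -> {to_man:.2%}")
print(f"object bias ratio to-woman -> {to_woman:.2%}")
```

Under this assumed form, a caption that is more similar to the visual context receives a smaller exponent and therefore a larger revised score, which is what shifts the ratio toward one gender.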

Gender Score Estimation:

Visual Classifier: CLIP
    visual_context_label = 'joystick'
    visual_context_prob = 0.2732
Gender Score Estimation: MASK
    a [MASK] playing a video game in a living room (object bias based prediction: man)
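The estimation step above can be read as: fill [MASK] with each candidate gender, score each filled caption against the visual context, and keep the highest-scoring candidate. The Python below is a minimal sketch of that selection logic only. The function estimate_masked_gender and the toy scorer are hypothetical, and the scores in the lookup table are made-up placeholders, not results from the paper; in practice the scorer would be a visually grounded gender score.

```python
def estimate_masked_gender(caption, score_fn, candidates=("man", "woman")):
    """Return the candidate whose filled-in caption scores highest, plus all scores.

    caption:  a caption containing the literal token "[MASK]"
    score_fn: hypothetical callable mapping a filled caption to a float score
    """
    if "[MASK]" not in caption:
        raise ValueError("caption must contain a [MASK] token")
    scores = {g: score_fn(caption.replace("[MASK]", g)) for g in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Toy scorer: a lookup table with made-up scores, standing in for a real
# visually grounded score conditioned on the 'joystick' context.
toy_scores = {
    "a man playing a video game in a living room": 0.61,
    "a woman playing a video game in a living room": 0.39,
}

prediction, scores = estimate_masked_gender(
    "a [MASK] playing a video game in a living room",
    score_fn=toy_scores.get,
)
print(prediction, scores)
```

Passing the scorer in as a parameter keeps the selection logic independent of whichever language model or visual classifier produces the scores.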

Experiments & Results

Comparison of results (i.e. baseline gender output) between Object Gender Co-Occ and our Gender Score Estimation on the Karpathy split. The proposed score measures gender bias more accurately, particularly when there is a strong gender-to-object relation.

Qualitative results

The proposed score uses the correlation between the visual context and its related gender. As shown in the (Left) figure, there is an equal distribution of object-gender pairs (man and woman), which indicates that not all objects have a strong bias toward a specific gender. The (Right) figure shows examples of Gender Score Estimation and Gender Object Distance via Cosine Distance. The results show that (Top) the score balances the bias when men and women have a similar bias toward an object (e.g. sport tennis), and (Bottom) for objects with a strong bias toward men (paddle, surfboard), the model adjusts the bias toward women while preserving the object-gender bias.
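For readers unfamiliar with the Gender Object Distance baseline, the sketch below shows a plain cosine distance between an object vector and each gender vector. It is an illustration of the distance itself, not the authors' implementation: the helper names are hypothetical and the three-dimensional vectors are made-up stand-ins for real word or sentence embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def gender_object_distance(object_vec, gender_vecs):
    """Distance from one object vector to each gender vector (smaller = closer)."""
    return {gender: cosine_distance(object_vec, vec) for gender, vec in gender_vecs.items()}


# Made-up toy embeddings; real ones would come from a pre-trained encoder.
gender_vecs = {"man": [0.9, 0.1, 0.3], "woman": [0.2, 0.8, 0.4]}
surfboard = [0.8, 0.2, 0.3]

distances = gender_object_distance(surfboard, gender_vecs)
closest = min(distances, key=distances.get)
print(distances, closest)
```

Unlike the Gender Score, this distance depends only on the object and gender embeddings, so it ignores the caption and the confidence of the visual classifier.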


The table below shows that our score yields bias ratios similar to those of the existing Object Gender Co-Occ approach for the objects most biased toward men. Note that TraCLIPS-Reward (CLIPS+CIDEr) inherits biases from RL-CLIPS, which results in distinct gender predictions, and that it generates captions without a specific gender (e.g. person, baseball player).

Comparison against GPT-2 and Cosine Distance Score

Comparison of the Gender Score bias (toward women or men) on the test set between two pre-trained models that differ in training dataset size: BLIP (129M, unsupervised) and VilBERT (3.5M). Our proposed Gender Score, based on visual bias likelihood revision, also known as Belief Revision (BR), balances the amplified bias: the bigger the model, the more amplified the bias against men or women.

Case study

We also apply our proposed gender score to more general tasks, such as short text from Twitter, using a subset of the Twitter user gender classification dataset. We use a BERT-based keyword extractor to extract the biased context from each sentence (e.g. travel-man, woman-family), and then employ the cloze probability to estimate the probability of that context. We observe that some keywords have a strong bias: women are associated with keywords such as novel, beauty, and hometown, while men are more frequently related to words such as gaming, coffee, and inspiration. The table below highlights (in red) the cases where the Gender Score disagrees with the human estimation confidence due to gender-context bias.


title={Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender},
author={Sabir, Ahmed and Padr{\'o}, Llu{\'\i}s},
journal={arXiv preprint arXiv:2310.19130},

Contact: Ahmed Sabir