The Confidence Trap: Gender Bias and Predictive Certainty in LLMs

Ahmed Sabir1, Markus Kängsepp1, Rajesh Sharma1,2

1 Institute of Computer Science, University of Tartu, Estonia
2 School of AI and CS, Plaksha University, India

Abstract

The increasing use of Large Language Models (LLMs) in sensitive domains has led to growing interest in how their confidence scores relate to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, we investigate probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate whether calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among six state-of-the-art models, Gemma-2 demonstrates the worst calibration on the gender bias benchmarks. The primary contribution of this work is a fairness-aware evaluation of LLMs’ confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in pronoun resolution tasks.

Gender-ECE

ECE, MacroCE, ICE, and the Brier score are commonly used to evaluate calibration in LLMs. These metrics typically evaluate binary prediction outcomes; in our case, the two outcomes are sentence completions with a male or a female pronoun. However, aggregate calibration measures such as ECE do not reveal how the model behaves differently with respect to these two outcomes, i.e., male versus female pronouns. To address this limitation, we propose Gender-ECE (a gender-aware group ECE), a metric that makes calibration disparities across gendered pronouns explicit.

$$ \text{Gender-ECE} = \frac{1}{2} (\text{ECE}_{\text{male}} + \text{ECE}_{\text{female}}) $$

$$ \text{ECE}_{g} = \sum_{m=1}^{M} \frac{|B_m^{g}|}{n_g} \left| \text{acc}(B_m^{g}) - \text{conf}(B_m^{g}) \right|, \quad g \in \{\text{male}, \text{female}\} $$

where $B_m^{g}$ is the $m$-th confidence bin restricted to examples whose target pronoun belongs to group $g$, and $n_g$ is the number of such examples.
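For concreteness, below is a minimal NumPy sketch of these two equations (an illustration, not the authors' released code). It assumes per-example confidences, correctness indicators, and gender labels, and normalizes each group's ECE by the size of that group.

import numpy as np

def ece(confidences, correct, n_bins=10):
    """Standard ECE with M equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence to a bin index in {0, ..., n_bins - 1}.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if not mask.any():
            continue
        acc = correct[mask].mean()         # acc(B_m)
        conf = confidences[mask].mean()    # conf(B_m)
        err += mask.mean() * abs(acc - conf)   # |B_m| / n * |acc(B_m) - conf(B_m)|
    return err

def gender_ece(confidences, correct, genders, n_bins=10):
    """Average of the per-group ECEs for the male and female pronoun groups."""
    genders = np.asarray(genders)
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    per_group = {g: ece(confidences[genders == g], correct[genders == g], n_bins)
                 for g in ("male", "female")}
    return 0.5 * (per_group["male"] + per_group["female"]), per_group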

Experiment: Last Cloze Task

The primary focus of this analysis is to assess the models’ bias and confidence when predicting a pronoun at the end of a sentence, using the GenderLex dataset in a last cloze setting: the model has access to the full sentence context before scoring a gendered pronoun.
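To illustrate how such confidences can be obtained, the sketch below scores both pronoun completions of a GenderLex-style template with a Hugging Face causal LM and normalizes the two sentence log-probabilities into a confidence. The model name, template, and scoring protocol are illustrative assumptions, not necessarily the exact setup used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates GPT-J, LLAMA-3.1, Gemma-2, etc.
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sentence_logprob(text):
    """Total log-probability the model assigns to a sentence (first token excluded)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean next-token NLL over the seq_len - 1 scored positions.
    return -out.loss.item() * (ids.shape[1] - 1)

# Illustrative last-cloze template: the gendered pronoun is the final word,
# so the model conditions on the full sentence context before scoring it.
template = "The developer argued with the designer because the final decision belonged to {}."
lp_male = sentence_logprob(template.format("him"))
lp_female = sentence_logprob(template.format("her"))

# Confidence = softmax over the two candidate completions.
probs = torch.softmax(torch.tensor([lp_male, lp_female]), dim=0)
print({"him": probs[0].item(), "her": probs[1].item()})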

Model          ECE    MacroCE  ICE    Brier  Gender-ECE  ECE (M)  ECE (F)  Human
GPT-J-6B       0.076  0.453    0.374  0.432  0.076       0.085    0.066    0.715
LLAMA-3.1-8B   0.111  0.466    0.371  0.446  0.111       0.112    0.109    0.727
Gemma-2-9B     0.327  0.493    0.390  0.559  0.267       0.330    0.204    0.617
Qwen2.5-7B     0.106  0.476    0.422  0.385  0.107       0.052    0.162    0.637
Falcon-3-7B    0.161  0.491    0.449  0.356  0.149       0.081    0.217    0.605
DeepSeek-8B    0.085  0.461    0.369  0.470  0.090       0.074    0.106    0.686
Calibration histograms for different models

Finding: GPT-J-6B exhibits the best calibration (lowest ECE), while Gemma-2-9B performs the worst overall, producing consistently miscalibrated predictions with a pronounced disparity toward the female group.

Experiment: Coreference Resolution

The main objective is to evaluate model confidence in coreference resolution on the WinoBias dataset by measuring which antecedent the model prefers when resolving a gendered pronoun.
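One prompt-based way to elicit such a preference (an illustrative protocol with an assumed prompt, not necessarily the paper's exact procedure) is to append a question about the pronoun's referent to a WinoBias sentence and compare the probability the model assigns to each candidate occupation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the evaluated LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def continuation_logprob(prompt, continuation):
    """Log-probability of `continuation` given `prompt`; assumes the prompt
    tokenization is a prefix of the full tokenization (true here for GPT-2's BPE)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Each continuation token is predicted from the position just before it.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

# WinoBias-style sentence with two candidate antecedents for the pronoun.
sentence = "The physician hired the secretary because he was overwhelmed with clients."
prompt = sentence + ' In this sentence, "he" refers to the'
candidates = [" physician", " secretary"]
scores = torch.tensor([continuation_logprob(prompt, c) for c in candidates])
probs = torch.softmax(scores, dim=0)  # model's confidence over the two referents
print(dict(zip(candidates, probs.tolist())))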

Model          ECE    MacroCE  ICE    Brier  Gender-ECE  ECE (M)  ECE (F)  Human
GPT-J-6B       0.157  0.444    0.356  0.481  0.164       0.150    0.179    0.686
LLAMA-3.1-8B   0.193  0.460    0.377  0.460  0.214       0.179    0.249    0.662
Gemma-2-9B     0.429  0.490    0.482  0.467  0.297       0.438    0.156    0.509
Qwen2.5-7B     0.234  0.442    0.362  0.510  0.190       0.259    0.121    0.630
Falcon-3-7B    0.154  0.452    0.357  0.487  0.149       0.160    0.138    0.684
DeepSeek-8B    0.240  0.478    0.382  0.496  0.218       0.255    0.182    0.648
Calibration histograms for the coreference resolution task

Finding: Gemma-2-9B exhibits the worst calibration overall and a preference for male pronouns. GPT-J-6B and Falcon-3-7B are the fairest models, with the lowest Gender-ECE and minimal differences between genders.

Experiment: Gender-Neutral

We investigate model behavior under gender-neutral conditions by replacing occupation titles in the GenderLex dataset with gender-neutral terms such as person and someone.
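As a sketch of this manipulation (with an illustrative template and occupation list, not the actual GenderLex items), each occupation-based prompt is paired with a "person" and a "someone" variant, and all three variants are then scored exactly as in the last cloze setup:

# Illustrative sketch of the gender-neutral manipulation: replace the occupation
# in a GenderLex-style template with "person" / "someone" and keep everything else
# identical, so the three conditions differ only in the subject term.
occupations = ["doctor", "nurse", "developer"]  # illustrative subset, not the real lexicon

def variants(occupation):
    base = "The {subject} finished the report early, so the manager praised {pronoun}."
    return {
        "occupation": base.format(subject=occupation, pronoun="{pronoun}"),
        "person":     base.format(subject="person", pronoun="{pronoun}"),
        "someone":    "Someone finished the report early, so the manager praised {pronoun}.",
    }

for occ in occupations:
    for condition, template in variants(occ).items():
        # Each variant is then scored with "him" and "her" as the final cloze word.
        print(condition, "->", template.format(pronoun="him"))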

Metric           GPT-J-6B               LLAMA-3.1-8B           DeepSeek-R1-8B         Gemma-2-9B
                 Someone Person Occ     Someone Person Occ     Someone Person Occ     Someone Person Occ
ECE              0.144   0.063  0.076   0.134   0.138  0.111   0.139   0.130  0.085   0.364   0.367  0.327
MacroCE          0.484   0.476  0.453   0.483   0.478  0.466   0.482   0.481  0.461   0.493   0.493  0.494
ICE              0.454   0.442  0.374   0.436   0.445  0.371   0.452   0.443  0.369   0.397   0.393  0.390
Brier Score      0.331   0.341  0.432   0.358   0.348  0.446   0.352   0.361  0.470   0.560   0.581  0.574
Group-ECE        0.138   0.077  0.076   0.132   0.111  0.138   0.138   0.137  0.097   0.450   0.351  0.267
  + Male         0.106   0.031  0.085   0.115   0.071  0.112   0.048   0.065  0.074   0.363   0.367  0.330
  + Female       0.170   0.122  0.066   0.148   0.109  0.109   0.228   0.210  0.216   0.536   0.335  0.204
Human alignment  0.598   0.616  0.715   0.638   0.598  0.727   0.578   0.596  0.687   0.603   0.606  0.618
Calibration histograms for gender-neutral prompts (someone/person/occupation)

Finding: Explicit role titles improve model calibration, whereas gender-neutral terms increase calibration error due to ambiguity.

Calibration

Our findings show that most models are poorly calibrated, remaining over-confident despite frequent prediction errors. We therefore apply Beta calibration (Kull et al., 2017) using a balanced calibration set (50:50), which reduces ECE and better aligns predicted confidences with empirical accuracy.
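Beta calibration can be fitted as a logistic regression on log-transformed confidence scores (the "abm" parameterization of Kull et al., 2017). Below is a minimal scikit-learn sketch on toy data, assuming a held-out calibration split of (confidence, correctness) pairs; the original method additionally handles non-negativity constraints on the coefficients, which this sketch omits.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(conf_cal, correct_cal, eps=1e-6):
    """Fit the beta calibration map mu(s) = sigmoid(a*ln(s) - b*ln(1-s) + c)."""
    s = np.clip(np.asarray(conf_cal, dtype=float), eps, 1 - eps)
    X = np.column_stack([np.log(s), -np.log(1.0 - s)])   # 'abm' features
    lr = LogisticRegression(C=1e12)                      # effectively unregularized
    lr.fit(X, np.asarray(correct_cal, dtype=int))

    def calibrate(conf):
        s = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
        return lr.predict_proba(np.column_stack([np.log(s), -np.log(1.0 - s)]))[:, 1]

    return calibrate

# Toy usage: fit on a held-out calibration split of over-confident scores, then
# rescale test-set confidences before recomputing ECE / Gender-ECE.
rng = np.random.default_rng(0)
conf_cal = rng.uniform(0.5, 1.0, size=500)            # toy over-confident scores
correct_cal = rng.random(500) < (conf_cal - 0.2)      # toy accuracy below confidence
calibrate = fit_beta_calibration(conf_cal, correct_cal)
print(calibrate(np.array([0.6, 0.8, 0.95])))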

Calibration comparison histograms

Acknowledgment

This work was supported by the Estonian Research Council grant “Developing human-centric digital solutions” (TEM TA120) and by the Estonian Centre of Excellence in Artificial Intelligence (EXAI), funded by the Estonian Ministry of Education and Research and co-funded by the European Union and the Estonian Research Council via project TEM TA119. It was also funded by the EU H2020 program under the SoBigData++ project (grant agreement No. 871042) and partially funded by the HAMISON project.

Citation

@article{sabir2026confidence,
  title={The Confidence Trap: Gender Bias and Predictive Certainty in LLMs},
  author={Sabir, Ahmed and K{\"a}ngsepp, Markus and Sharma, Rajesh},
  journal={arXiv preprint arXiv:2601.07806},
  year={2026}
}
        

This project page was inspired by the Nerfies project page. You are free to borrow the source code of this website.