Ahmed Sabir1, Markus Kängsepp1, Rajesh Sharma1,2
1 Institute of Computer Science, University of Tartu, Estonia
2 School of AI and CS, Plaksha University, India
The increasing use of Large Language Models (LLMs) in sensitive domains has led to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, we investigate probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate whether calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among six state-of-the-art models, Gemma-2 demonstrates the worst calibration on the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs’ confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in pronoun resolution tasks.
Commonly used calibration metrics for LLMs include ECE, MacroCE, ICE, and the Brier score. These metrics typically evaluate binary prediction outcomes, in our case two sentence completions: one with a male pronoun and one with a female pronoun. However, aggregate measures such as ECE provide no information about how the model behaves differently with respect to these two outcomes, i.e., male versus female pronouns. To address this limitation, we propose Gender-Aware Group ECE (Gender-ECE), a metric that explicitly quantifies calibration disparities across gendered pronouns.
$$ \text{Gender-ECE} = \frac{1}{2} (\text{ECE}_{\text{male}} + \text{ECE}_{\text{female}}) $$
$$ \text{ECE}_{\text{male/female}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|, \quad \forall \hat{y}_i \in \{1,0\} $$
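As a minimal sketch, the metric can be computed by splitting predictions by gender group, computing the standard binned ECE within each group, and averaging the two values. The snippet below assumes each example carries a predicted confidence, a binary correctness flag, and a gender label; the variable and function names are illustrative rather than taken from our codebase.

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Standard Expected Calibration Error with equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |B_m| / n  *  |acc(B_m) - conf(B_m)|
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

def gender_ece(conf, correct, gender, n_bins=10):
    """Gender-ECE: average of the per-group ECEs for the male and female subsets."""
    conf, correct, gender = map(np.asarray, (conf, correct, gender))
    per_group = {g: ece(conf[gender == g], correct[gender == g], n_bins)
                 for g in ("male", "female")}
    return 0.5 * (per_group["male"] + per_group["female"]), per_group
```

Reporting the per-group values alongside the average makes the direction of the disparity visible, which the aggregate ECE hides.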
The primary focus of this analysis is to assess the models’ bias and confidence when predicting pronouns at the end of sentences in the GenderLex dataset (last-cloze setting), meaning that the model has access to the full sentence context before scoring a gendered pronoun.
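In the last-cloze setting, the confidence for each pronoun can be read directly from the model’s next-token distribution given the full context. The sketch below illustrates this with a HuggingFace causal LM; the model choice, function name, and example context are illustrative, not the exact evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any of the evaluated models can be substituted.
model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def pronoun_confidence(context, pronouns=("he", "she")):
    """Return log-probabilities of candidate pronouns as the next token."""
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # next-token distribution
    log_probs = torch.log_softmax(logits, dim=-1)
    scores = {}
    for p in pronouns:
        # Score the first sub-token of each pronoun (with a leading space).
        tok_id = tokenizer(" " + p, add_special_tokens=False)["input_ids"][0]
        scores[p] = log_probs[tok_id].item()
    return scores

# Hypothetical GenderLex-style context:
# pronoun_confidence("The nurse prepared the report, and the manager thanked",
#                    pronouns=("him", "her"))
```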
| Model | ECE | MacroCE | ICE | Brier Score | Gender-ECE (Group) | Gender-ECE (M) | Gender-ECE (F) | Human |
|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | 0.076 | 0.453 | 0.374 | 0.432 | 0.076 | 0.085 | 0.066 | 0.715 |
| LLAMA-3.1-8B | 0.111 | 0.466 | 0.371 | 0.446 | 0.111 | 0.112 | 0.109 | 0.727 |
| Gemma-2-9B | 0.327 | 0.493 | 0.390 | 0.559 | 0.267 | 0.330 | 0.204 | 0.617 |
| Qwen2.5-7B | 0.106 | 0.476 | 0.422 | 0.385 | 0.107 | 0.052 | 0.162 | 0.637 |
| Falcon-3-7B | 0.161 | 0.491 | 0.449 | 0.356 | 0.149 | 0.081 | 0.217 | 0.605 |
| DeepSeek-8B | 0.085 | 0.461 | 0.369 | 0.470 | 0.090 | 0.074 | 0.106 | 0.686 |
Finding: GPT-J-6B exhibits the best calibration (lowest ECE), while Gemma-2-9B performs the worst overall, consistently producing incorrect outcomes with a high disparity toward the female group.
The main objective is to evaluate model confidence in coreference resolution on the WinoBias dataset by measuring the model’s preferences when resolving gendered pronouns.
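One way to obtain such a preference score is to compare the likelihood the model assigns to the sentence under each gendered resolution and normalise the two values. The sketch below shows this sequence-level comparison with a HuggingFace causal LM; it is an assumption about the scoring setup rather than the exact evaluation pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"   # illustrative; any evaluated model can be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence):
    """Total log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        out = model(ids, labels=ids)
    # HF returns the mean token-level NLL; rescale to a total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

def resolution_preference(sentence_male, sentence_female):
    """Softmax-normalised preference between the two gendered resolutions."""
    scores = torch.tensor([sentence_logprob(sentence_male),
                           sentence_logprob(sentence_female)])
    p_male, p_female = torch.softmax(scores, dim=0).tolist()
    return {"male": p_male, "female": p_female}

# WinoBias-style pair:
# resolution_preference(
#     "The physician hired the secretary because he was overwhelmed with clients.",
#     "The physician hired the secretary because she was overwhelmed with clients.")
```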
| Model | ECE | MacroCE | ICE | Brier | Gender-ECE (Group) | Gender-ECE (M) | Gender-ECE (F) | Human |
|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | 0.157 | 0.444 | 0.356 | 0.481 | 0.164 | 0.150 | 0.179 | 0.686 |
| LLAMA-3.1-8B | 0.193 | 0.460 | 0.377 | 0.460 | 0.214 | 0.179 | 0.249 | 0.662 |
| Gemma-2-9B | 0.429 | 0.490 | 0.482 | 0.467 | 0.297 | 0.438 | 0.156 | 0.509 |
| Qwen2.5-7B | 0.234 | 0.442 | 0.362 | 0.510 | 0.190 | 0.259 | 0.121 | 0.630 |
| Falcon-3-7B | 0.154 | 0.452 | 0.357 | 0.487 | 0.149 | 0.160 | 0.138 | 0.684 |
| DeepSeek-8B | 0.240 | 0.478 | 0.382 | 0.496 | 0.218 | 0.255 | 0.182 | 0.648 |
Finding: Gemma-2-9B exhibits the worst calibration overall and a preference for male pronouns. GPT-J-6B and Falcon-3-7B are the fairest models, with the lowest Gender-ECE and minimal differences between genders.
We investigate model behavior under gender-neutral conditions by replacing occupation titles in the GenderLex dataset with gender-neutral terms such as person and someone.
| Metric | GPT-J-6B (Someone) | GPT-J-6B (Person) | GPT-J-6B (Occ) | LLAMA-3.1-8B (Someone) | LLAMA-3.1-8B (Person) | LLAMA-3.1-8B (Occ) | DeepSeek-R1-8B (Someone) | DeepSeek-R1-8B (Person) | DeepSeek-R1-8B (Occ) | Gemma-2-9B (Someone) | Gemma-2-9B (Person) | Gemma-2-9B (Occ) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ECE | 0.144 | 0.063 | 0.076 | 0.134 | 0.138 | 0.111 | 0.139 | 0.130 | 0.085 | 0.364 | 0.367 | 0.327 |
| MacroCE | 0.484 | 0.476 | 0.453 | 0.483 | 0.478 | 0.466 | 0.482 | 0.481 | 0.461 | 0.493 | 0.493 | 0.494 |
| ICE | 0.454 | 0.442 | 0.374 | 0.436 | 0.445 | 0.371 | 0.452 | 0.443 | 0.369 | 0.397 | 0.393 | 0.390 |
| Brier Score | 0.331 | 0.341 | 0.432 | 0.358 | 0.348 | 0.446 | 0.352 | 0.361 | 0.470 | 0.560 | 0.581 | 0.574 |
| Group-ECE | 0.138 | 0.077 | 0.076 | 0.132 | 0.111 | 0.138 | 0.138 | 0.137 | 0.097 | 0.450 | 0.351 | 0.267 |
| + Male | 0.106 | 0.031 | 0.085 | 0.115 | 0.071 | 0.112 | 0.048 | 0.065 | 0.074 | 0.363 | 0.367 | 0.330 |
| + Female | 0.170 | 0.122 | 0.066 | 0.148 | 0.109 | 0.109 | 0.228 | 0.210 | 0.216 | 0.536 | 0.335 | 0.204 |
| Human alignment | 0.598 | 0.616 | 0.715 | 0.638 | 0.598 | 0.727 | 0.578 | 0.596 | 0.687 | 0.603 | 0.606 | 0.618 |
Finding: Explicit role titles improve model calibration, whereas gender-neutral terms increase calibration error due to ambiguity.
Our findings show that most models are poorly calibrated, remaining over-confident despite frequent prediction errors. We therefore apply Beta calibration (Kull et al., 2017) using a balanced calibration set (50:50), which reduces ECE and better aligns predicted confidences with empirical accuracy.
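Beta calibration can be fitted as a logistic regression over log-transformed confidences, following Kull et al. (2017). The sketch below is a minimal, self-contained version (it omits the non-negativity constraint on the coefficients used in the original method) and assumes a held-out calibration split with raw confidences and binary correctness labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibrator(conf, correct, eps=1e-6):
    """Fit beta calibration as logistic regression on [ln(p), -ln(1 - p)] features."""
    p = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1.0 - p)])
    calibrator = LogisticRegression(C=1e6)   # effectively unregularised
    calibrator.fit(X, np.asarray(correct, dtype=int))
    return calibrator

def apply_beta_calibrator(calibrator, conf, eps=1e-6):
    """Map raw confidences to calibrated probabilities of being correct."""
    p = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1.0 - p)])
    return calibrator.predict_proba(X)[:, 1]
```

The calibrator is fitted on the held-out 50:50 calibration split and then applied to the evaluation confidences before recomputing ECE and Gender-ECE.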
This work was supported by the Estonian Research Council grant “Developing human-centric digital solutions” (TEM TA120) and by the Estonian Centre of Excellence in Artificial Intelligence (EXAI), funded by the Estonian Ministry of Education and Research and co-funded by the European Union and the Estonian Research Council via project TEM TA119. It was also funded by the EU H2020 program under the SoBigData++ project (grant agreement No. 871042) and partially funded by the HAMISON project.
@article{sabir2026confidence,
title={The Confidence Trap: Gender Bias and Predictive Certainty in LLMs},
author={Sabir, Ahmed and K{\"a}ngsepp, Markus and Sharma, Rajesh},
journal={arXiv preprint arXiv:2601.07806},
year={2026}
}