Review: Large Language Models Are Biased Because They Are Large Language Models
In this blog post I will review Philip Resnik’s position paper, *Large Language Models Are Biased Because They Are Large Language Models*.
Large language models optimize next-token prediction on human text and therefore learn the regularities in that text, including social biases. Post-hoc filters and preference tuning primarily reshape style rather than the learned conditional structure, so apparent improvements often move bias around rather than remove it. Real progress likely requires objectives and architectures that encode normative constraints directly rather than inferring them indirectly.
Let \(\mathcal{D}\) be a distribution over sequences \(x = (x_1,\ldots,x_T)\) generated by human authors. A language model \(p_\theta\) with parameters \(\theta\) is trained by empirical risk minimization with negative log-likelihood loss, equivalently by minimizing \(\mathrm{KL}(\mathcal{D}\,\|\,p_\theta)\):
\[ \theta^\star = \arg\min_{\theta} \ \mathbb{E}_{x \sim \mathcal{D}}\big[-\log p_\theta(x)\big] \quad\Longleftrightarrow\quad \theta^\star = \arg\min_{\theta} \ \mathrm{KL}\!\left(\mathcal{D}\,\|\,p_\theta\right) \]
Token factorization means the optimum matches the data’s conditionals wherever \(\mathcal{D}\) provides support: \[ p_{\theta^\star}(x_t \mid x_{< t}) \approx \mathcal{D}(x_t \mid x_{< t}) \quad \text{for typical contexts } x_{< t}. \] Introduce a sensitive attribute \(A\) that is recoverable from the context, and a linguistic decision variable \(Y\) such as an occupation token. If the training distribution exhibits a difference \(\mathcal{D}(Y \mid A=a_1, X) \neq \mathcal{D}(Y \mid A=a_2, X)\) for some contexts \(X\), then the model that best approximates \(\mathcal{D}\) necessarily inherits that difference: \[ p_{\theta^\star}(Y \mid A=a, X) \approx \mathcal{D}(Y \mid A=a, X). \]
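To make this concrete, here is a minimal sketch with invented counts: plain gradient descent on the cross-entropy objective drives a softmax model’s conditionals toward the empirical conditionals of the data, group-conditioned gap included.

```python
import numpy as np

# Invented counts over {nurse, engineer, other} for two group-coded contexts.
counts = {
    "f": np.array([20.0, 8.0, 72.0]),
    "m": np.array([5.0, 22.0, 73.0]),
}
logits = {a: np.zeros(3) for a in counts}      # one logit vector per context

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for _ in range(2000):                          # plain gradient descent on NLL
    for a, c in counts.items():
        p_emp = c / c.sum()                    # empirical conditional D(y | a)
        p_model = softmax(logits[a])
        logits[a] -= lr * (p_model - p_emp)    # exact gradient of cross-entropy

for a in counts:
    print(a, softmax(logits[a]).round(3),
          "empirical:", (counts[a] / counts[a].sum()).round(3))
# The fitted conditionals match the empirical ones; shrinking the 0.20-vs-0.05
# nurse gap would necessarily raise the training loss.
```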
The core objective aligns the model to the training distribution. If that distribution encodes harmful social regularities, alignment reproduces them. This is structural, not an implementation bug.
Consider prompts of the form “The profession of NAME is …” and define \(Y\) to be the occupation token. Let \(A\) denote gender inferred from the name. Suppose the corpus includes frequencies \(\mathcal{D}(Y{=}\texttt{nurse}\mid A{=}\mathrm{f})=0.20\) and \(\mathcal{D}(Y{=}\texttt{engineer}\mid A{=}\mathrm{f})=0.08\), while \(\mathcal{D}(Y{=}\texttt{nurse}\mid A{=}\mathrm{m})=0.05\) and \(\mathcal{D}(Y{=}\texttt{engineer}\mid A{=}\mathrm{m})=0.22\). Maximum-likelihood training allocates logits so that, in the corresponding contexts, the softmax approximates these conditionals. The gap \(\Delta_{\texttt{nurse}} = p_{\theta^\star}(\texttt{nurse}\mid \mathrm{f}) - p_{\theta^\star}(\texttt{nurse}\mid \mathrm{m}) \approx 0.15\) remains unless the learned distribution departs from the data in a way that increases the training loss.
If a post-hoc safety layer forbids the literal token *nurse*, we obtain a projected distribution \(\tilde{p}_{\theta^\star} \propto \Pi(p_{\theta^\star})\). Probability mass migrates to paraphrases such as *caregiver* or *assistant*. The literal term drops, but the group-conditioned association remains.
Consider the next token after “The profession of NAME is …”. Suppose counts are as follows.
| | nurse | engineer | (other) | Total |
|---|---|---|---|---|
| female-coded | 200 | 80 | 720 | 1000 |
| male-coded | 50 | 220 | 730 | 1000 |
The implied conditionals are 0.20 and 0.08 for female-coded versus 0.05 and 0.22 for male-coded names. A maximum-likelihood model will approximate these values in context and therefore exhibits a positive \(\Delta_{\texttt{nurse}}\) by construction. If the token *nurse* is blacklisted, mass shifts to near-synonyms, and the dependency on \(A\) persists under a new label.
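A short sketch ties the table to the text: maximum-likelihood conditionals are just row-normalized counts, and the blacklist is a mask-and-renormalize projection. The caregiver column is an assumed split of the “(other)” mass, added only to make the paraphrase effect visible.

```python
# Row totals still sum to 1000 per group; nurse and engineer counts are unchanged.
counts = {
    "female-coded": {"nurse": 200, "caregiver": 100, "engineer": 80, "other": 620},
    "male-coded":   {"nurse": 50,  "caregiver": 20,  "engineer": 220, "other": 710},
}

def conditionals(row):
    total = sum(row.values())
    return {tok: n / total for tok, n in row.items()}   # MLE = row-normalized counts

def blacklist(probs, banned):
    kept = {tok: (0.0 if tok == banned else q) for tok, q in probs.items()}
    z = sum(kept.values())
    return {tok: q / z for tok, q in kept.items()}       # renormalize remaining mass

for group, row in counts.items():
    p = conditionals(row)
    print(group, {t: round(q, 3) for t, q in p.items()})
    print(group, "filtered:", {t: round(q, 3) for t, q in blacklist(p, "nurse").items()})
# Delta_nurse = 0.20 - 0.05 = 0.15 before filtering; after filtering the literal
# token is gone, but the caregiver-vs-engineer skew still tracks A.
```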
Preference optimization augments the objective with a reward model \(r_\phi(x,y)\) and a regularization term that constrains divergence from a reference policy \(p_{\theta_0}\): \[ \max_{\theta} \ \mathbb{E}_{x\sim\mathcal{D}} \ \mathbb{E}_{y\sim p_\theta(\cdot|x)} \big[r_\phi(x,y)\big] \ - \ \beta \, \mathbb{E}_{x\sim\mathcal{D}} \ \mathrm{KL}\!\big(p_\theta(\cdot|x) \,\|\, p_{\theta_0}(\cdot|x)\big). \] This reshapes outputs toward what annotators deem acceptable, but it does not rewrite the underlying conditional structure that the base model learned from \(\mathcal{D}\). Large \(\beta\) keeps the policy close to the base (and its biases); small \(\beta\) departs more but risks utility and factuality trade-offs, while latent dependencies can survive through indirection.
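The role of \(\beta\) can be seen in closed form: for a fixed reward and reference policy, the maximizer of this objective is the exponentially tilted distribution \(p^\star(y\mid x) \propto p_{\theta_0}(y\mid x)\,\exp\!\big(r_\phi(x,y)/\beta\big)\). The sketch below applies that tilting to the toy conditionals from the example above, with an invented reward that penalizes only the literal stereotyped token.

```python
import numpy as np

# Reference conditionals over {nurse, caregiver, engineer, other}; illustrative only.
p0 = {
    "f": np.array([0.20, 0.10, 0.08, 0.62]),
    "m": np.array([0.05, 0.02, 0.22, 0.71]),
}
reward = np.array([-1.0, 0.0, 0.0, 0.0])   # annotators penalize the stereotyped token

def tilted(p_ref, r, beta):
    w = p_ref * np.exp(r / beta)           # closed-form optimum of the KL-regularized objective
    return w / w.sum()

for beta in (10.0, 1.0, 0.1):
    gap = tilted(p0["f"], reward, beta)[0] - tilted(p0["m"], reward, beta)[0]
    print(f"beta={beta:>4}:  Delta_nurse = {gap:.3f}")
# Large beta stays near the reference (gap near 0.15); small beta suppresses the
# penalized token for both groups, while the caregiver/engineer skew remains.
```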
The language modeling loss rewards descriptive fidelity to human usage and does not distinguish correlations that are evidentially relevant from those that are socially impermissible. Without explicit normative structure, models cannot reliably separate statistical regularities from stereotypes that should be resisted.
Forbidden-word rates and toxicity triggers are useful diagnostics but insufficient. If the concern is dependence on a protected variable, measurement should target conditional behavior by holding relevant covariates \(X\) fixed and comparing \(p_\theta(Y\mid A,X)\) across values of \(A\). A stable difference under paraphrase indicates a structural dependency. Aggregated evaluations can hide conditional gaps, so counterfactual flips of \(A\) with semantics held constant are especially informative.
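A measurement harness along these lines is straightforward to sketch. The scoring function below is a stand-in for whatever token-probability interface your model exposes; the template and name pairs are assumptions chosen for illustration.

```python
from statistics import mean

# Counterfactual evaluation loop: flip only the group-coded name, hold the rest
# of the context X fixed, and compare the probability of the same continuation.
name_pairs = [("Emily", "Jake"), ("Aisha", "Tom")]     # assumed counterfactual pairs
template = "The profession of {name} is"
target = " nurse"

def conditional_gap(score_fn, pairs=name_pairs, template=template, target=target):
    """score_fn(prompt, token) -> probability; plug in your model's scorer."""
    gaps = []
    for f_name, m_name in pairs:
        p_f = score_fn(template.format(name=f_name), target)
        p_m = score_fn(template.format(name=m_name), target)
        gaps.append(p_f - p_m)                          # same X, only A flipped
    return mean(gaps)

# Usage: conditional_gap(my_model_scorer). A gap that stays stable under
# paraphrases of the template points to a structural dependency on A rather
# than a surface artifact of one prompt.
```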
As in decision systems, calibration, parity, and equalized odds cannot generally be satisfied simultaneously when base rates differ. Although language models are not simple classifiers, the analogy clarifies the thesis: higher fidelity to \(\mathcal{D}\) implies higher fidelity to its base-rate differences.
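A small invented example makes the tension concrete: a score that is perfectly calibrated in both groups, thresholded identically, still produces unequal true- and false-positive rates when base rates differ.

```python
# Each bucket is (score, n_people, n_actually_positive); calibration holds because
# the positive fraction in each bucket equals its score. Counts are invented.
groups = {
    "A": [(0.8, 100, 80), (0.2, 100, 20)],   # base rate 0.50
    "B": [(0.8, 50, 40), (0.2, 150, 30)],    # base rate 0.35
}
threshold = 0.5

for g, buckets in groups.items():
    tp = fp = fn = tn = 0
    for score, n, pos in buckets:
        neg = n - pos
        if score >= threshold:
            tp, fp = tp + pos, fp + neg
        else:
            fn, tn = fn + pos, tn + neg
    print(g, "TPR =", round(tp / (tp + fn), 3), "FPR =", round(fp / (fp + tn), 3))
# A: TPR 0.8, FPR 0.2;  B: TPR ~0.571, FPR ~0.077 — calibration holds, parity fails.
```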
Even if names and pronouns are stripped, proxies abound through topics, dialect, geography, and world knowledge. Protected attributes are often recoverable from intermediate representations learned purely for perplexity, enabling downstream dependence during generation.
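A standard way to check this is a probing classifier on intermediate states. In the sketch below the loader is a placeholder that should be replaced by real hidden vectors extracted from prompts with explicit group markers removed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_hidden_states():
    # Placeholder: return (features, labels), where each row is a hidden vector
    # from one layer of your model and each label encodes the attribute A.
    rng = np.random.default_rng(0)
    return rng.normal(size=(200, 64)), rng.integers(0, 2, size=200)

X, y = load_hidden_states()
probe = LogisticRegression(max_iter=1000)           # linear probe for A
scores = cross_val_score(probe, X, y, cv=5)
print("probe accuracy:", scores.mean())
# Accuracy well above the majority-class baseline means A is linearly recoverable
# from representations learned purely for next-token prediction; swap in a small
# MLP to test nonlinear recoverability.
```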
To defeat a structural property, the intervention must be structural. One approach modifies the objective by adding constraints that penalize dependence of outputs on \(A\) after conditioning on legitimate features: \[ \mathcal{L}(\theta) \;=\; \mathbb{E}_{x\sim\mathcal{D}}[-\log p_\theta(x)] \;+\; \lambda \, \Psi\!\left( p_\theta(Y\,|\,A,X) \right), \] where \(\Psi\) measures undesired dependence. Another approach uses hybrid architectures in which a language model proposes candidates while causal or logical modules enforce invariances and provide auditability.
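Here is a minimal sketch of the constrained objective on the toy setup above, with \(\Psi\) instantiated as one possible choice (not the paper’s): the squared difference between the model’s group-conditional next-token distributions for counterfactual context pairs.

```python
import torch
import torch.nn.functional as F

# Toy "model": one logit vector per group-coded context over {nurse, caregiver,
# engineer, other}. Counts match the illustrative split used earlier.
logits = torch.zeros(2, 4, requires_grad=True)
target_counts = torch.tensor([[200., 100., 80., 620.],
                              [ 50.,  20., 220., 710.]])
emp = target_counts / target_counts.sum(dim=1, keepdim=True)
lam = 5.0                                            # weight on the dependence penalty
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(500):
    opt.zero_grad()
    logp = F.log_softmax(logits, dim=1)
    nll = -(emp * logp).sum(dim=1).mean()            # cross-entropy to the data
    p = logp.exp()
    psi = ((p[0] - p[1]) ** 2).sum()                 # Psi: gap between A-flipped conditionals
    loss = nll + lam * psi
    loss.backward()
    opt.step()

print(F.softmax(logits, dim=1).detach().numpy().round(3))
# lam = 0 recovers the biased empirical conditionals; larger lam trades likelihood
# for smaller group-conditional gaps, making the trade-off explicit.
```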
The paper does not say that mitigation is pointless. It says that as long as the governing objective remains to reproduce human text as such, improvements to that objective will couple to the biases of that text. Filters and preference shaping can reduce acute harms and are valuable in practice, but they do not convert a descriptive learner into a normatively constrained one.
The fable of the scorpion and the frog is a metaphor for behavior that follows from a fixed nature. The nature of an LLM is to minimize cross-entropy to human text. When it stings, it is not because it malfunctions but because it executes its nature in contexts where that nature conflicts with our values. Reliably changing the outcome requires changing the nature: either the distribution the model mirrors, the metric of mirroring, or the machinery that turns mirroring into action under constraints.
If the objective is to be an accurate model of human language, then structural artifacts of human language will be structural artifacts of the model. When those artifacts are harmful, it is wishful to expect that better mirrors will be less reflective. What is required is a concurrent theory of learning that prioritizes the reasons for predictions and not merely their frequencies, backed by measurements that interrogate conditional behavior rather than surface forms.
Several open questions follow. Objective design: How can we create practical surrogate objectives that place constraints on \(p_\theta(Y \mid A, X)\)? The goal is to ensure models make decisions for the right reasons while keeping the loss in raw performance small.
Representation control: Can we reliably suppress both linear and non-linear recoverability of protected attributes in intermediate model states, while still maintaining strong task performance?
Evaluation at scale: How can we measure group-conditional behavior, including counterfactual flips of \(A\), across diverse tasks without causing prompt leakage or falling into Simpson’s paradox?
Limits of preference learning: When and why do methods like RLHF or direct preference optimization fail to change structural dependencies? What additional signals are needed for success?