Review: Large Language Models Are Biased Because They Are Large Language Models
In this blog post I will review Philip Resnik’s position paper, *Large Language Models Are Biased Because They Are Large Language Models*.
Large language models optimize next-token prediction on human text and therefore learn the regularities in that text, including social biases. Post-hoc filters and preference tuning primarily reshape style rather than the learned conditional structure, so apparent improvements often move bias around rather than remove it. Real progress likely requires objectives and architectures that encode normative constraints directly rather than inferring them indirectly.
Let \(\mathcal{D}\) be a distribution over sequences \(x = (x_1,\ldots,x_T)\) generated by human authors. A language model \(p_\theta\) with parameters \(\theta\) is trained by empirical risk minimization with negative log-likelihood loss, equivalently by minimizing \(\mathrm{KL}(\mathcal{D}\,\|\,p_\theta)\):
\[ \theta^\star = \arg\min_{\theta} \ \mathbb{E}_{x \sim \mathcal{D}}\big[-\log p_\theta(x)\big] \quad\Longleftrightarrow\quad \theta^\star = \arg\min_{\theta} \ \mathrm{KL}\!\left(\mathcal{D}\,\|\,p_\theta\right) \]
Token factorization means the optimum matches the data’s conditionals wherever \(\mathcal{D}\) provides support: \[ p_{\theta^\star}(x_t \mid x_{< t}) \approx \mathcal{D}(x_t \mid x_{< t}) \quad \text{for typical contexts } x_{< t}. \] Introduce a sensitive attribute \(A\) that is recoverable from the context, and a linguistic decision variable \(Y\) such as an occupation token. If the training distribution exhibits a difference \(\mathcal{D}(Y \mid A=a_1, X) \neq \mathcal{D}(Y \mid A=a_2, X)\) for some contexts \(X\), then the model that best approximates \(\mathcal{D}\) necessarily inherits that difference: \[ p_{\theta^\star}(Y \mid A=a, X) \approx \mathcal{D}(Y \mid A=a, X). \]
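To make this concrete, here is a minimal sketch with invented counts: plain gradient descent on the cross-entropy objective drives a softmax model’s conditionals toward the empirical conditionals of the data, group-conditioned gap included.

```python
import numpy as np

# Invented counts over {nurse, engineer, other} for two group-coded contexts.
counts = {
    "f": np.array([20.0, 8.0, 72.0]),
    "m": np.array([5.0, 22.0, 73.0]),
}
logits = {a: np.zeros(3) for a in counts}      # one logit vector per context

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for _ in range(2000):                          # plain gradient descent on NLL
    for a, c in counts.items():
        p_emp = c / c.sum()                    # empirical conditional D(y | a)
        p_model = softmax(logits[a])
        logits[a] -= lr * (p_model - p_emp)    # exact gradient of cross-entropy

for a in counts:
    print(a, softmax(logits[a]).round(3),
          "empirical:", (counts[a] / counts[a].sum()).round(3))
# The fitted conditionals match the empirical ones; shrinking the 0.20-vs-0.05
# nurse gap would necessarily raise the training loss.
```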
The core objective aligns the model to the training distribution. If that distribution encodes harmful social regularities, alignment reproduces them. This is structural, not an implementation bug.
Consider prompts of the form “The profession of NAME is …” and define \(Y\) to be the occupation token. Let \(A\) denote gender inferred from the name. Suppose the corpus includes frequencies \(\mathcal{D}(Y{=}\texttt{nurse}\mid A{=}\mathrm{f})=0.20\) and \(\mathcal{D}(Y{=}\texttt{engineer}\mid A{=}\mathrm{f})=0.08\), while \(\mathcal{D}(Y{=}\texttt{nurse}\mid A{=}\mathrm{m})=0.05\) and \(\mathcal{D}(Y{=}\texttt{engineer}\mid A{=}\mathrm{m})=0.22\). Maximum-likelihood training allocates logits so that, in the corresponding contexts, the softmax approximates these conditionals. The gap \(\Delta_{\texttt{nurse}} = p_{\theta^\star}(\texttt{nurse}\mid \mathrm{f}) - p_{\theta^\star}(\texttt{nurse}\mid \mathrm{m}) \approx 0.15\) remains unless the learned distribution departs from the data in a way that increases the training loss.
If a post-hoc safety layer forbids the literal token *nurse*, we obtain a projected distribution \(\tilde{p}_{\theta^\star} \propto \Pi(p_{\theta^\star})\). Probability mass migrates to paraphrases such as *caregiver* or *assistant*. The literal term drops, but the group-conditioned association remains.
Consider the next token after “The profession of NAME is …”. Suppose counts are as follows.
| | nurse | engineer | (other) | Total |
|---|---|---|---|---|
| female-coded | 200 | 80 | 720 | 1000 |
| male-coded | 50 | 220 | 730 | 1000 |
The implied conditionals are 0.20 and 0.08 for female-coded versus 0.05 and 0.22 for male-coded names. A maximum-likelihood model will approximate these values in context and therefore exhibits a positive \(\Delta_{\texttt{nurse}}\) by construction. If the token *nurse* is blacklisted, mass shifts to near-synonyms, and the dependency on \(A\) persists under a new label.
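A short sketch ties the table to the text: maximum-likelihood conditionals are just row-normalized counts, and the blacklist is a mask-and-renormalize projection. The caregiver column is an assumed split of the “(other)” mass, added only to make the paraphrase effect visible.

```python
# Row totals still sum to 1000 per group; nurse and engineer counts are unchanged.
counts = {
    "female-coded": {"nurse": 200, "caregiver": 100, "engineer": 80, "other": 620},
    "male-coded":   {"nurse": 50,  "caregiver": 20,  "engineer": 220, "other": 710},
}

def conditionals(row):
    total = sum(row.values())
    return {tok: n / total for tok, n in row.items()}   # MLE = row-normalized counts

def blacklist(probs, banned):
    kept = {tok: (0.0 if tok == banned else q) for tok, q in probs.items()}
    z = sum(kept.values())
    return {tok: q / z for tok, q in kept.items()}       # renormalize remaining mass

for group, row in counts.items():
    p = conditionals(row)
    print(group, {t: round(q, 3) for t, q in p.items()})
    print(group, "filtered:", {t: round(q, 3) for t, q in blacklist(p, "nurse").items()})
# Delta_nurse = 0.20 - 0.05 = 0.15 before filtering; after filtering the literal
# token is gone, but the caregiver-vs-engineer skew still tracks A.
```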
Preference optimization augments the objective with a reward model \(r_\phi(x,y)\) and a regularization term that constrains divergence from a reference policy \(p_{\theta_0}\): \[ \max_{\theta} \ \mathbb{E}_{x\sim\mathcal{D}} \ \mathbb{E}_{y\sim p_\theta(\cdot|x)} \big[r_\phi(x,y)\big] \ - \ \beta \, \mathbb{E}_{x\sim\mathcal{D}} \ \mathrm{KL}\!\big(p_\theta(\cdot|x) \,\|\, p_{\theta_0}(\cdot|x)\big). \] This reshapes outputs toward what annotators deem acceptable, but it does not rewrite the underlying conditional structure that the base model learned from \(\mathcal{D}\). Large \(\beta\) keeps the policy close to the base (and its biases); small \(\beta\) departs more but risks utility and factuality trade-offs, while latent dependencies can survive through indirection.
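The role of \(\beta\) can be seen in closed form: for a fixed reward and reference policy, the maximizer of this objective is the exponentially tilted distribution \(p^\star(y\mid x) \propto p_{\theta_0}(y\mid x)\,\exp\!\big(r_\phi(x,y)/\beta\big)\). The sketch below applies that tilting to the toy conditionals from the example above, with an invented reward that penalizes only the literal stereotyped token.

```python
import numpy as np

# Reference conditionals over {nurse, caregiver, engineer, other}; illustrative only.
p0 = {
    "f": np.array([0.20, 0.10, 0.08, 0.62]),
    "m": np.array([0.05, 0.02, 0.22, 0.71]),
}
reward = np.array([-1.0, 0.0, 0.0, 0.0])   # annotators penalize the stereotyped token

def tilted(p_ref, r, beta):
    w = p_ref * np.exp(r / beta)           # closed-form optimum of the KL-regularized objective
    return w / w.sum()

for beta in (10.0, 1.0, 0.1):
    gap = tilted(p0["f"], reward, beta)[0] - tilted(p0["m"], reward, beta)[0]
    print(f"beta={beta:>4}:  Delta_nurse = {gap:.3f}")
# Large beta stays near the reference (gap near 0.15); small beta suppresses the
# penalized token for both groups, while the caregiver/engineer skew remains.
```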
The language modeling loss rewards descriptive fidelity to human usage and does not distinguish correlations that are evidentially relevant from those that are socially impermissible. Without explicit normative structure, models cannot reliably separate statistical regularities from stereotypes that should be resisted.
Forbidden-word rates and toxicity triggers are useful diagnostics but insufficient. If the concern is dependence on a protected variable, measurement should target conditional behavior by holding relevant covariates \(X\) fixed and comparing \(p_\theta(Y\mid A,X)\) across values of \(A\). A stable difference under paraphrase indicates a structural dependency. Aggregated evaluations can hide conditional gaps, so counterfactual flips of \(A\) with semantics held constant are especially informative.
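A measurement harness along these lines is straightforward to sketch. The scoring function below is a stand-in for whatever token-probability interface your model exposes; the template and name pairs are assumptions chosen for illustration.

```python
from statistics import mean

# Counterfactual evaluation loop: flip only the group-coded name, hold the rest
# of the context X fixed, and compare the probability of the same continuation.
name_pairs = [("Emily", "Jake"), ("Aisha", "Tom")]     # assumed counterfactual pairs
template = "The profession of {name} is"
target = " nurse"

def conditional_gap(score_fn, pairs=name_pairs, template=template, target=target):
    """score_fn(prompt, token) -> probability; plug in your model's scorer."""
    gaps = []
    for f_name, m_name in pairs:
        p_f = score_fn(template.format(name=f_name), target)
        p_m = score_fn(template.format(name=m_name), target)
        gaps.append(p_f - p_m)                          # same X, only A flipped
    return mean(gaps)

# Usage: conditional_gap(my_model_scorer). A gap that stays stable under
# paraphrases of the template points to a structural dependency on A rather
# than a surface artifact of one prompt.
```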
As in decision systems, calibration, parity, and equalized odds cannot generally be satisfied simultaneously when base rates differ. Although language models are not simple classifiers, the analogy clarifies the thesis: higher fidelity to \(\mathcal{D}\) implies higher fidelity to its base-rate differences.
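A small invented example makes the tension concrete: a score that is perfectly calibrated in both groups, thresholded identically, still produces unequal true- and false-positive rates when base rates differ.

```python
# Each bucket is (score, n_people, n_actually_positive); calibration holds because
# the positive fraction in each bucket equals its score. Counts are invented.
groups = {
    "A": [(0.8, 100, 80), (0.2, 100, 20)],   # base rate 0.50
    "B": [(0.8, 50, 40), (0.2, 150, 30)],    # base rate 0.35
}
threshold = 0.5

for g, buckets in groups.items():
    tp = fp = fn = tn = 0
    for score, n, pos in buckets:
        neg = n - pos
        if score >= threshold:
            tp, fp = tp + pos, fp + neg
        else:
            fn, tn = fn + pos, tn + neg
    print(g, "TPR =", round(tp / (tp + fn), 3), "FPR =", round(fp / (fp + tn), 3))
# A: TPR 0.8, FPR 0.2;  B: TPR ~0.571, FPR ~0.077 — calibration holds, parity fails.
```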
Even if names and pronouns are stripped, proxies abound through topics, dialect, geography, and world knowledge. Protected attributes are often recoverable from intermediate representations learned purely for perplexity, enabling downstream dependence during generation.
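A standard way to check this is a probing classifier on intermediate states. In the sketch below the loader is a placeholder that should be replaced by real hidden vectors extracted from prompts with explicit group markers removed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_hidden_states():
    # Placeholder: return (features, labels), where each row is a hidden vector
    # from one layer of your model and each label encodes the attribute A.
    rng = np.random.default_rng(0)
    return rng.normal(size=(200, 64)), rng.integers(0, 2, size=200)

X, y = load_hidden_states()
probe = LogisticRegression(max_iter=1000)           # linear probe for A
scores = cross_val_score(probe, X, y, cv=5)
print("probe accuracy:", scores.mean())
# Accuracy well above the majority-class baseline means A is linearly recoverable
# from representations learned purely for next-token prediction; swap in a small
# MLP to test nonlinear recoverability.
```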
To defeat a structural property, the intervention must be structural. One approach modifies the objective by adding constraints that penalize dependence of outputs on \(A\) after conditioning on legitimate features: \[ \mathcal{L}(\theta) \;=\; \mathbb{E}_{x\sim\mathcal{D}}[-\log p_\theta(x)] \;+\; \lambda \, \Psi\!\left( p_\theta(Y\,|\,A,X) \right), \] where \(\Psi\) measures undesired dependence. Another approach uses hybrid architectures in which a language model proposes candidates while causal or logical modules enforce invariances and provide auditability.
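Here is a minimal sketch of the constrained objective on the toy setup above, with \(\Psi\) instantiated as one possible choice (not the paper’s): the squared difference between the model’s group-conditional next-token distributions for counterfactual context pairs.

```python
import torch
import torch.nn.functional as F

# Toy "model": one logit vector per group-coded context over {nurse, caregiver,
# engineer, other}. Counts match the illustrative split used earlier.
logits = torch.zeros(2, 4, requires_grad=True)
target_counts = torch.tensor([[200., 100., 80., 620.],
                              [ 50.,  20., 220., 710.]])
emp = target_counts / target_counts.sum(dim=1, keepdim=True)
lam = 5.0                                            # weight on the dependence penalty
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(500):
    opt.zero_grad()
    logp = F.log_softmax(logits, dim=1)
    nll = -(emp * logp).sum(dim=1).mean()            # cross-entropy to the data
    p = logp.exp()
    psi = ((p[0] - p[1]) ** 2).sum()                 # Psi: gap between A-flipped conditionals
    loss = nll + lam * psi
    loss.backward()
    opt.step()

print(F.softmax(logits, dim=1).detach().numpy().round(3))
# lam = 0 recovers the biased empirical conditionals; larger lam trades likelihood
# for smaller group-conditional gaps, making the trade-off explicit.
```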
The paper does not say that mitigation is pointless. It says that as long as the governing objective remains to reproduce human text as such, improvements to that objective will couple to the biases of that text. Filters and preference shaping can reduce acute harms and are valuable in practice, but they do not convert a descriptive learner into a normatively constrained one.
The fable of the scorpion and the frog is a metaphor for behavior that follows from a fixed nature. The nature of an LLM is to minimize cross-entropy to human text. When it stings, it is not because it malfunctions but because it executes its nature in contexts where that nature conflicts with our values. Reliably changing the outcome requires changing the nature: either the distribution the model mirrors, the metric of mirroring, or the machinery that turns mirroring into action under constraints.
If the objective is to be an accurate model of human language, then structural artifacts of human language will be structural artifacts of the model. When those artifacts are harmful, it is wishful to expect that better mirrors will be less reflective. What is required is a concurrent theory of learning that prioritizes the reasons for predictions and not merely their frequencies, backed by measurements that interrogate conditional behavior rather than surface forms.
Several open questions follow. Objective design: How can we create practical surrogate objectives that place constraints on \(p_\theta(Y \mid A, X)\)? The goal is to ensure models make decisions for the right reasons while keeping the loss in raw performance small.
Representation control: Can we reliably suppress both linear and non-linear recoverability of protected attributes in intermediate model states, while still maintaining strong task performance?
Evaluation at scale: How can we measure group-conditional behavior, including counterfactual flips of \(A\), across diverse tasks without causing prompt leakage or falling into Simpson’s paradox?
Limits of preference learning: When and why do methods like RLHF or direct preference optimization fail to change structural dependencies? What additional signals are needed for success?