The Verifier Accuracy Paradox: Why Your Perfect Verifier Provides Zero Information

2026-04-29 ~10 min read

A counterintuitive result from our analysis of verifier-filtered self-distillation: the better your verifier, the less information it carries. A verifier with zero false-negatives provides exactly zero discriminatory signal. To sculpt a model toward truth, your verifier must make mistakes. Here is why, and what it implies for foundation-model self-improvement.

The setup

Modern foundation-model training increasingly relies on self-distillation: the model generates candidate outputs, a verifier filters them by quality, and the model retrains on the filtered set. Repeat. The hope is that the filter pulls the model toward truth.

Whether this works is an open theoretical question. Recent work by Hector Zenil (arXiv:2601.05280) gave a sharp condition: the loop converges to truth only if every iteration receives a non-vanishing exogenous grounding signal — formally, a per-round mixing rate $\alpha_t$ with $\inf_t \alpha_t > 0$. Without it, the model becomes a random walk and collapses.

In our work on the Carnot project, we have been studying verifier-based self-distillation specifically. Here, the grounding signal $\alpha_t$ is not abstract: it is exactly what the verifier provides. So the natural question is:

How much information does a verifier inject per round? Specifically, if a verifier $E$ accepts or rejects each candidate output, what is the mutual information $I(E; \mu_P)$ between its output and the true data distribution $\mu_P$?

The naive expectation

Intuitively, a more accurate verifier should provide more information. A perfect verifier — one with zero false-positives and zero false-negatives — should provide the maximum possible signal, since its accept/reject is a noiseless readout of correctness.

The math says the opposite.

The derivation

Let $E: \mathcal{X} \to \{0, 1\}$ be a Boolean verifier evaluating outputs $X$ drawn from the truth distribution $\mu_P$. Define:

False-positive rate (FPR): probability that $E$ accepts an incorrect output, $\Pr[E(X) = 1 \mid X \text{ incorrect}]$.
False-negative rate (FNR): probability that $E$ rejects a correct output, $\Pr[E(X) = 0 \mid X \text{ correct}]$.

Now compute the mutual information $I(E(X); X)$ when $X \sim \mu_P$:

$$I(E(X); X) = H(E(X)) - H(E(X) \mid X)$$

Here is the move. Since $E(X)$ is a deterministic function of $X$, the conditional entropy $H(E(X) \mid X) = 0$. So:

$$I(E; \mu_P) := I(E(X); X) = H(E(X))$$

And here is the trap. Under $X \sim \mu_P$, the verifier only sees truth and never encounters incorrect outputs. So FPR is unobservable: the verifier's behaviour on non-truth does not appear in this calculation. The only randomness is whether $E$ accepts the truth, and by the definition of FNR, $\Pr[E(X) = 0] = \text{FNR}$. So $E(X)$ is a Bernoulli variable with parameter $1 - \text{FNR}$, and its entropy is:

$$\boxed{\; I(E; \mu_P) = h(\text{FNR}) \;}$$

where $h(p) = -p \log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function, measured in bits, with $h(0) = h(1) = 0$ by convention.

This binary entropy is maximized at $\text{FNR} = 0.5$ (giving $h = 1$ bit) and minimized at $\text{FNR} \in \{0, 1\}$ (giving $h = 0$ bits).
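The boxed identity can be checked numerically on a toy case. The sketch below is illustrative only: it assumes a uniform truth distribution over `n` items and a hypothetical deterministic verifier that rejects a fixed subset of them. The names `toy_verifier` and `mutual_information_bits` are invented for this example and are not part of Carnot.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def toy_verifier(x, n, fnr):
    """Hypothetical deterministic verifier: rejects the first round(fnr * n) true items."""
    return 0 if x < round(fnr * n) else 1


def mutual_information_bits(n, fnr):
    """I(E(X); X) = H(E(X)) + H(X) - H(E(X), X) for X uniform on range(n)."""
    xs = range(n)
    joint = Counter((toy_verifier(x, n, fnr), x) for x in xs)
    e_marginal = Counter(toy_verifier(x, n, fnr) for x in xs)
    x_marginal = Counter(xs)
    return (entropy_bits(e_marginal.values())
            + entropy_bits(x_marginal.values())
            - entropy_bits(joint.values()))


for fnr in (0.0, 0.05, 0.5):
    mi = mutual_information_bits(1000, fnr)
    print(f"FNR = {fnr:.2f}  I = {mi:.4f} bits  h(FNR) = {binary_entropy(fnr):.4f} bits")
```

Because the verifier is a deterministic function of `x`, the joint entropy equals the entropy of `X`, and the mutual information reduces to the entropy of the accept/reject bit, as in the derivation.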

The paradox

A perfect verifier provides zero information. When $\text{FNR} = 0$ (verifier always accepts truth), the output $E(X) = 1$ identically, so its entropy is zero, and the verifier carries no signal about $X$. The model learns nothing.

Why? Because the verifier's job is to discriminate. Discrimination requires variance. A "yes-to-everything-true" verifier has no variance under truth. Its filter does nothing — every truth-sample passes — so it cannot reweight the distribution toward anything.

A verifier that occasionally fails on truth, however, induces real variance in its accept/reject pattern. That variance carries bits. Those bits are the grounding signal Zenil's theorem demands.

The bound is sharp: for an $\varepsilon$-accurate verifier (FNR $\leq \varepsilon$, with $\varepsilon \leq 0.5$), the maximum information per round is $h(\varepsilon)$, attained when FNR $= \varepsilon$. For typical safety thresholds:

$\varepsilon = 0.01$ (99% accurate): $h(0.01) \approx 0.081$ bits
$\varepsilon = 0.05$ (95% accurate): $h(0.05) \approx 0.286$ bits
$\varepsilon = 0.10$ (90% accurate): $h(0.10) \approx 0.469$ bits
$\varepsilon = 0.50$ (coin flip): $h(0.50) = 1.000$ bit (theoretical max)
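The figures in this list can be reproduced in a few lines. This is a minimal sketch that assumes base-2 logarithms, so the results are in bits; `binary_entropy` is a helper name chosen here for illustration.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Recompute the per-round information ceiling for each threshold in the list.
for eps in (0.01, 0.05, 0.10, 0.50):
    print(f"eps = {eps:.2f}  h(eps) = {binary_entropy(eps):.3f} bits")
```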

Higher accuracy ⇒ less information. For $\varepsilon < 0.5$, $h$ is strictly increasing in $\varepsilon$, so in the regime that practitioners aim for, the relationship between accuracy and information is inverted.

Why this matters

Suppose you want to self-distill a foundation model on a high-stakes task with a $k$-verifier suite. The total information available per round is bounded above by $\sum_i h(\varepsilon_i)$. If your verifiers are individually 95% accurate, you have at most $k \times 0.286$ bits per round.

For a task whose truth distribution has, say, $H(\mu_P) \approx 50$ bits of conditional entropy (a generous upper bound for code-repair-scale tasks), reaching truth would take roughly $50 / (k \times 0.286) \approx 175 / k$ rounds — and that is assuming every bit is independent and useful. With realistic correlation across verifiers, far more.
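The back-of-envelope estimate above can be written as a small calculation. This sketch takes the $\sum_i h(\varepsilon_i)$ bound at face value and assumes, optimistically, that every bit is independent and useful. The function names are hypothetical, and the 50-bit target is the illustrative figure from the text, not a measurement.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def suite_information_bound(fnrs):
    """Upper bound on bits per round for a suite: the sum of h(eps_i)."""
    return sum(binary_entropy(eps) for eps in fnrs)


def rounds_lower_bound(target_entropy_bits, fnrs):
    """Fewest rounds needed if every bit were independent and useful."""
    bound = suite_information_bound(fnrs)
    if bound == 0.0:
        return math.inf  # a zero-information suite never gets there
    return math.ceil(target_entropy_bits / bound)


for k in (1, 5, 10):
    print(f"k = {k:2d}  rounds >= {rounds_lower_bound(50, [0.05] * k)}")
```

Correlated verifiers contribute less than the sum suggests, so the real number of rounds can only be larger than this floor.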

This is the Boolean Information Bottleneck: any static array of accurate Boolean verifiers has a fundamental information ceiling, and that ceiling shrinks as accuracy rises (for small $\varepsilon$, $h(\varepsilon) \approx \varepsilon \log_2(e/\varepsilon)$, which tends to zero). Under this bound, spending more compute on more-accurate verifiers lowers the ceiling rather than raising it. Spending it on more-numerous verifiers helps, but only if their information is independent.

The way out

In Carnot's analysis we found that the architecture's composition rule dictates whether the bottleneck binds. Specifically, the trick is to deliberately make verifiers worse in a particular way.

Consider AND-composing $m$ independent base verifiers: a candidate must pass all of them. The composite verifier's FNR satisfies $\text{FNR}_{\text{AND}} = 1 - \prod_i (1 - \text{FNR}_i)$. For $m = 14$ base verifiers each with $\varepsilon = 0.05$:

$$\text{FNR}_{\text{AND}} = 1 - (0.95)^{14} \approx 0.512$$

A verifier suite that rejects more than half of correct candidates, rather than 5% of them, is far less useful as a standalone accuracy check. But its information capacity has soared:

$$h(0.512) \approx 1.00 \text{ bit per composite}$$

versus 0.286 bits per individual base verifier. That is a 3.5× increase in information per accept/reject decision, although each composite decision still costs 14 base evaluations.
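The AND-composition arithmetic can be sketched as follows. The code assumes the base verifiers fail independently on correct candidates, which is the assumption behind the product formula. `and_composite_fnr` is an illustrative name, not a Carnot API.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def and_composite_fnr(fnrs):
    """FNR of an AND over independent verifiers: 1 - prod(1 - fnr_i)."""
    return 1.0 - math.prod(1.0 - f for f in fnrs)


base = [0.05] * 14  # m = 14 base verifiers, each with eps = 0.05
composite = and_composite_fnr(base)
print(f"composite FNR   = {composite:.3f}")
print(f"h(composite)    = {binary_entropy(composite):.3f} bits")
print(f"h(single base)  = {binary_entropy(0.05):.3f} bits")
print(f"ratio           = {binary_entropy(composite) / binary_entropy(0.05):.2f}x")
```

If the base verifiers' failures are positively correlated, the composite FNR will generally come out lower than the product formula predicts, so the formula is best treated as an idealized target and checked against measured rejection rates.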

AND-composition is, in this sense, an entropy maximizer. By pulling the composite FNR toward 0.5 (the binary-entropy maximum), it converts accurate-but-information-poor verifiers into mediocre-but-information-rich ones.

The architectural insight: optimizing each verifier for accuracy is the wrong objective. Optimizing the suite's composite information capacity is the right one. AND-composition is one way to do it; there are others.

Connection to broader practice

This result has uncomfortable implications for several common training schemes:

Reward modeling. Training a reward model to be more accurate often reduces its information density, holding the rest of the training loop fixed. The standard intuition that "better reward models give better gradients" needs revision under this framework.
Constitutional AI / RLAIF loops. If the constitutional verifier becomes nearly perfect, its signal vanishes. The system needs to either build deliberate variance into the verifier (e.g., adversarial perturbations) or compose multiple verifiers to expand information density.
Self-rewarding language models. A model that perfectly judges its own outputs provides zero learning signal. The judge must err in ways correlated with errors in the policy — the same mechanism that powers AND-composition.
Constraint-based verification (Carnot's domain). A Z3 SMT solver applied to formal correctness has FNR exactly zero on satisfiable formulas whenever it returns a definite answer (it can also time out or return unknown). So Z3 alone provides zero information about which valid programs to prefer. The architecture requires non-Boolean adjacent verifiers (property-based fuzzing, semantic checks) to bring information into the system.

The rule of thumb

If your training loop has any verifier-filtering mechanism, ask: what fraction of true-output candidates does the filter reject? If the answer is "almost none," the filter is providing almost no information per round. The model is learning slowly, or not at all, regardless of how much compute you throw at it.
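One way to run this check is to log the filter's decisions on a held-out set of candidates already known to be correct, then convert the rejection rate into the $h(\text{FNR})$ ceiling. The sketch below assumes such a labelled set exists. `filter_signal_report` and its input format are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def filter_signal_report(decisions):
    """Estimate FNR and the h(FNR) ceiling from filter decisions.

    decisions: iterable of booleans (True = accepted), one per candidate
    that is already known to be correct.
    Returns (rejection_rate, bits_per_round_ceiling).
    """
    decisions = list(decisions)
    if not decisions:
        raise ValueError("need at least one decision")
    rejected = sum(1 for accepted in decisions if not accepted)
    rejection_rate = rejected / len(decisions)
    return rejection_rate, binary_entropy(rejection_rate)


# Example: a filter that rejects 1 of 100 known-correct candidates.
rate, ceiling = filter_signal_report([True] * 99 + [False])
print(f"rejection rate = {rate:.2%}, ceiling = {ceiling:.3f} bits per round")
```

The estimate is only as good as the held-out set: with few samples, the measured rejection rate and the resulting ceiling are noisy.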

The fix is not better verifiers. It is composing your existing ones in a way that pulls the rejection rate toward 50%, where the binary entropy peaks at 1 bit, while keeping the rejected set genuinely correlated with errors in the policy.

This is a recipe for a foundation-model self-distillation architecture, and it is the central insight underlying Carnot's Phase 3 design. We will write more about that architecture — and prove, formally, that it saturates the corresponding lower bound — in our position paper, currently in preparation.