Why We Report Two AUROCs Now

2026-05-22

A self-improving system that reads from its own past outputs eventually blurs the line between architectural capability and what it has memorized. The cleanest way to keep the line visible is to publish both numbers. This is a short note on why we now do that, what the protocol looks like, and what the gap between the two columns means.

The question that started this

Carnot is a verifier ensemble for language-model outputs. Like any classifier, it is evaluated on benchmarks: FoVer first, then HaluEval, FEVER, TruthfulQA, MBPP, and HumanEval. Each evaluation produces an AUROC, the area under the verifier's true-positive-rate versus false-positive-rate curve: a single number that summarizes how well the verifier can tell good outputs from bad on that benchmark.
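For readers who want the definition in executable form, here is a minimal sketch of AUROC as the probability that a randomly chosen good output is scored above a randomly chosen bad one, with ties counting half. It is plain Python for illustration, not Carnot's evaluation code; the function name `auroc`, the toy scores, and the convention that a higher score means "more likely good" are all assumptions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration (1 = good output, 0 = bad output).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]
print(auroc(toy_scores, toy_labels))  # 5 of the 6 (good, bad) pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; a production evaluation would more likely use a rank-based implementation from an established library.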

We had been running these evaluations for some weeks and reporting one AUROC per corpus. Then one day the project's principal asked a question that should have come up earlier:

We have a self-learning component that persists state between runs. It has seen FoVer many times. Could that self-learning have internalized the benchmark? Are we measuring our architecture, or are we measuring what our architecture has memorized?

The honest answer was: we did not know. And the more we looked at the pipeline, the more places we found where the answer could plausibly be "both, and we cannot tell them apart."

What the self-learning actually does

The relevant component is internally called the FR-11 relay. Its job is to feed examples discovered during one experiment back into the verifier ensemble's working memory, so that future experiments can learn from prior failure modes. In its current shape it has four layers:

Relay. Recently-seen (input, energy, verdict) triples accumulate in an append-only log that the next run reads.
Constraint memory. Structural constraints (Z3-clause-shaped) extracted from past errors persist in a keyed store. Future verifier calls re-use them.
Fast-path heuristic. A JEPA-style joint-embedding cache that lets the ensemble short-circuit on inputs similar to ones it has scored before.
Adaptive energy landscape. Per-verifier weighting that slowly tunes itself based on which verifier was carrying the load on which input class.

All four layers are designed to make the system better at its actual job over time. By construction, all four also mean that the system scoring the benchmark on Monday is not the system scoring it on Friday.

The leakage question is not paranoid. If the relay accumulates examples drawn from the same distribution as the test set — and FoVer evaluations are themselves drawn from FoVer — then the state file on Friday is not independent of the test set. A single AUROC measurement cannot tell you whether the model is good or whether the memory has cached the answers.
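One crude way to put a number on this worry is to ask what fraction of the test inputs already appear in the relay log. The sketch below assumes the log holds (input, energy, verdict) triples as described above and uses exact matching after light normalization; the function names are hypothetical, and near-duplicates or paraphrases would slip past it.

```python
import hashlib
from typing import Iterable, Tuple


def _fingerprint(text: str) -> str:
    # Normalize case and whitespace so trivial formatting changes do not hide a repeat.
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def relay_overlap(relay_log: Iterable[Tuple[str, float, str]],
                  test_inputs: Iterable[str]) -> float:
    """Fraction of test inputs whose fingerprint already appears in the relay log."""
    seen = {_fingerprint(inp) for inp, _energy, _verdict in relay_log}
    test = list(test_inputs)
    if not test:
        return 0.0
    hits = sum(1 for t in test if _fingerprint(t) in seen)
    return hits / len(test)


# Invented example data.
example_log = [("2 + 2 = 4", 0.1, "accept"), ("The sky is green", 3.2, "reject")]
example_test = ["2 + 2  = 4", "Water is wet"]
print(relay_overlap(example_log, example_test))
```

A nonzero overlap does not prove the score is inflated, and a zero overlap does not prove it is clean. It only says how much of the test set the state file has literally seen.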

The dual-condition protocol

The fix is mechanical, and we should have done it from the beginning. Every benchmark run now reports two AUROCs:

Condition A — production. The verifier ensemble runs with all four self-learning layers active. State files are restored from the last run. This is the number that matters if you are deploying Carnot as it actually ships.
Condition B — architecture-only. The verifier ensemble runs from a cold start. All persisted state is purged; the relay log is empty, the constraint memory is empty, the JEPA cache is cleared, the per-verifier weights are reset to their initialized defaults. This is the number that matters if you want to know what the architecture is doing independent of what it has accumulated.

Both numbers are computed against the same prompts, the same seeds, and the same model weights. The only difference is whether the verifier ensemble has been given access to its own history.
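The two conditions can be sketched as a small harness. Everything named here is hypothetical: `VerifierState` stands in for the four persisted layers, `evaluate` is whatever function scores one seed and returns an AUROC, and `fake_evaluate` is a stub so the sketch runs on its own. Starting every Condition A seed from the same restored snapshot is an assumption of this sketch, not something the protocol above specifies.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class VerifierState:
    """Hypothetical stand-in for the four persisted layers."""
    relay_log: List[tuple] = field(default_factory=list)
    constraint_memory: Dict[str, str] = field(default_factory=dict)
    embedding_cache: Dict[str, float] = field(default_factory=dict)
    # Empty here stands for "initialized defaults"; a real system would store them explicitly.
    verifier_weights: Dict[str, float] = field(default_factory=dict)

    def is_cold(self) -> bool:
        return not (self.relay_log or self.constraint_memory
                    or self.embedding_cache or self.verifier_weights)


EvalFn = Callable[[VerifierState, int], float]  # (state, seed) -> AUROC


def run_dual_condition(evaluate: EvalFn, restored: VerifierState,
                       seeds: Sequence[int]) -> Dict[str, List[float]]:
    cond_a: List[float] = []
    cond_b: List[float] = []
    for seed in seeds:
        # Condition A: history available. Copy so one seed cannot mutate the next seed's state.
        cond_a.append(evaluate(copy.deepcopy(restored), seed))
        # Condition B: cold start, same seed, same evaluate function.
        cold = VerifierState()
        if not cold.is_cold():
            raise RuntimeError("Condition B must start from empty state")
        cond_b.append(evaluate(cold, seed))
    return {"condition_a": cond_a, "condition_b": cond_b}


def fake_evaluate(state: VerifierState, seed: int) -> float:
    # Stub for illustration only: pretends that having history helps.
    return 0.75 if state.relay_log else 0.5
```

Calling `run_dual_condition(fake_evaluate, warm_state, [0, 1, 2])` returns one list of scores per condition, paired by seed, which is the shape the gap computation needs.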

The gap between them is the learning contribution: A − B. It is the number of AUROC points that the self-learning layers add to the architecture-only score. It is also the number of AUROC points that would silently inflate the production score if we reported only one column.
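With several seeds per condition, the gap can be summarized from paired per-seed scores. The following is a sketch under assumptions: the function name and the example numbers are invented, and reporting the standard deviation of the paired differences is one reasonable choice rather than a description of our pipeline.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def learning_contribution(cond_a: Sequence[float],
                          cond_b: Sequence[float]) -> Dict[str, float]:
    """Summarize A - B from per-seed AUROCs, paired by seed."""
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions must use the same seeds")
    if len(cond_a) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    deltas = [a - b for a, b in zip(cond_a, cond_b)]
    return {
        "mean_a": mean(cond_a),
        "mean_b": mean(cond_b),
        "delta_mean": mean(deltas),
        "delta_std": stdev(deltas),  # sample standard deviation of the paired gaps
    }


# Invented numbers for illustration, not results from any Carnot run.
example_a = [0.90, 0.92, 0.91]
example_b = [0.88, 0.91, 0.88]
print(learning_contribution(example_a, example_b))
```

Pairing by seed matters: the spread of the per-seed differences is usually tighter than what you would infer from the two columns' standard deviations taken separately.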

What the matrix looks like

We are currently mid-expansion from FoVer-only into a broader corpus set: MBPP, HumanEval, TruthfulQA, HaluEval, FEVER. Some rows are clean; others are still under the precondition gate (the dual-condition protocol is strict, and it blocks a row rather than guessing). The shape of the matrix, as we ship it, is this:

Corpus	Cond A (production)	Cond B (architecture)	A − B
FoVer	0.9131	0.8947	+0.0185
HaluEval / FEVER	reported	reported	delta visible
TruthfulQA	reported	reported	delta visible
MBPP	pilot only	pilot only	—
HumanEval	pilot only	pilot only	—

Update, 2026-05-23 (one day after original posting). The FoVer numbers are now anchored to a 5-seed dual-condition run with explicit reproducibility checksums: production AUROC 0.9131 (std 0.0075), architecture-only AUROC 0.8947 (std 0.0075), delta +0.0185. The original FoVer headline reported in v2 of the paper draft was 0.9857; we repinned it downward to 0.9131 because the 5-seed dual-condition run is what we can actually defend. The 7.3 percentage-point gap between the v2 number and the v3 number is the kind of correction the dual-condition discipline exists to surface, and we would rather report it than not. TruthfulQA has moved from "in progress" to "reported" via the milestone `.276 cross-corpus matrix v10 corrigendum. MBPP and HumanEval remain pilot-only: the verifier ensemble produces meaningful ranking signal on these corpora (AUPRC 0.889 on code corpora vs 0.879 on FoVer in an apples-to-apples comparison), but the base model's absolute pass rate is too low to support a clean dual-condition headline yet. The protocol this post describes is what surfaced both findings; closing the gap to a clean MBPP / HumanEval row is the next experiment, not a methodology change.

The specific numerical values live in the repository's results/ directory under experiment_*_cross_corpus_matrix*.json and are also surfaced in the running technical report. We keep the matrix machine-readable so that anyone can recompute the deltas from the raw artifacts. We are not the final authority on our own numbers, and we would rather you did not have to take us at our word.
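Recomputing the deltas could look roughly like the sketch below. The file pattern and the two AUROC key names come from this post; the rest of the schema is assumed, namely that each file is a flat JSON object with an optional `corpus` field and the two keys at top level. The actual artifacts may nest these differently, so treat this as a starting point.

```python
import json
from pathlib import Path
from typing import Dict, List, Union

A_KEY = "condition_a_production_auroc"
B_KEY = "condition_b_architecture_only_auroc"


def recompute_deltas(results_dir: Union[str, Path]) -> List[Dict]:
    """Rebuild the A - B column from every matching artifact in results_dir."""
    rows: List[Dict] = []
    pattern = "experiment_*_cross_corpus_matrix*.json"
    for path in sorted(Path(results_dir).glob(pattern)):
        artifact = json.loads(path.read_text(encoding="utf-8"))
        a = artifact.get(A_KEY)
        b = artifact.get(B_KEY)
        # A row with only one column gets no delta rather than a guessed one.
        delta = round(float(a) - float(b), 4) if a is not None and b is not None else None
        rows.append({
            "file": path.name,
            "corpus": artifact.get("corpus", "unknown"),  # assumed field name
            "a": a,
            "b": b,
            "delta": delta,
        })
    return rows
```

One caveat when checking our table this way: a delta recomputed from four-decimal columns can differ in the last digit from one taken before rounding. For the FoVer row, 0.9131 − 0.8947 gives 0.0184, while the published +0.0185 presumably reflects the unrounded values.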

What the gap actually tells you

There are three regimes worth distinguishing:

Small gap, both columns moderate. The architecture is doing most of the work; the self-learning is a polish. This is what you want to see on a corpus the system has rarely encountered.
Large gap, production high, architecture low. The system can do the task only because it remembers the task. This is what you would see on a corpus the relay has heavily sampled from. It is not necessarily bad — it is what the self-learning was designed to do — but it is not the same thing as architectural capability.
Large gap, architecture surprisingly low. The system has become structurally dependent on its own accumulated state. The architecture alone has lost the ability it once had. This is the failure mode the dual-condition protocol exists to surface, and it is exactly the failure mode the principal's question was probing for.
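The three regimes can be turned into a rough triage function. The thresholds below (`large_gap`, `degradation`) and the regime labels are hypothetical choices made for this sketch, not values we use. The third regime needs an earlier architecture-only score to compare against, since "lost the ability it once had" cannot be read off a single run.

```python
from typing import Optional


def classify_gap(a: float, b: float, b_baseline: Optional[float] = None,
                 large_gap: float = 0.05, degradation: float = 0.03) -> str:
    """Map (Condition A, Condition B) AUROCs onto the three regimes.

    b_baseline is an earlier Condition B score for the same corpus, if one exists.
    """
    gap = a - b
    if gap < large_gap:
        return "architecture-led"   # small gap: self-learning is a polish
    if b_baseline is not None and (b_baseline - b) >= degradation:
        return "state-dependent"    # large gap and Condition B has dropped over time
    return "memory-carried"         # large gap, no evidence the architecture got worse


print(classify_gap(0.9131, 0.8947))                  # FoVer row from the matrix
print(classify_gap(0.95, 0.70))                      # invented numbers
print(classify_gap(0.95, 0.70, b_baseline=0.85))     # invented numbers
```

The useful part is less the labels than the requirement it makes explicit: telling the second regime from the third requires keeping a history of Condition B, not just the current value.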

The third regime is also the one Hector Zenil's recent work predicts: a self-distillation loop without an exogenous grounding signal collapses into its own state. The grounding signal in our framework is the verifier's residual error rate — if the verifier becomes too confident in itself, the loop has no pull toward truth and the architecture-only column degrades. Reporting Condition B is the audit trail that lets us notice before it gets bad.

Why this is not standard practice

We have not seen this protocol in the verification literature, and we think the reason is structural rather than principled. Most published verifier benchmarks are evaluated as one-shot pipelines: there is no persistent state, no relay log, no JEPA cache. Condition A and Condition B are the same number by construction.

But the moment a system has a self-improving component — and most ambitious agentic systems in 2026 do — the two numbers diverge. We think any paper claiming an AUROC on a benchmark while shipping a system with a relay or a cache should publish both columns. The cost is one extra evaluation run per benchmark. The benefit is that the number you report is not silently overstating what the architecture is doing.

The methodological claim. Any system that accumulates state between runs against the same benchmark family is, in expectation, going to look better at the benchmark over time. The interesting question is whether it is also looking better at the task. The only way to answer that question is to evaluate it without the state.

What we changed in the loop

The protocol is enforced by three small pieces of infrastructure:

Every experiment whose task spec is "measure AUROC on corpus X" must now emit both condition_a_production_auroc and condition_b_architecture_only_auroc in its results JSON. A pre-commit hook refuses any artifact that contains one without the other.
The Condition B run is gated on a precondition that all four persistent state files have been purged for the duration of the run. The agent verifies this directly (file sizes, hash of an empty seed) and writes the verification into the artifact's preconditions_checked block.
The cross-corpus matrix that we publish to the running technical report is generated from these two columns. The gap is computed mechanically; we do not manually transcribe numbers, so we cannot accidentally report just one.
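The first two checks can be sketched in a few lines. The two key names are the ones given above; the function names, the report layout, and the reading of "hash of an empty seed" as the SHA-256 of zero bytes are assumptions made for this illustration, and the real hook may check more than this.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Union

REQUIRED_KEYS = (
    "condition_a_production_auroc",
    "condition_b_architecture_only_auroc",
)
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def missing_columns(artifact: Dict) -> List[str]:
    """Return the required AUROC keys that are absent or null in a results artifact."""
    return [k for k in REQUIRED_KEYS if artifact.get(k) is None]


def check_purged(state_files: Iterable[Union[str, Path]]) -> Dict[str, Dict]:
    """Record size and hash for each state file; 'purged' means zero bytes or absent."""
    report: Dict[str, Dict] = {}
    for raw in state_files:
        path = Path(raw)
        # A missing file is treated as empty in this sketch.
        data = path.read_bytes() if path.exists() else b""
        report[str(path)] = {
            "size_bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
            "purged": len(data) == 0,
        }
    return report
```

A pre-commit hook built on this would exit non-zero whenever `missing_columns` returns a non-empty list, and a Condition B run would write the `check_purged` report into its preconditions_checked block and abort if any entry has `purged` set to false.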

None of this is novel as an idea. What is non-obvious is that the practice is needed at all, and that the cost of not doing it is silent.

Honest disclosure

We did not know what the gap would look like on every corpus when this post was first written. We now know, on the corpora the matrix has filled in. The gap is small on FoVer (+0.0185 AUROC), consistent with the self-learning relay providing a modest polish on top of the architecture's underlying capability rather than carrying the work. TruthfulQA filled in over the following milestone. MBPP and HumanEval are still pilot-only, not because the dual-condition protocol fails on them, but because the base model's absolute pass rate on those corpora is too low (~7.5% pass@1 on code corpora) to make either column meaningful as a standalone headline.

The protocol did what it was supposed to do: it caught a load-bearing v2 headline number (FoVer AUROC 0.9857) that was not defensible under the dual-condition framing, and it produced the repinned v3 number (0.9131) that is. If the gap had turned out uncomfortably large, with production AUROC far above architecture-only AUROC, we would have published that too. The point of reporting two columns is that they are not optional, regardless of which way the gap points.