
Two Retractions and a Rescue: A Pre-Submission Adversarial Audit

2026-05-23 ~11 min read

We paid for a hostile adversarial audit of our paper draft two days before we expected to submit. The audit returned seven fatal findings against our own claims. We ran three rescue measurements on the seams those findings exposed. Two of the measurements forced us to retract load-bearing claims we had been preparing to publish. One rescued a claim we had been preparing to walk back. This is what happened, what the paper now says, and what we think generalizes from the methodology.

The setup

Carnot's paper-v6 draft had hit a comfortable spot. The autonomous research loop had produced seven consecutive milestone capstones that flagged themselves paper_ready=true. The underlying claims had real artifacts behind them: a verified KV260 FPGA bitstream sampling at 24 microseconds per sample, a five-seed dual-condition FoVer AUROC of 0.9131, a cross-corpus matrix with eight headline-eligible rows, four landed Phase-4 active-inference artifacts.

We had also, three weeks earlier, run an adversarial Deep Think round against the architecture before the prototype existed. That round produced five fatal findings, one of which was entirely outside the eight problem categories the audit had been told to look for. The experience taught us a discipline: pay for the hostile review before reviewers run it for free, and reserve a slot in the audit prompt for "anything outside the categories I enumerated."

When the autonomous loop hits seven consecutive paper_ready=true capstones, the right move is not to trust the streak. It is to run that hostile review again, against the paper draft itself, with the empirical anchors the prior round did not have. That is what we did.

The audit

The audit prompt was structured to mirror the 2026-04-30 round. Same hostile-reviewer framing — "you are a reviewer at NeurIPS / ICML / ICLR trying to torch this paper, be ruthless." Same eight enumerated attack surfaces, this time targeted at the seams between our empirical anchors and the claims those anchors were going to support. Same explicit "bonus points for findings outside the eight" slot.

Eight attack surfaces, in shape:

  1. Does the 24 microsecond / sample anchor on the FPGA actually produce Boltzmann samples, or is it the time to enter a non-equilibrium steady state?
  2. Is the CPU comparator we used for the speedup claim running the same algorithm as the FPGA, or are we measuring apples-to-oranges?
  3. Does the verifier ensemble's high AUROC on FoVer survive on code corpora where the base model fails 92.5% of the time?
  4. Does our k=15 verifier ensemble share a joint blind spot on modalities outside the six we have measured on?
  5. Are Phase-4 active-inference claims (validated on a continuous sampler) being improperly cited in defense of FPGA-deployment claims (a discrete sampler)?
  6. Is the FPGA's 1.5%-p95-vs-median latency margin too tight to represent real MCMC mixing time?
  7. Is the "seven consecutive paper_ready=true" streak load-bearing for reviewers, or self-Goodharting?
  8. Does the paper's "hardware sovereignty" framing survive the proprietary-toolchain dependency chain from pip install to a running KV260 board?

Plus the ninth slot: "anything outside the eight."
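
The prompt shape described above (hostile framing, enumerated attack surfaces, one open slot) can be sketched as a small builder. This is an illustration only: `build_audit_prompt` and its layout are hypothetical, not the scaffolding we actually ran; only the quoted framing sentence comes from the prompt itself.

```python
# Hypothetical prompt scaffold: framing, numbered attack surfaces, one open slot.
HOSTILE_FRAMING = (
    "you are a reviewer at NeurIPS / ICML / ICLR trying to torch this paper, "
    "be ruthless."
)


def build_audit_prompt(attack_surfaces, framing=HOSTILE_FRAMING):
    """Return a prompt with every enumerated surface plus an unenumerated slot."""
    lines = [framing, "", "Enumerated attack surfaces:"]
    for number, surface in enumerate(attack_surfaces, start=1):
        lines.append(f"  {number}. {surface}")
    # The final slot is deliberately open-ended: it asks for what we did not list.
    open_slot_number = len(attack_surfaces) + 1
    lines.append(
        f"  {open_slot_number}. Bonus points for findings outside the "
        f"{len(attack_surfaces)} surfaces listed above."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_audit_prompt(["Is the sampler actually sampling?",
                              "Is the CPU comparator running the same algorithm?"]))
```

Generating the open slot from the length of the list means it cannot be dropped by accident when the enumerated surfaces change between rounds.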

The findings

Ten findings returned. Seven fatal, two degrading, one cosmetic. Two of the seven fatal findings were in the ninth slot, outside the eight enumerated surfaces. The 2026-04-30 round had produced one unprompted finding; this round produced two.

We are going to ignore the two degrading and the one cosmetic finding in this post — they were textual, and they are now in the paper. The interesting story is the seven fatal findings and what we learned by trying to falsify them empirically.

The three measurements

Four of the seven fatal findings could be rescued by textual narrowing alone — replace one framing with another, add a firewall paragraph, drop a deferred-hardware claim. Three required actual measurements: an MMD test of the FPGA sampler distribution, a same-schedule CPU comparator, and a base-rate- corrected AUPRC analysis on code corpora.

We ran them. Here is what they showed.

Retraction #1: the FPGA sampler is not sampling

The audit's first fatal finding extended a concern the 2026-04-30 round had raised about synchronous parallel Glauber on frustrated bipartite Ising graphs. The earlier round had argued theoretically that such samplers violate detailed balance and converge to non-equilibrium steady states rather than the Boltzmann distribution. The empirical question now: does our actual 24 microseconds / sample anchor on the KV260 board produce Boltzmann samples, or non-equilibrium states masquerading as them?
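
The theoretical concern can be seen exactly on a toy case. The sketch below is not one of our n=64 problems: it is a two-spin ferromagnet with assumed values beta = J = 1, small enough that the stationary distribution of each update schedule can be computed from its 4x4 transition matrix instead of being estimated from samples.

```python
import itertools

import numpy as np

BETA, J = 1.0, 1.0  # toy values chosen for illustration, not from the paper
STATES = list(itertools.product([-1, 1], repeat=2))  # (s1, s2)


def p_spin(value, neighbour):
    """Conditional P(s_i = value | s_j = neighbour) for energy E = -J * s1 * s2."""
    return 1.0 / (1.0 + np.exp(-2.0 * BETA * J * value * neighbour))


def transition_matrix(synchronous):
    T = np.zeros((4, 4))
    for a, (s1, s2) in enumerate(STATES):
        for b, (t1, t2) in enumerate(STATES):
            if synchronous:
                # Both spins read the OLD configuration and flip together.
                T[a, b] = p_spin(t1, s2) * p_spin(t2, s1)
            else:
                # Sequential Gibbs: s1 updates first, then s2 reads the NEW s1.
                T[a, b] = p_spin(t1, s2) * p_spin(t2, t1)
    return T


def stationary(T, n_steps=2000):
    pi = np.full(4, 0.25)
    for _ in range(n_steps):
        pi = pi @ T
    return pi


def p_aligned(pi):
    """Probability that the two spins agree under distribution pi."""
    return float(sum(p for p, (s1, s2) in zip(pi, STATES) if s1 == s2))


boltzmann = np.array([np.exp(BETA * J * s1 * s2) for s1, s2 in STATES])
boltzmann /= boltzmann.sum()

if __name__ == "__main__":
    print("Boltzmann   P(aligned):", p_aligned(boltzmann))
    print("Sequential  P(aligned):", p_aligned(stationary(transition_matrix(False))))
    print("Synchronous P(aligned):", p_aligned(stationary(transition_matrix(True))))
```

In this toy, the sequential schedule recovers the Boltzmann alignment probability, while the synchronous schedule settles on a uniform distribution in which the spins agree only half the time, regardless of the coupling strength.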

We ran a Maximum Mean Discrepancy (MMD) test comparing exact CPU sequential Gibbs (detailed-balance preserving, the correct sampler) against the KV260's synchronous parallel Glauber, on the same three n=64 Ising problems the original anchor used, with the bitstream SHA-256 cited and a reproducibility checksum recorded. Three seeds, 10,000 samples per seed per condition.

| Seed | MMD² | MMD p-value | KS statistic | KS p-value |
| --- | --- | --- | --- | --- |
| 42 | 1.385 | 0.001 | 0.998 | 0.0 |
| 137 | 1.432 | 0.001 | 0.999 | 0.0 |
| 271 | 1.370 | 0.001 | 0.996 | 0.0 |

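A Kolmogorov-Smirnov statistic of 0.998 means the two empirical cumulative distributions are almost completely separated: at some point one has accumulated nearly all of its mass while the other has accumulated almost none. The MMD p-values sat at the resolution floor of the permutation test (1/1001, about 0.001) on all three seeds: the test cannot report a smaller p-value than this.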
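
For readers who have not met the permutation floor before, here is a minimal sketch of an MMD permutation test. It assumes an RBF kernel with a median-heuristic bandwidth and the biased MMD² estimator; those choices, and every function name, are illustrative assumptions rather than a description of the pipeline behind the table.

```python
import numpy as np


def rbf_kernel_matrix(z, bandwidth=None):
    """Pairwise RBF kernel over pooled samples z with shape (n_samples, n_dims)."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        # Median heuristic (assumes the pooled samples are not all identical).
        bandwidth = np.sqrt(0.5 * np.median(sq[sq > 0]))
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2(K, idx_x, idx_y):
    """Biased MMD^2 estimate from a precomputed pooled kernel matrix."""
    kxx = K[np.ix_(idx_x, idx_x)].mean()
    kyy = K[np.ix_(idx_y, idx_y)].mean()
    kxy = K[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy


def mmd_permutation_test(x, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    K = rbf_kernel_matrix(np.concatenate([x, y]))
    idx = np.arange(K.shape[0])
    observed = mmd2(K, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(idx)  # shuffle which sample each point belongs to
        if mmd2(K, perm[:n], perm[n:]) >= observed:
            exceed += 1
    # The +1 counts the observed labelling itself, so p can never fall below
    # 1 / (n_perm + 1).
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=(40, 1))
    y = rng.normal(5.0, 1.0, size=(40, 1))
    print(mmd_permutation_test(x, y, n_perm=200))
```

With 1,000 permutations the smallest reportable p-value is 1/1001, which is why a p-value at that floor says "no permutation came close" rather than giving a precise tail probability.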

The FPGA samples are statistically distinguishable from the target Boltzmann distribution at every seed we tested. We had been preparing to claim "approximately exact sampling on silicon." We are not. The FPGA produces samples on a fixed-compute schedule that does not converge to the target distribution. The 24 microseconds per sample anchor is real wall-clock; what it measures is not what we had been representing it as.

Paper-v6 retracts the "exact sampling on FPGA" claim. The FPGA outputs are re-framed as fixed-compute heuristic samples drawn from a non-equilibrium steady state, useful as an energy-evaluation oracle but not as a sampler from the Boltzmann distribution. This is a real narrowing of what the hardware section says.

Retraction #2: the FPGA is not faster than CPU at our scale

The audit's fourth fatal finding observed that our earlier speedup claim — KV260 was "speedup-claim-eligible" vs CPU — had used a CPU baseline running exact sequential Gibbs (the correct algorithm) while the FPGA ran synchronous parallel Glauber (the structurally broken algorithm we had just retracted above). The two were not measuring the same thing.

The honest comparison is apples-to-apples: CPU running the same synchronous parallel schedule the FPGA runs.

| Substrate | Algorithm | Per-sample median | Per-sample p95 |
| --- | --- | --- | --- |
| KV260 FPGA | Synchronous parallel Glauber | 24.000 µs | 24.380 µs |
| CPU (single-threaded Python / NumPy) | Synchronous parallel Glauber (same schedule) | 23.574 µs | 33.833 µs |

The apples-to-apples speedup at n=64 spins is therefore 0.982× (CPU median over FPGA median, 23.574 / 24.000). The FPGA is approximately two percent slower than the CPU running the same algorithm.
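
A same-schedule CPU comparator can be sketched in a few lines of NumPy. Everything problem-specific here is an assumption made for illustration: the random dense couplings, zero fields, beta = 1, and one sweep per sample are placeholders, and the function names are hypothetical. Only the two median latencies in the final line come from the table.

```python
import time

import numpy as np


def synchronous_glauber_sweep(spins, couplings, fields, beta, rng):
    """Update every spin at once, each reading the PREVIOUS configuration."""
    local_field = couplings @ spins + fields
    p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * local_field))
    return np.where(rng.random(spins.shape[0]) < p_up, 1, -1)


def per_sample_latency_us(n_spins=64, sweeps_per_sample=1, n_samples=2000, seed=0):
    """Return (median, p95) wall-clock microseconds per sample on this machine."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.normal(size=(n_spins, n_spins)), 1)
    couplings = upper + upper.T  # symmetric, zero diagonal (placeholder problem)
    fields = np.zeros(n_spins)
    spins = rng.choice([-1, 1], size=n_spins)
    timings = np.empty(n_samples)
    for i in range(n_samples):
        start = time.perf_counter()
        for _ in range(sweeps_per_sample):
            spins = synchronous_glauber_sweep(spins, couplings, fields, 1.0, rng)
        timings[i] = (time.perf_counter() - start) * 1e6
    return float(np.median(timings)), float(np.percentile(timings, 95))


def speedup(cpu_median_us, fpga_median_us):
    """Values above 1 mean the FPGA is faster than the CPU."""
    return cpu_median_us / fpga_median_us


if __name__ == "__main__":
    print("local CPU (median, p95) in us:", per_sample_latency_us())
    print("reported speedup:", round(speedup(23.574, 24.000), 3))
```

Absolute timings from a harness like this depend on the machine, the BLAS build, and what counts as one sample, so only a ratio measured under a matched schedule is meaningful.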

Cross-check: the same algorithm on both substrates should produce statistically identical energy distributions, since they are both sampling the same non-equilibrium steady state. They do. KS p-value 1.0, MMD² 0.0. The slowdown measurement is honest; the FPGA is, at our current dimensionality, simply slower.

There is no FPGA speedup over CPU at our actual production latent dimensions (d in {128, 256}). The KV260's per-sample latency is roughly constant in n while CPU's scales linearly; the crossover lives somewhere around n = 240 spins on this particular CPU, well above where the architecture currently operates. The "hardware speedup" framing was extrapolating across that crossover.

Paper-v6 retracts the speedup claim at current d and reframes the KV260 as a proof-of-concept functional simulator anchoring a future high-N deployment trajectory. The 24 microseconds is still a real measurement; what it is evidence for changes.

Rescue: the verifier on code corpora is real

The audit's fifth fatal finding was the prediction we expected to cost us the most. The verifier ensemble's headline AUROC on FoVer is 0.9131. The base model's pass-rate on code corpora is 7.5%. At a 92.5% negative base rate, an AUROC of 0.91 implies (under reasonable operating-point assumptions) a positive predictive value below 50 percent. Under the natural reading: when our verifier approves generated code, it is more likely wrong than right. The finding called this the hallucination multiplier failure mode. We had been preparing to retract the code-corpus active-inference claim.
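
The arithmetic behind that worry is Bayes' rule applied at a fixed operating point. In the sketch below the 7.5% pass rate is ours; the (TPR = 0.85, FPR = 0.15) operating point is a hypothetical chosen only to show the effect, not a point read off our ROC curve.

```python
PASS_RATE = 0.075  # positive base rate on code corpora


def ppv(tpr, fpr, prevalence):
    """Positive predictive value at one operating point, via Bayes' rule."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def max_fpr_for_ppv(target_ppv, tpr, prevalence):
    """Largest false-positive rate that still reaches target_ppv."""
    return tpr * prevalence * (1.0 - target_ppv) / (target_ppv * (1.0 - prevalence))


if __name__ == "__main__":
    # Hypothetical operating point: looks respectable, yet most approvals are wrong.
    print("PPV at TPR=0.85, FPR=0.15:", round(ppv(0.85, 0.15, PASS_RATE), 3))
    # Even with perfect recall, the FPR has to be small to cross PPV = 0.5.
    print("FPR needed for PPV=0.5 at TPR=1.0:",
          round(max_fpr_for_ppv(0.5, 1.0, PASS_RATE), 3))
```

At this base rate the hypothetical point gives a PPV of roughly 0.31, and crossing PPV = 0.5 requires a false-positive rate of about 8% or less even at full recall. Whether the verifier operates in that regime is an empirical question that AUROC alone does not answer.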

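We ran an AUPRC analysis at the empirical 92.5% negative base rate. AUPRC is the right metric here because precision, unlike the true- and false-positive rates behind AUROC, depends directly on the base rate. The audit's PPV math started from AUROC and was a legitimate worry from that side; the AUPRC tells a different story.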

| Corpus | AUPRC | Random-baseline AUPRC | Max-F1 operating point |
| --- | --- | --- | --- |
| Code corpora | 0.889 | 0.075 | F1 = 0.94, PPV = 0.89, recall = 1.00 |
| FoVer (apples-to-apples comparison) | 0.879 | not reported | not reported |

Code-corpus AUPRC is 0.889 against a random baseline of 0.075. That is roughly a 12-times lift over the base rate. Maximum-F1 operating point gives PPV = 0.89 and recall = 1.0; the verifier reaches PPV = 0.5 (the "more likely right than wrong" threshold the finding worried about) at a much lower threshold with recall still saturated.
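
For reference, the quantities in that paragraph (AUPRC, its random baseline, the lift, and the max-F1 operating point) can be computed from labels and scores as follows. This is a minimal sketch using the average-precision estimate of AUPRC and ignoring tied scores; the function names are illustrative, and it is not the analysis code behind the table.

```python
import numpy as np


def precision_recall_points(labels, scores):
    """Precision and recall after accepting the top-k scores, for every k."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels)[order]
    true_pos = np.cumsum(y)
    k = np.arange(1, len(y) + 1)
    return true_pos / k, true_pos / y.sum(), y


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    precision, _, y = precision_recall_points(labels, scores)
    return float(precision[y == 1].mean())


def max_f1_point(labels, scores):
    """Return (F1, precision, recall) at the cutoff that maximises F1."""
    precision, recall, _ = precision_recall_points(labels, scores)
    f1 = 2.0 * precision * recall / (precision + recall + 1e-12)
    best = int(np.argmax(f1))
    return float(f1[best]), float(precision[best]), float(recall[best])


def lift_over_base_rate(auprc, prevalence):
    """A random ranker's expected AUPRC equals the positive prevalence."""
    return auprc / prevalence


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6]
    print(average_precision(labels, scores), max_f1_point(labels, scores))
    print("reported lift:", round(lift_over_base_rate(0.889, 0.075), 2))
```

The reported max-F1 point is internally consistent under the same definitions: PPV = 0.89 with recall = 1.00 gives F1 = 2 × 0.89 / 1.89, which rounds to 0.94.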

And the kicker: the verifier's code-corpus AUPRC (0.889) is higher than its FoVer AUPRC (0.879) in the same apples-to-apples computation. The ranking quality on code is at least as good as the ranking quality on the corpus we trained the verifier ensemble against. The Deep Think prediction had been operating on the AUROC framing alone; the AUPRC framing dissolves it.

The verifier ensemble is not a hallucination multiplier on code corpora. It is a strong ranking signal whose utility was hidden by the wrong choice of summary metric. The code-corpus active-inference claim survives; the paper now reports AUPRC alongside AUROC, with explicit base-rate framing, to defuse the same objection a reviewer would otherwise raise.

What the paper now says

The capstone for the milestone where these measurements landed contains two explicit lists: paper_v6_safe_claims and paper_v6_forbidden_claims. We do not edit those lists by hand; the autonomous loop generates them from the artifacts. They look like this.

| Defensible | Anchored by |
| --- | --- |
| KV260 is a POC functional simulator anchoring future high-N deployment. | exp2939 same-schedule speedup 0.982× |
| KV260 outputs are fixed-compute heuristic samples, not Boltzmann-thermalized samples. | exp2938 MMD distinguishable at p ≤ 0.001 |
| Verifier-ensemble code-corpus active inference is retainable: AUPRC = 0.889 vs base rate 0.075. | exp2940 (>11× lift) |
| FoVer dual-condition AUROC = 0.9131 (5-seed). | exp2837 via exp2940 |
| PolarFire SoC 500-clause constraint scorer hash-verified. | exp2941 |

| No longer claimed | Retracted by |
| --- | --- |
| KV260 samples reach Boltzmann thermalization. | exp2938 |
| KV260 hardware speedup over CPU at d in {128, 256}. | exp2939 |
| Phase-4 variational-free-energy bounds validate KV260 deployment. | Textual firewall: Phase-4 applies only to the continuous-sampler RTX 3090 deployment. |
| Extropic Z1 / photonic as the future production target. | Post-pivot architecture is Boolean-coupled; analog substrates cannot enforce discrete sign constraints. Future production target is digital ASICs / spatial FPGAs / digital Ising machines. |

The paper got narrower. The narrower paper is also one whose every load-bearing claim cites a specific artifact with a reproducibility checksum, an inference-substrate declaration, and an explicit retire-condition. The narrower paper is the defensible paper. The broader paper would have been the torched paper.
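
The "every load-bearing claim cites an artifact" property is easy to check mechanically. The sketch below is hypothetical: the `Claim` fields and `check_capstone` are illustrative names rather than the capstone's real schema, and the checksum, substrate, and retire-condition values are placeholders.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("artifact", "checksum", "substrate", "retire_condition")


@dataclass(frozen=True)
class Claim:
    text: str
    artifact: str = ""
    checksum: str = ""
    substrate: str = ""
    retire_condition: str = ""


def missing_fields(claim):
    """Names of required anchoring fields that are empty on this claim."""
    return [name for name in REQUIRED_FIELDS if not getattr(claim, name)]


def check_capstone(safe_claims, forbidden_claims):
    """Return (claims with missing anchors, claims listed as both safe and forbidden)."""
    unanchored = {c.text: missing_fields(c) for c in safe_claims if missing_fields(c)}
    contradictions = {c.text for c in safe_claims} & {c.text for c in forbidden_claims}
    return unanchored, contradictions


if __name__ == "__main__":
    paper_v6_safe_claims = [
        Claim(
            text="KV260 is a POC functional simulator anchoring future high-N deployment.",
            artifact="exp2939",
            checksum="<placeholder checksum>",
            substrate="<placeholder substrate>",
            retire_condition="<placeholder retire-condition>",
        ),
    ]
    paper_v6_forbidden_claims = [
        Claim(text="KV260 samples reach Boltzmann thermalization.", artifact="exp2938"),
    ]
    print(check_capstone(paper_v6_safe_claims, paper_v6_forbidden_claims))
```

A check like this cannot say whether a claim is true. It only refuses to let a claim onto the safe list without naming the evidence that would have to change for it to be retired.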

What we think generalizes

Three operational takeaways. None are about Carnot specifically.

  1. Pay for the hostile audit before reviewers do it for free. A pre-submission adversarial round costs the operator about thirty minutes of prompt-shepherding and a short period of writing the prompt scaffolding. A reviewer-surfaced version of any of the seven fatal findings would have cost a desk reject or a major-revisions cycle. The math is not close.
  2. Reserve a slot for "anything outside my categories." Two of the seven fatal findings here, and one of the five in the 2026-04-30 round, came from that slot. The auditor's enumerated categories are the questions the auditor is good at asking. The unenumerated slot is the question the auditor did not know to ask. Across two rounds in the same project, three of the twelve fatal findings — one in four — came from the unenumerated slot. That is too high a rate to leave the slot unfilled.
  3. Treat your own metrics as defendants, not witnesses. The verifier-on-code-corpus rescue happened because we re-asked the metric question (AUROC versus AUPRC) with the base rate explicit. The retraction on FPGA speedup happened because we re-asked the comparator question (CPU sequential Gibbs versus CPU synchronous parallel). In both cases the original framing was the natural one; the corrected framing was what a hostile reviewer would have demanded. Get to the corrected framing yourself, in private, before submission.

The capstone field that did not exist a week ago

The milestone capstone after this round had a new field that the earlier capstones did not: headline_outcome: narrow. The capstone, generated by the autonomous loop without operator intervention, explicitly tagged the milestone's paper-readiness as "narrower than before" — tracking the two retractions the audit had just forced. The loop is now reporting what it retracted, not just what it confirmed.

That field would not exist if the audit had never run. The audit exists because we set up the structure to run it. The structure exists because the 2026-04-30 round taught us what it cost not to have it. The lesson compounds.

About this post. This is the fifth in a series of operational notes from the Carnot project on building an autonomous research loop we can actually trust. The earlier four posts described the discipline machinery — how we catch fabricated artifacts, disguised verifier code, pre-prototype architectural flaws, and self-learning leakage in reported numbers. This one is about what happens when the discipline machinery fires on the paper itself.