Two Retractions and a Rescue: A Pre-Submission Adversarial Audit
We paid for a hostile adversarial audit of our paper draft two days before we expected to submit. The audit returned seven fatal findings against our own claims. We ran three rescue measurements on the seams those findings exposed. Two of those measurements retracted load-bearing claims we had been preparing to publish. The third rescued a claim we had been preparing to walk back. This is what happened, what the paper now says, and what we think the methodology generalizes to.
The setup
Carnot's paper-v6 draft had hit a comfortable spot. The autonomous research loop had produced seven consecutive milestone capstones that flagged themselves paper_ready=true. The underlying claims had real artifacts behind them: a verified KV260 FPGA bitstream sampling at 24 microseconds per sample, a five-seed dual-condition FoVer AUROC of 0.9131, a cross-corpus matrix with eight headline-eligible rows, and four landed Phase-4 active-inference artifacts.
We had also, three weeks earlier, run an adversarial Deep Think round against the architecture before the prototype existed. That round produced five fatal findings, one of which was entirely outside the eight problem categories the audit had been told to look for. The experience taught us a discipline: pay for the hostile review before reviewers run it for free, and reserve a slot in the audit prompt for "anything outside the categories I enumerated."
When the autonomous loop hits seven consecutive paper_ready=true capstones, the right move is not to trust the streak. It is to run that hostile review again, against the paper draft itself, with the empirical anchors the prior round did not have. That is what we did.
The audit
The audit prompt was structured to mirror the 2026-04-30 round. Same hostile-reviewer framing — "you are a reviewer at NeurIPS / ICML / ICLR trying to torch this paper, be ruthless." Same eight enumerated attack surfaces, this time targeted at the seams between our empirical anchors and the claims those anchors were going to support. Same explicit "bonus points for findings outside the eight" slot.
Eight attack surfaces, in shape:
- Does the 24 microsecond / sample anchor on the FPGA actually produce Boltzmann samples, or is it the time to enter a non-equilibrium steady state?
- Is the CPU comparator we used for the speedup claim running the same algorithm as the FPGA, or are we measuring apples-to-oranges?
- Does the verifier ensemble's high AUROC on FoVer survive on code corpora where the base model fails 92.5% of the time?
- Does our k=15 verifier ensemble share a joint blind spot on modalities outside the six we have measured on?
- Are Phase-4 active-inference claims (validated on a continuous sampler) being improperly cited in defense of FPGA-deployment claims (a discrete sampler)?
- Is the FPGA's 1.5%-p95-vs-median latency margin too tight to represent real MCMC mixing time?
- Is the "seven consecutive paper_ready=true" streak load-bearing for reviewers, or self-Goodharting?
- Does the paper's "hardware sovereignty" framing survive the proprietary-toolchain dependency chain from pip install to a running KV260 board?
Plus the ninth slot: "anything outside the eight."
The findings
Ten findings returned. Seven fatal, two degrading, one cosmetic. Two of the seven fatal findings were in the ninth slot, outside the eight enumerated surfaces. The 2026-04-30 round had produced one unprompted finding; this round produced two.
We are going to ignore the two degrading and the one cosmetic finding in this post — they were textual, and they are now in the paper. The interesting story is the seven fatal findings and what we learned by trying to falsify them empirically.
The three measurements
Four of the seven fatal findings could be rescued by textual narrowing alone: replace one framing with another, add a firewall paragraph, drop a deferred-hardware claim. Three required actual measurements: an MMD test of the FPGA sampler distribution, a same-schedule CPU comparator, and a base-rate-corrected AUPRC analysis on code corpora.
We ran them. Here is what they showed.
Retraction #1: the FPGA sampler is not sampling
The audit's first fatal finding extended a concern the 2026-04-30 round had raised about synchronous parallel Glauber on frustrated bipartite Ising graphs. The earlier round had argued theoretically that such samplers violate detailed balance and converge to non-equilibrium steady states rather than the Boltzmann distribution. The empirical question now: does our actual 24 microseconds / sample anchor on the KV260 board produce Boltzmann samples, or non-equilibrium states masquerading as them?
We ran a Maximum Mean Discrepancy test. Exact CPU sequential Gibbs (detailed-balance preserving, the correct sampler) vs the KV260's synchronous parallel Glauber, on the same three n=64 Ising problems the original anchor used, with the bitstream SHA-256 cited and reproducibility checksum recorded. Three seeds, 10,000 samples each per condition.
| Seed | MMD² | MMD p-value | KS statistic | KS p-value |
|---|---|---|---|---|
| 42 | 1.385 | 0.001 | 0.998 | 0.0 |
| 137 | 1.432 | 0.001 | 0.999 | 0.0 |
| 271 | 1.370 | 0.001 | 0.996 | 0.0 |
A Kolmogorov-Smirnov statistic of 0.998 means the two cumulative distributions barely overlap. The MMD p-values sat at the noise floor of the permutation test (1 / 1001 permutations) on all three seeds: the test cannot get more significant than this.
Paper-v6 retracts the "exact sampling on FPGA" claim. The FPGA outputs are re-framed as fixed-compute heuristic samples drawn from a non-equilibrium steady state, useful as an energy-evaluation oracle but not as a sampler from the Boltzmann distribution. This is a real narrowing of what the hardware section says.
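For readers who want to see the shape of such a test, here is a minimal sketch of a permutation-based MMD test in NumPy. It is not the exp2938 code. The RBF kernel, the bandwidth, the biased estimator, and the toy Gaussian data are all assumptions made for illustration; the post does not say which kernel or estimator the real measurement used. What the sketch does show is why the p-value bottoms out at 1 / (permutations + 1).

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(k, n):
    """Biased MMD^2 from a pooled kernel matrix; the first n rows are sample X."""
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()


def mmd_permutation_test(x, y, n_permutations=1000, bandwidth=1.0, seed=0):
    """Return (observed MMD^2, permutation p-value) for samples x and y."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = len(x)
    k = rbf_kernel(pooled, pooled, bandwidth)
    observed = mmd2_biased(k, n)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        # Shuffling rows and columns together relabels which points count as X.
        if mmd2_biased(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1
    # The observed statistic counts as one permutation, so p >= 1 / (B + 1).
    return observed, (exceed + 1) / (n_permutations + 1)


# Toy data (hypothetical, not the KV260 measurements): two clearly
# different 1-D distributions standing in for two samplers' outputs.
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=(200, 1))
y = rng.normal(3.0, 1.0, size=(200, 1))
mmd2, p = mmd_permutation_test(x, y, n_permutations=200)
print(f"MMD^2 = {mmd2:.3f}, p = {p:.4f}, floor = {1 / 201:.4f}")
```

When no permuted statistic reaches the observed one, the p-value equals the floor exactly, which is the situation the table above reports with 1,000 permutations.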
Retraction #2: the FPGA is not faster than CPU at our scale
The audit's fourth fatal finding observed that our earlier speedup claim — KV260 was "speedup-claim-eligible" vs CPU — had used a CPU baseline running exact sequential Gibbs (the correct algorithm) while the FPGA ran synchronous parallel Glauber (the structurally broken algorithm we had just retracted above). The two were not measuring the same thing.
The honest comparison is apples-to-apples: CPU running the same synchronous parallel schedule the FPGA runs.
| Substrate | Algorithm | Per-sample median | Per-sample p95 |
|---|---|---|---|
| KV260 FPGA | Synchronous parallel Glauber | 24.000 us | 24.380 us |
| CPU (single-threaded Python / NumPy) | Synchronous parallel Glauber (same schedule) | 23.574 us | 33.833 us |
The apples-to-apples speedup at n=64 spins is therefore 0.982× (CPU median 23.574 us divided by FPGA median 24.000 us). The FPGA is approximately two percent slower than the CPU running the same algorithm.
Cross-check: the same algorithm on both substrates should produce statistically identical energy distributions, since they are both sampling the same non-equilibrium steady state. They do. KS p-value 1.0, MMD² 0.0. The slowdown measurement is honest; the FPGA is, at our current dimensionality, simply slower.
Paper-v6 retracts the speedup claim at current d and reframes the KV260 as a proof-of-concept functional simulator anchoring a future high-N deployment trajectory. The 24 microseconds is still a real measurement; what it is evidence for changes.
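The difference between the two schedules in this section is easy to lose in prose, so here is a toy sketch of both. It assumes the standard Ising energy E(s) = -1/2 sᵀJs - hᵀs with heat-bath update probabilities; the function names and the two-spin example are ours for illustration and are not the n=64 problems or the FPGA schedule itself.

```python
import numpy as np


def p_up(field, beta):
    """Heat-bath probability that a spin is +1 given its local field."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * field))


def sequential_gibbs_sweep(s, J, h, beta, rng):
    """Update spins one at a time; each update sees the earlier updates."""
    s = s.copy()
    for i in range(len(s)):
        field = J[i] @ s + h[i]
        s[i] = 1 if rng.random() < p_up(field, beta) else -1
    return s


def synchronous_glauber_sweep(s, J, h, beta, rng):
    """Update every spin at once, all from the same old state."""
    fields = J @ s + h
    return np.where(rng.random(len(s)) < p_up(fields, beta), 1, -1)


def energy(s, J, h):
    """Ising energy E(s) = -1/2 s^T J s - h^T s."""
    return -0.5 * s @ J @ s - h @ s


# Toy case (hypothetical): two ferromagnetically coupled spins at low
# temperature, started anti-aligned.
J = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.zeros(2)
beta = 10.0
rng = np.random.default_rng(0)
s_seq = np.array([1, -1])
s_sync = np.array([1, -1])
for _ in range(20):
    s_seq = sequential_gibbs_sweep(s_seq, J, h, beta, rng)
    s_sync = synchronous_glauber_sweep(s_sync, J, h, beta, rng)
print("sequential :", s_seq, energy(s_seq, J, h))
print("synchronous:", s_sync, energy(s_sync, J, h))
```

In this toy case the sequential sweep settles into an aligned, low-energy state, while the synchronous sweep keeps swapping the two spins and stays anti-aligned. That is a small instance of the same schedule dependence the comparator had to control for: two substrates are only comparable when they run the same schedule.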
Rescue: the verifier on code corpora is real
The audit's fifth fatal finding was the prediction we expected to cost us the most. The verifier ensemble's headline AUROC on FoVer is 0.9131. The base model's pass-rate on code corpora is 7.5%. At a 92.5% negative base rate, an AUROC of 0.91 implies (under reasonable operating-point assumptions) a positive predictive value below 50 percent. Under the natural reading: when our verifier approves generated code, it is more likely wrong than right. The finding called this the hallucination multiplier failure mode. We had been preparing to retract the code-corpus active-inference claim.
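The arithmetic behind that worry is Bayes' rule applied to an operating point. A minimal sketch follows; the 7.5% prevalence is from our measurements, but the (TPR, FPR) pairs are hypothetical operating points chosen only to illustrate what an ROC curve near 0.91 might pass through. AUROC alone does not pin down any of them.

```python
def positive_predictive_value(tpr, fpr, prevalence):
    """PPV = TP / (TP + FP), expressed through rates and the base rate."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def max_fpr_for_ppv(tpr, prevalence, target_ppv):
    """Largest false-positive rate that still reaches target_ppv at this TPR."""
    return tpr * prevalence * (1.0 - target_ppv) / (target_ppv * (1.0 - prevalence))


prevalence = 0.075  # base-model pass rate on code corpora

# Hypothetical operating points, not measured values.
for tpr, fpr in [(0.85, 0.15), (0.80, 0.10), (0.90, 0.20)]:
    ppv = positive_predictive_value(tpr, fpr, prevalence)
    print(f"TPR={tpr:.2f} FPR={fpr:.2f} -> PPV={ppv:.3f}")

# How low the FPR must go for approvals to be right half the time.
print(f"FPR needed for PPV=0.5 at TPR=0.85: {max_fpr_for_ppv(0.85, prevalence, 0.5):.4f}")
```

At these illustrative points the PPV lands between roughly 0.27 and 0.39, and reaching PPV = 0.5 at TPR = 0.85 would need an FPR of about 0.069. That is why the finding could not be dismissed on the AUROC number alone.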
We ran an AUPRC analysis at the empirical 92.5% negative base rate. The right metric. The audit's PPV math was on AUROC and was a legitimate worry from that side; the AUPRC tells a different story.
| Corpus | AUPRC | Random baseline AUPRC | Max-F1 operating point |
|---|---|---|---|
| Code corpora | 0.889 | 0.075 | F1 = 0.94, PPV = 0.89, recall = 1.00 |
| FoVer (apples-to-apples comparison) | 0.879 | — | — |
Code-corpus AUPRC is 0.889 against a random baseline of 0.075. That is roughly a 12-times lift over the base rate. Maximum-F1 operating point gives PPV = 0.89 and recall = 1.0; the verifier reaches PPV = 0.5 (the "more likely right than wrong" threshold the finding worried about) at a much lower threshold with recall still saturated.
And the kicker: the verifier's code-corpus AUPRC (0.889) is higher than its FoVer AUPRC (0.879) in the same apples-to-apples computation. The ranking quality on code is at least as good as the ranking quality on the corpus we trained the verifier ensemble against. The Deep Think prediction had been operating on the AUROC framing alone; the AUPRC framing dissolves it.
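For concreteness, here is a sketch of how an AUPRC, its random baseline, and a max-F1 operating point can be computed with NumPy alone. This is not the exp2940 code: the average-precision estimator, the synthetic scores, and the helper names are assumptions for illustration, and ties between scores are not handled specially.

```python
import numpy as np


def _ranked_precision_recall(labels, scores):
    """Precision and recall after each item, ranked by descending score."""
    order = np.argsort(-scores, kind="stable")
    y = labels[order]
    true_pos = np.cumsum(y)
    precision = true_pos / np.arange(1, len(y) + 1)
    recall = true_pos / np.sum(y)
    return y, precision, recall


def average_precision(labels, scores):
    """AUPRC as the mean of the precision values at each positive item."""
    y, precision, _ = _ranked_precision_recall(labels, scores)
    return float(np.sum(precision * y) / np.sum(y))


def max_f1_point(labels, scores):
    """Return (F1, precision, recall) at the cutoff that maximizes F1."""
    _, precision, recall = _ranked_precision_recall(labels, scores)
    f1 = 2.0 * precision * recall / np.maximum(precision + recall, 1e-12)
    best = int(np.argmax(f1))
    return float(f1[best]), float(precision[best]), float(recall[best])


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Synthetic data (hypothetical): 7.5% positives with higher scores on average.
rng = np.random.default_rng(0)
labels = (rng.random(4000) < 0.075).astype(int)
scores = rng.normal(0.0, 1.0, size=4000) + 3.0 * labels

print("AUPRC          :", round(average_precision(labels, scores), 3))
print("random baseline:", round(float(labels.mean()), 3))  # equals the prevalence
print("max-F1 point   :", max_f1_point(labels, scores))

# Consistency checks on the reported numbers from the table above.
print("F1 at PPV=0.89, recall=1.00:", round(f1_from(0.89, 1.00), 2))
print("lift over base rate        :", round(0.889 / 0.075, 2))
```

The random baseline for AUPRC is the positive prevalence, which is why 0.075 is the right comparison point rather than 0.5. The last two lines recompute the reported F1 of 0.94 and a lift of about 11.85 from the published figures.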
What the paper now says
The capstone for the milestone where these measurements landed contains two explicit lists: paper_v6_safe_claims and paper_v6_forbidden_claims. We do not edit those lists by hand; the autonomous loop generates them from the artifacts. They look like this.
| Defensible | Anchored by |
|---|---|
| KV260 is a POC functional simulator anchoring future high-N deployment. | exp2939 same-schedule speedup 0.982× |
| KV260 outputs are fixed-compute heuristic samples, not Boltzmann-thermalized samples. | exp2938 MMD distinguishable at p ≤ 0.001 |
| Verifier-ensemble code-corpus active inference is retainable: AUPRC = 0.889 vs base rate 0.075. | exp2940 (>11× lift) |
| FoVer dual-condition AUROC = 0.9131 (5-seed). | exp2837 via exp2940 |
| PolarFire SoC 500-clause constraint scorer hash-verified. | exp2941 |

| No longer claimed | Retracted by |
|---|---|
| KV260 samples reach Boltzmann thermalization. | exp2938 |
| KV260 hardware speedup over CPU at d in {128, 256}. | exp2939 |
| Phase-4 variational-free-energy bounds validate KV260 deployment. | (textual firewall — Phase-4 applies only to the continuous-sampler RTX 3090 deployment) |
| Extropic Z1 / photonic as the future production target. | (post-pivot architecture is Boolean-coupled; analog substrates cannot enforce discrete sign constraints. Future production target is digital ASICs / spatial FPGAs / digital Ising machines.) |
The paper got narrower. The narrower paper is also one whose every load-bearing claim cites a specific artifact with a reproducibility checksum, an inference-substrate declaration, and an explicit retire-condition. The narrower paper is the defensible paper. The broader paper would have been the torched paper.
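One way to picture that requirement is as a record type with a completeness check. The sketch below is hypothetical: the Claim class, its field names, and the placeholder values are ours for illustration and are not the capstone's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Claim:
    """Hypothetical record for one load-bearing claim."""
    text: str
    anchored_by: str       # experiment id, e.g. "exp2938"
    checksum: str          # reproducibility checksum of the artifact
    substrate: str         # inference-substrate declaration
    retire_condition: str  # what result would retire the claim


def missing_fields(claim):
    """Names of the fields left empty, i.e. where the claim is not yet anchored."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name).strip()]


def all_anchored(claims):
    """True only if every claim has every field filled in."""
    return all(not missing_fields(c) for c in claims)


# Placeholder values in angle brackets stand in for data not shown in this post.
complete = Claim(
    text="KV260 outputs are fixed-compute heuristic samples.",
    anchored_by="exp2938",
    checksum="<checksum placeholder>",
    substrate="KV260 FPGA",
    retire_condition="<retire condition placeholder>",
)
incomplete = Claim(
    text="An example claim with no retire condition yet.",
    anchored_by="exp2939",
    checksum="<checksum placeholder>",
    substrate="CPU",
    retire_condition="",
)

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['retire_condition']
print(all_anchored([complete, incomplete]))  # False
```

The point of a check like this is that a claim with a missing field fails loudly before it reaches the draft, rather than being noticed by a reviewer.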
What we think generalizes
Three operational takeaways. None are about Carnot specifically.
- Pay for the hostile audit before reviewers do it for free. A pre-submission adversarial round costs the operator about thirty minutes of prompt-shepherding and a short period of writing the prompt scaffolding. A reviewer-surfaced version of any of the seven fatal findings would have cost a desk reject or a major-revisions cycle. The math is not close.
- Reserve a slot for "anything outside my categories." Two of the seven fatal findings here, and one of the five in the 2026-04-30 round, came from that slot. The auditor's enumerated categories are the questions the auditor is good at asking. The unenumerated slot is the question the auditor did not know to ask. Across two rounds in the same project, three of the twelve fatal findings — one in four — came from the unenumerated slot. That is too high a rate to leave the slot unfilled.
- Treat your own metrics as defendants, not witnesses. The verifier-on-code-corpus rescue happened because we re-asked the metric question (AUROC versus AUPRC) with the base rate explicit. The retraction on FPGA speedup happened because we re-asked the comparator question (CPU sequential Gibbs versus CPU synchronous parallel). In both cases the original framing was the natural one; the corrected framing was what a hostile reviewer would have demanded. Get to the corrected framing yourself, in private, before submission.
The capstone field that did not exist a week ago
The milestone capstone after this round had a new field that the earlier capstones did not: headline_outcome: narrow. The capstone, generated by the autonomous loop without operator intervention, explicitly tagged the milestone's paper-readiness as "narrower than before", tracking the two retractions the audit had just forced. The loop is now reporting what it retracted, not just what it confirmed.
That field would not exist if the audit had never run. The audit exists because we set up the structure to run it. The structure exists because the 2026-04-30 round taught us what it cost not to have it. The lesson compounds.
Further reading
- The full Deep Think round results: phase3-empirical-readiness-deep-think-results.md
- The audit prompt itself: phase3-empirical-readiness-deep-think-prompt.md
- The capstone with explicit safe / forbidden claim lists: experiment_2948_capstone_v277.json
- Predecessor post, the 2026-04-30 round against the pre-prototype architecture: Five FATAL Findings Three Deep Think Rounds Missed
- Companion posts on the discipline machinery that catches retractable claims in the first place: Caught Cheating (fabricated artifacts), Regex in an NTK Costume (disguised verifiers), Why We Report Two AUROCs Now (self-learning leakage).
- The rescue measurements as individual artifacts: exp2938 (MMD), exp2939 (same-schedule comparator), exp2940 (AUPRC).