Five FATAL Findings Three Deep Think Rounds Missed
We ran three rounds of rigorous theoretical review on a piece of our architecture. All three approved it. We then asked a fourth, blind-spot-focused review the obvious question — "what are we missing?" — and got five fatal findings, one of them entirely outside the eight problem categories the review had been told to look for. This is a note on what those five findings were, what the unprompted one in particular tells us about the limits of theoretical review, and why we now treat empirical instrumentation as a load-bearing part of every architecture decision.
The setup
The architecture under review was Phase 3 of the Carnot roadmap: a deterministic bounded autoencoder feeding a latent energy-based model, with the energy distilled down to a quadratic Ising form so the inner sampling loop could run on an FPGA. The motivation was speed. A discrete Ising sampler on dedicated silicon is several orders of magnitude faster than continuous Langevin dynamics on a GPU, and the Phase 3 deployment goal had a per-token latency budget that GPUs could not meet.
The architecture had been through three Deep Think rounds. Each round was a multi-prompt synthesis pass against a frontier reasoning model, focused on a different layer of the design: encoder geometry, EBM training stability, FPGA primitive choice. Each round produced a long, careful, well-cited document. Each document concluded that the design was sound subject to specific empirical questions, which we planned to answer in the prototype.
At which point we noticed something uncomfortable. The questions the rounds had asked were the questions a designer of this kind of architecture would think to ask. The rounds had not been asked to find unknown unknowns. They had been asked to evaluate candidate concerns. We didn't have an answer to the question we hadn't thought of.
So we ran a fourth round, with a different prompt: tell us what the first three rounds missed. We enumerated eight categories of potential blind spot we had identified in advance and asked the model to attack each one, plus a final open-ended "anything outside these eight." The result is what this post is about.
The five findings
Five separate problems, any one of which would have invalidated the architecture at deployment time. We list them in the order they were reported; their dependencies are tangled enough that any ordering misleads.
Dimensionality and sparsity guillotine
The FPGA hardware available to us supports an Ising graph with
N = 128 spins and MAX_DEGREE = 32. The
Deep Think rounds had specified a continuous EBM with latent
dimension d ∈ {256, 512}, which already exceeds the spin count.
Distilling that EBM to a quadratic surrogate produces a dense
d × d Hessian in which every spin couples to all d − 1 others.
Forcing that Hessian into a graph with maximum degree 32 keeps
at most 32 of 511 couplings per spin at d = 512, throwing away
over 93% of the energy topology. The verifier manifold distorts
catastrophically.
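The 93% figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every off-diagonal entry of the dense Hessian is nonzero and that the degree cap keeps at most `MAX_DEGREE` neighbours per spin; the function name is ours, for illustration only.

```python
def dropped_coupling_fraction(d: int, max_degree: int) -> float:
    """Fraction of a dense d x d coupling matrix's off-diagonal entries
    that cannot be kept when each spin has at most max_degree neighbours."""
    neighbours_dense = d - 1                  # every other spin, in a dense Hessian
    kept = min(max_degree, neighbours_dense)  # the cap only binds when d - 1 > max_degree
    return 1.0 - kept / neighbours_dense

MAX_DEGREE = 32  # hardware limit quoted above
for d in (256, 512):
    pct = 100 * dropped_coupling_fraction(d, MAX_DEGREE)
    print(f"d = {d}: {pct:.1f}% of couplings dropped")
```

Under these assumptions the loss is about 87.5% at d = 256 and about 93.7% at d = 512.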
The first three rounds had reasoned about the EBM and the FPGA independently. Nothing in any of them put the two side by side and asked whether the EBM's natural dimensionality fit the FPGA's natural graph.
Synchronous Glauber limit-cycle collapse
This is the finding the eight-category prompt did not enumerate. It was returned under the "anything outside these eight" slot. It is the highest-impact finding because of where it came from.
The FPGA's Ising sampler uses synchronous parallel Glauber updates on a bipartite checkerboard schedule — standard practice for parallelizing Markov-chain Monte Carlo on a fixed graph. Synchronous parallel updates on an Ising lattice with frustrated or antiferromagnetic couplings mathematically violate detailed balance and do not converge to the Boltzmann distribution they're meant to sample. Instead, the chain deterministically collapses into infinite period-2 oscillations between two complementary spin patterns. The hardware sampler is not sampling. It is blinking.
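The blinking behaviour is easy to reproduce on a toy problem. The sketch below is an illustration, not the KV260 scheduler: it assumes a two-spin antiferromagnet, a low temperature, and a fully synchronous update in which both spins read the old state. All names and parameter values are hypothetical.

```python
import math
import random

J = -1.0     # antiferromagnetic coupling, E = -J * s0 * s1 (assumed toy value)
BETA = 10.0  # low temperature, so each update is almost deterministic

def p_up(local_field: float) -> float:
    """Glauber probability of setting a spin to +1 given its local field."""
    return 1.0 / (1.0 + math.exp(-2.0 * BETA * local_field))

def step_synchronous(s, rng):
    """Both spins read the OLD state, then update at the same time."""
    fields = (J * s[1], J * s[0])
    return tuple(1 if rng.random() < p_up(h) else -1 for h in fields)

def step_sequential(s, rng):
    """Spin 0 updates first; spin 1 then sees spin 0's NEW value."""
    s0 = 1 if rng.random() < p_up(J * s[1]) else -1
    s1 = 1 if rng.random() < p_up(J * s0) else -1
    return (s0, s1)

def ground_state_fraction(step, n_steps=200, seed=0):
    """Fraction of visited states that are anti-aligned (the ground states)."""
    rng = random.Random(seed)
    s = (1, 1)  # start in an aligned, high-energy state
    hits = 0
    for _ in range(n_steps):
        s = step(s, rng)
        hits += (s[0] != s[1])
    return hits / n_steps

sync_fraction = ground_state_fraction(step_synchronous)
seq_fraction = ground_state_fraction(step_sequential)
print("synchronous:", sync_fraction)
print("sequential: ", seq_fraction)
```

In this toy the synchronous chain alternates between (+1, +1) and (−1, −1), the two highest-energy states, while the sequential chain settles into an anti-aligned ground state after one sweep.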
The implication for already-published Carnot work is sharp. An earlier experiment that reported a 12,788× speedup of the FPGA Ising sampler over a CPU reference was comparing a correct sampler (CPU, sequential, detailed-balance-preserving) against a non-equilibrium one (FPGA, synchronous, not actually sampling the target distribution). The speedup number is real; the comparison is not. We have since labeled that result as preliminary, pending rework with a verified sampler.
Update, 2026-05-23. The rework arrived. A follow-up empirical audit (the Phase-3 Empirical-Readiness round) drove three measurements that closed both halves of this finding. First, a Maximum Mean Discrepancy test between CPU sequential Gibbs energies and KV260 synchronous-Glauber energies on identical n=64 Ising problems over three seeds: MMD p-values at the noise floor (~10⁻³), KS statistics of 0.998–0.999. The distributions are statistically distinguishable; the FPGA is not sampling Boltzmann. Second, a CPU baseline running the same synchronous-parallel schedule the FPGA runs: 23.6 μs per sample at n=64, which makes the apples-to-apples FPGA-vs-CPU speedup at n=64 a factor of 0.98×; that is, the FPGA is two percent slower than the CPU. The 12,788× number was both apples-to-oranges on the sampler comparison AND extrapolated past a crossover point the architecture's actual latent dimensions don't cross. The honest paper claim is no longer a speedup at all; it is a proof-of-concept latency anchor for a substrate that becomes interesting at scales we have not yet measured.
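For readers unfamiliar with the KS statistic quoted above, here is a minimal standard-library sketch of the two-sample version. It is not the audit's code, and the sample values are made up for illustration. A value near 1 means the two energy samples barely overlap.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed sample values.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical energy samples, for illustration only.
reference_energies = [-10.0, -9.0, -9.0, -8.0]
suspect_energies = [3.0, 4.0, 4.0, 5.0]
print(ks_statistic(reference_energies, suspect_energies))
```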
Hopfield capacity mode collapse
A pairwise Ising model E = z^T J z is, up to a sign
convention, a Hopfield network. Hopfield's classic result puts
the storage capacity at roughly 0.14 N independent
patterns for a fully-connected graph; with maximum degree
capped, the bound is tighter still. For our hardware that is
fewer than 20 independent local minima. The continuous EBM
aims to express thousands of semantically distinct linguistic
basins. Distilling thousands of modes into a structure that can
hold fewer than 20 is exactly the catastrophic repetitive-text
failure mode this architecture was designed to avoid.
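As a worked number, taking the classic 0.14 N estimate at face value for the N = 128 spins quoted earlier, and ignoring the further tightening from the degree cap (which is not quantified here):

```latex
P_{\max} \approx 0.14\,N = 0.14 \times 128 \approx 17.9 < 20
```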
Taylor-induced spurious black holes
Deep EBMs are non-convex. Their Hessians contain negative
eigenvalues. The distillation step sets J to the
Hessian of the deep EBM at a chosen anchor point, inheriting
those negative eigenvalues directly. On unbounded
ℝ^d the quadratic surrogate goes to negative
infinity along negative-eigenvalue eigenvectors — an
unphysical pathology that the original deep EBM doesn't have
because the original wasn't quadratic in the first place.
On the FPGA's bounded {-1, +1}^d hypercube the
pathology takes a physically realized form. The global minima
of the surrogate manifest at extreme corners aligned with the
negative-eigenvalue axes — corners the original EBM
correctly evaluates as high-energy junk. The Glauber
sampler plunges into these corners and stays there. The
hardware confidently outputs configurations the original
model would have rejected as gibberish.
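A two-dimensional toy makes the mechanism concrete. The energy below is an assumption chosen for illustration, not Carnot's EBM: it has a negative-curvature direction at the anchor and a quartic wall that keeps it bounded. The surrogate is a second-order Taylor expansion built from a finite-difference Hessian.

```python
import itertools

def true_energy(x):
    """Toy non-convex energy: negative curvature along u = x1 + x2 near the
    origin, with a quartic wall that makes large |u| expensive."""
    u, v = x[0] + x[1], x[0] - x[1]
    return -0.5 * u**2 + 0.5 * u**4 + 0.5 * v**2

def hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of f at x."""
    n = len(x)
    def shifted(i, si, j, sj):
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

ANCHOR = [0.0, 0.0]  # a stationary point, so the gradient term vanishes
H = hessian(true_energy, ANCHOR)

def surrogate_energy(x):
    """Second-order Taylor surrogate around ANCHOR: E(0) + 0.5 * x^T H x."""
    quad = sum(H[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
    return true_energy(ANCHOR) + 0.5 * quad

corners = list(itertools.product((-1, 1), repeat=2))
surrogate_best = min(corners, key=surrogate_energy)
true_worst_value = max(true_energy(c) for c in corners)
for c in corners:
    print(c, "true:", true_energy(c), "surrogate:", round(surrogate_energy(c), 3))
```

In this toy the surrogate's lowest-energy corners are the aligned ones, (1, 1) and (−1, −1), which are exactly the corners the true energy ranks highest.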
Higher-order logic eradication
The verifier ensemble that drives Carnot's signal is fifteen base verifiers composed under AND. Several of those verifiers impose non-linear logical syntaxes: XOR, parity, tree-depth, mutual exclusion. A purely-observable quadratic Ising is mathematically restricted to pairwise (2-SAT) interactions. Pairwise graphs cannot represent XOR or parity constraints over three or more variables without auxiliary hidden variables. This is a textbook computational complexity result, not a subtlety.
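The restriction can be verified by brute force on three spins. In the ±1 encoding, three-way XOR is the product z1·z2·z3. The sketch below shows that this function has zero overlap with every feature a quadratic energy can use, so any quadratic energy has the same mean on parity-satisfying and parity-violating states. The weights are arbitrary illustrative values.

```python
import itertools

def monomials(z):
    """All features available to a quadratic Ising energy on three visible
    spins: a constant, the single spins, and the pairwise products."""
    z1, z2, z3 = z
    return [1, z1, z2, z3, z1 * z2, z1 * z3, z2 * z3]

states = list(itertools.product((-1, 1), repeat=3))
parity = {z: z[0] * z[1] * z[2] for z in states}  # +1/-1 encoding of 3-way XOR

# Inner product of the parity function with each quadratic feature.
overlaps = [sum(parity[z] * monomials(z)[k] for z in states) for k in range(7)]
print("overlaps:", overlaps)

def quadratic_energy(z, weights):
    """Generic quadratic energy: a weighted sum of the features above."""
    return sum(w * m for w, m in zip(weights, monomials(z)))

weights = [0.3, -1.2, 0.7, 0.5, 2.0, -0.4, 1.1]  # arbitrary couplings
even = [quadratic_energy(z, weights) for z in states if parity[z] == 1]
odd = [quadratic_energy(z, weights) for z in states if parity[z] == -1]
mean_gap = sum(even) / 4 - sum(odd) / 4
print("mean energy gap between parity classes:", mean_gap)
```

Because the mean gap is zero for any choice of weights, no quadratic energy on the visible spins alone can place every parity-satisfying state strictly below every violating one.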
Distillation strips all logical composition above second
order. The deployed FPGA surrogate optimizes an
ungrounded energy landscape structurally incapable of
enforcing the exact logical constraints the verifier was
trained to respect. The hardware can be confident; it cannot
be correct.
Why the unprompted finding matters most
Four of the five findings can in principle be derived from the architecture's stated structure. Once you see them, you can trace back through the original design documents and identify where the question should have been asked. The Hopfield capacity result is in any neural-networks textbook. The Taylor expansion's negative-eigenvalue pathology is a standard result. The XOR-vs-pairwise bound is a textbook computational complexity result. The Hessian-vs-FPGA-graph dimensionality mismatch is arithmetic.
Finding #2 is different. It does not come from the architecture's structure. It comes from the implementation choice for the sampler, which the first three rounds had taken as a black box. A reviewer would have to know that synchronous parallel Glauber on bipartite-but-frustrated graphs violates detailed balance, and would also have to think to ask whether the FPGA's scheduler was producing such a graph in practice. The first three rounds had no reason to ask this. They had been reasoning about the EBM, not about the sampler's stochastic dynamics.
What we did about it
Concretely, the response was twofold. First, we re-scoped the hardware track. The deep-EBM-on-FPGA goal was retired as a near-term Phase 3 deliverable; the KV260 board stays in service as a proof-of-concept of "energy evaluable in dedicated hardware on simple quadratic constraints," and future production hardware targets shift to substrates that can express higher-order interactions natively (Extropic Z1 or equivalent). The architecture is honest about this in the public roadmap.
Second, we codified the procedure that found the five findings. Every architecture phase in Carnot now ships under a single three-part discipline before any scaling decision is committed:
- A software prototype. Concrete code in the repository, runnable end-to-end at small scale, not just an architecture document.
- Empirical validation criteria. A documented list of measurable pass/fail tests with explicit thresholds. For Phase 3 that meant, among others: track the per-iteration grounding signal $\alpha_t$ across MLD steps; verify that $\inf_t \alpha_t > 0$; measure the decoder's joint-constraint pass rate; measure the KL divergence between the FPGA sampler's empirical distribution and a Gibbs reference. The Glauber limit-cycle pathology would have been instantly visible to the KL check the first time it ran.
- An adversarial check. A hostile-reviewer round explicitly commissioned to find ways the prototype could pass its acceptance gates without actually working. Required before scaling, not after. The five findings above came from exactly this kind of round.
Empirical instrumentation IS the adversarial check at scale. A
prototype that emits the right diagnostics surfaces architecture-level
flaws automatically. A prototype that does not will let
flaws ship. We therefore now require every phase prototype to
include the diagnostic instrumentation for every
theoretical concern the phase rests on. $\alpha_t$
tracking. Joint null-space estimation. KL-to-target measurement.
Decoded-text diversity. None of these are optional.
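A KL-to-target measurement at small scale can be sketched as follows. This is a hedged illustration, not Carnot's instrumentation: it assumes a three-spin antiferromagnetic toy problem small enough to enumerate exactly, uses a sequential Gibbs sampler as the reference, and stands in for a limit-cycling sampler with a hand-built alternating sequence. All names are hypothetical.

```python
import itertools
import math
import random
from collections import Counter

BETA = 0.5
# Toy problem (assumed): three spins, all pairs coupled antiferromagnetically.
J = {(0, 1): -1.0, (0, 2): -1.0, (1, 2): -1.0}

def energy(s):
    """Ising energy E(s) = -sum_{a<b} J_ab * s_a * s_b."""
    return -sum(j * s[a] * s[b] for (a, b), j in J.items())

def exact_boltzmann():
    """Exact target distribution by enumerating all 2^3 states."""
    states = list(itertools.product((-1, 1), repeat=3))
    weights = [math.exp(-BETA * energy(s)) for s in states]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}

def sequential_gibbs(n_sweeps, seed=0):
    """Reference sampler: one spin at a time, each seeing fresh neighbours."""
    rng = random.Random(seed)
    s = [1, 1, 1]
    samples = []
    for _ in range(n_sweeps):
        for i in range(3):
            field = sum(j * s[b if a == i else a]
                        for (a, b), j in J.items() if i in (a, b))
            p_up = 1.0 / (1.0 + math.exp(-2.0 * BETA * field))
            s[i] = 1 if rng.random() < p_up else -1
        samples.append(tuple(s))
    return samples

def kl_to_target(samples, target):
    """KL(empirical || target) in nats, from a list of sampled states."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * math.log((c / n) / target[s]) for s, c in counts.items())

target = exact_boltzmann()
kl_reference = kl_to_target(sequential_gibbs(20000), target)
blinking = [(1, 1, 1), (-1, -1, -1)] * 10000  # stand-in for a period-2 limit cycle
kl_blinking = kl_to_target(blinking, target)
print("sequential Gibbs:", kl_reference)
print("blinking chain:  ", kl_blinking)
```

In this toy the reference sampler's KL is close to zero, while the blinking sequence sits several nats away from the target, which is the kind of gap a pass/fail threshold can catch on the first run.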
The cost of doing this versus the cost of not
We want to be honest about the trade. The three Deep Think rounds took roughly a week of human time and a meaningful amount of model compute. The blind-spot audit took half a day. The replacement empirical instrumentation runs on every milestone and adds about two percent to the cost of each prototype run.
Counterfactually, the cost of shipping the architecture without the blind-spot round was on the order of a full FPGA bitstream bring-up plus a downstream paper draft that would have asserted measurements which the sampler does not actually support. We do not know exactly how much wall-clock time that would have taken; we know it would have been more than half a day. We also know that wrong measurements, once published, are approximately impossible to retract cleanly.
What generalizes
Three operational takeaways from the exercise that apply beyond Carnot:
- Add an explicit "anything outside these N categories" slot to every blind-spot review. Finding #2 lived in that slot. A review prompt that does not invite the reviewer to notice the unknown unknown is a review that will not find it.
- Wire instrumentation for every theoretical concern, not just the ones you think you'll need. Each of the four "derivable" findings above has an instrument that would catch it on the first prototype run. We did not have any of those instruments before the audit, because nobody had thought to look for the failure they would catch. After the audit, all four exist.
- Treat the speedup numbers from any non-instrumented sampler as preliminary. A Markov chain that violates detailed balance can be arbitrarily fast and arbitrarily wrong. A KL-to-reference measurement, taken once at small scale, is cheap insurance against the speedup-by-cheating failure mode.
Carnot is a project about verifying language-model outputs with energy-based methods. It would be embarrassing if our own architecture review failed to verify itself with the same rigor. This is the procedure we adopted so that it does.
Further reading
- The full audit document, including the four DEGRADING and one COSMETIC findings we elided here: phase3-architecture-blindspot-audit-results.md
- The discipline rule, codified in the project's CLAUDE.md — search for "Phase Prototype + Empirical Validation + Adversarial Check Discipline."
- The retire-and-rescope decision document: phase3-dae-debm-pivot-decision.md
- Companion posts: Caught Cheating (fabricated artifacts) and Regex in an NTK Costume (disguised verifiers). The three posts describe the three independent failure modes our discipline catches.