Large language models sometimes produce answers that sound right but aren't.
Carnot is an open-source tool that checks whether an answer is internally
consistent — and suggests a fix when it isn't. Available on PyPI
(pip install carnot-ebm) and mirrored on HuggingFace.
An LLM writes one word at a time, always picking the most likely next word. That's great for fluent language. It's not designed to ask "wait, does this answer add up?"
Ask an LLM for 47 + 28. It might say "47 + 28 = 76" with
total confidence. The arithmetic is wrong, but nothing in the generation
process noticed. Carnot is the second pair of eyes.
An LLM that's already committed to the prefix "47 + 28 = 7" can only pick the next most-likely character; it cannot go back and reconsider the whole answer.
Carnot reads the complete answer, pulls out the claims it makes ("47 + 28 = 76"), checks those claims, and tells you which ones don't hold up.
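To make the idea concrete, here is a minimal sketch of extracting and checking one kind of claim. It is not Carnot's implementation: the regex, the `check_arithmetic` name, and the shape of the returned records are assumptions made for illustration, and plain integer arithmetic stands in for whatever checker the real pipeline uses.

```python
import re

# Hypothetical pattern: matches simple integer claims such as "47 + 28 = 76".
CLAIM = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def check_arithmetic(answer: str) -> list[dict]:
    """Return one record per arithmetic claim whose stated result is wrong."""
    violations = []
    for left, op, right, stated in CLAIM.findall(answer):
        expected = OPS[op](int(left), int(right))
        if expected != int(stated):
            violations.append({
                "claim": f"{left} {op} {right} = {stated}",
                "expected": expected,
            })
    return violations

print(check_arithmetic("47 + 28 = 76"))
# [{'claim': '47 + 28 = 76', 'expected': 75}]
```

The point of the sketch is the order of operations: the check runs on the finished answer, so it can see the whole equation at once.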
Three steps. No model fine-tuning required — Carnot works with any LLM you can call (Claude, GPT, Qwen, Gemma, local Llama).
Carnot reads the LLM's answer and pulls out the specific claims it's making: arithmetic equations, type assertions, code behaviours, cited facts. For code, this is straightforward parsing. For maths, we use patterns and a second small LLM pass. These become the constraints we'll check.
Each claim gets checked the right way for its kind: equations run through a formal solver, code runs through property-based tests, claims that can contradict each other get compared. Every check produces a single energy score — low means the claim is consistent, high means something is wrong.
If the total energy is high, Carnot sends the specific violations back to the LLM as feedback and asks for a fix. The loop runs until the answer checks out or a budget is spent. No violations, no repair needed — the original answer is returned unchanged.
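The repair loop described above can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: `repair_loop`, `verify`, `regenerate`, and `max_rounds` are hypothetical names, and the two toy functions replace a real verifier and a real LLM call so the example runs on its own.

```python
def repair_loop(question, answer, verify, regenerate, max_rounds=3):
    """Ask for fixes until the answer verifies or the round budget is spent.

    verify(answer) returns a list of violation messages (empty means it checks out).
    regenerate(question, answer, violations) returns a new candidate answer.
    """
    rounds = 0
    violations = verify(answer)
    while violations and rounds < max_rounds:
        answer = regenerate(question, answer, violations)
        rounds += 1
        violations = verify(answer)
    return answer, rounds, violations


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_verify(answer):
    return [] if answer.endswith("= 75") else [f"wrong sum in {answer!r}"]

def toy_regenerate(question, answer, violations):
    return "47 + 28 = 75"

print(repair_loop("What is 47 + 28?", "47 + 28 = 76", toy_verify, toy_regenerate))
# ('47 + 28 = 75', 1, [])
```

An answer with no violations skips the loop entirely and comes back unchanged with zero rounds used.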
The framework beneath all this comes from physics. An Energy-Based Model assigns a single number (energy) to a candidate answer, where low energy means valid and high energy means invalid. This gives a scoring rule that does not need per-question tuning, plus a gradient that the repair step can descend on. The same mathematics maps onto specialized hardware (FPGA Ising machines, thermodynamic samplers), which is why Carnot can scale beyond conventional CPUs.
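As a toy illustration of "low energy means valid" and of descending a gradient, the sketch below treats the claimed result as a continuous number and uses the squared residual of the constraint as its energy. The quadratic form, the function names, and the step size are all assumptions chosen for clarity; the source does not say how Carnot defines its energies.

```python
def energy(x: float, a: float = 47.0, b: float = 28.0) -> float:
    """Squared residual of the constraint a + b = x; zero exactly when the claim holds."""
    return (a + b - x) ** 2

def energy_grad(x: float, a: float = 47.0, b: float = 28.0) -> float:
    """Derivative of the energy with respect to the claimed result x."""
    return -2.0 * (a + b - x)

def descend(x: float, steps: int = 50, lr: float = 0.25) -> float:
    """Plain gradient descent: each step moves x toward lower energy."""
    for _ in range(steps):
        x -= lr * energy_grad(x)
    return x

print(energy(76.0))             # 1.0  (the wrong claim has positive energy)
print(energy(75.0))             # 0.0  (the correct claim has zero energy)
print(round(descend(76.0), 6))  # 75.0
```

With this step size the residual halves on every iteration, so the candidate slides from 76 toward 75 without any per-question tuning.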
Carnot ships with a routed verifier stack tuned for different kinds of LLM output. Pick one, or let the pipeline route automatically.
Does the function actually do what it says on the tin? Carnot runs property-based tests generated from the function signature, plus any official tests you already have. When the LLM writes code that passes the official tests but fails on edge cases, we catch it. Tested on the 164-problem HumanEval benchmark: 99.3% of wrong code flagged, repair pushes pass-rate up by 3 points.
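Here is a small sketch of the "passes the official tests, fails on edge cases" situation. It uses a hand-rolled random property check built on the standard library rather than a property-based testing library, and `candidate_max` and `find_counterexample` are hypothetical names invented for this example.

```python
import random

def candidate_max(xs: list[int]) -> int:
    """Toy stand-in for LLM-written code: wrong when every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

def find_counterexample(fn, trials: int = 200, seed: int = 0):
    """Try random non-empty integer lists; return the first input that breaks a property."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 5))]
        result = fn(list(xs))
        # Properties of a maximum: it is an element of the list and nothing exceeds it.
        if result not in xs or any(x > result for x in xs):
            return xs
    return None

assert candidate_max([1, 3, 2]) == 3        # the "official" test passes
print(find_counterexample(candidate_max))   # some all-negative list
print(find_counterexample(max))             # None: the built-in satisfies both properties
```

The properties come from what a maximum must be, not from a list of hand-picked inputs, which is why the all-negative case surfaces even though nobody wrote a test for it.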
Every equation in the answer gets extracted and checked by a formal solver (Z3). Wrong sum, bad algebra, inconsistent units — flagged.
For structured outputs (JSON, function arguments, tool calls), Carnot checks type shapes and value bounds. On a typed-constraint benchmark this adds 4.9 points over the raw model, and completion on the CCTU constrained tool-use micro-benchmark rises from 4% to 12%.
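A shape-and-bounds check of this kind might look like the following sketch. The schema format, the field names, and `check_tool_call` are assumptions for illustration only; they are not Carnot's schema language or API.

```python
import json

# Hypothetical schema: field name -> (expected type, optional (min, max) bounds).
SCHEMA = {
    "city": (str, None),
    "days": (int, (1, 14)),
}

def check_tool_call(raw: str, schema=SCHEMA) -> list[str]:
    """Return a list of violations for a JSON-encoded argument object."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["top-level value must be an object"]
    violations = []
    for name, (expected_type, bounds) in schema.items():
        if name not in args:
            violations.append(f"missing field: {name}")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            violations.append(f"{name}: expected {expected_type.__name__}")
            continue
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            violations.append(f"{name}: {value} outside [{bounds[0]}, {bounds[1]}]")
    return violations

print(check_tool_call('{"city": "Oslo", "days": 30}'))
# ['days: 30 outside [1, 14]']
```

Each violation names the field and the rule it broke, which is the kind of specific feedback a repair step can act on.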
For chain-of-thought answers, each step is checked against the ones that came before. A verified-fact memory catches the classic "confident but inconsistent" failure mode where individual steps look correct in isolation but contradict one another. The production verifier is an ensemble of complementary checks (symbolic, energy-based, and consistency-based) AND-composed so a candidate must pass all of them.
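Two of the ideas in this paragraph are easy to show in miniature: AND-composition of checks, and a memory of earlier facts that later steps are compared against. The sketch below is an assumption-laden toy, with hypothetical names (`and_compose`, `first_contradiction`) and reasoning steps reduced to simple name/value assignments.

```python
from typing import Callable, Optional

Check = Callable[[str], bool]

def and_compose(*checks: Check) -> Check:
    """Build a verifier that accepts a candidate only if every check accepts it."""
    return lambda candidate: all(check(candidate) for check in checks)

def first_contradiction(steps: list[tuple[str, int]]) -> Optional[int]:
    """Return the index of the first step that contradicts an earlier one, else None."""
    memory: dict[str, int] = {}
    for i, (name, value) in enumerate(steps):
        if name in memory and memory[name] != value:
            return i
        memory[name] = value
    return None

# Each step looks fine alone; the third one disagrees with the first.
print(first_contradiction([("x", 3), ("y", 4), ("x", 5)]))  # 2

verifier = and_compose(lambda c: "75" in c, lambda c: c.endswith("."))
print(verifier("The sum is 75."))  # True
print(verifier("The sum is 75"))   # False
```

AND-composition trades recall for precision: one failing check is enough to reject a candidate, so the checks need to be complementary rather than redundant.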
Carnot is developed through an autonomous research loop: a planner generates experiment proposals, an inner agent runs each experiment, and an adversarial-verify pass catches fabricated or methodology-incomplete artifacts before they influence headline claims. Every experiment artifact is stored as a structured JSON record with reproducibility metadata.
Carnot exposes its verifier and repair capabilities through
multiple integration surfaces: a Python API, a command-line
tool, an MCP server, and an HTTP REST endpoint. Verdicts are
structured VerdictRecord objects. Streaming
verification, candidate-level reranking, and portable
SessionMemory packs (export, import, diff, merge)
let callers integrate the framework without lock-in to a
single surface.
Carnot also includes a dynamic budget controller that scales Test-Time Compute (TTC) based on the variance of the Process-Reward Energy Model (PREM). This provides intrinsic motivation for continuous self-learning.
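One way to picture a variance-driven budget is the sketch below. Everything in it is an assumption: the function name `ttc_budget`, the linear mapping, the `gain` and `max_budget` parameters, and the choice to spend more compute when per-step energies disagree more. The source only states that the budget scales with PREM variance.

```python
from statistics import pvariance

def ttc_budget(step_energies, base=1, max_budget=8, gain=4.0):
    """Map the variance of per-step energy scores to a sample budget.

    Identical scores keep the budget at `base`; larger spread raises it,
    capped at `max_budget`.
    """
    if len(step_energies) < 2:
        return base
    variance = pvariance(step_energies)
    return min(max_budget, base + round(gain * variance))

print(ttc_budget([0.1, 0.1, 0.1]))  # 1
print(ttc_budget([0.0, 2.0]))       # 5
print(ttc_budget([0.0, 10.0]))      # 8
```

The cap matters in practice: without it, a single outlier score could consume the whole compute budget for one question.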
Every number below is backed by a checked-in experiment artifact. Model benchmark rows use live GPU inference on public models; infrastructure, hardware, ensemble, synthetic-pilot, and adversarial-audit rows are labeled by provenance. Synthetic pilots are included only when the card says so; they are not model-generation headline claims.
The paper describes an architectural framework for energy-based verification of LLM output. The current draft anchors its claims to checked-in experiment artifacts and labels every benchmark number with provenance. The arXiv submission is prepared but pending operator-initiated upload.
Install with pip, point it at an LLM response, and get back a list of specific violations — plus, if you want, a repaired answer. Works with any model you can call; the examples below use a small public Qwen checkpoint so you can run them on a single GPU.
from carnot.pipeline import VerifyRepairPipeline
# Verify any LLM output in 3 lines
pipeline = VerifyRepairPipeline()
result = pipeline.verify("What is 47 + 28?", "47 + 28 = 76")
print(result.verified) # False
print(result.violations) # [47 + 28 = 76 (correct: 75)]
# Auto-repair with LLM feedback loop
fixed = pipeline.verify_and_repair(
    "What is 47 + 28?", model="Qwen/Qwen3.5-0.8B"
)
print(fixed.final_response) # "47 + 28 = 75"
→ Walk through the 30-minute tutorial to build a hallucination-resistant verify-and-repair function end-to-end.
Longer-form notes from the Carnot project on energy-based verification, self-distillation theory, and what it has been like running the verifier stack on its own development.
We paid for a hostile audit of our paper draft. Seven fatal findings. Three rescue measurements. Two retractions, one rescue. The narrower paper we ended up with is one we can actually defend.
Operations · 2026-05-22 · ~10 min
A verifier with a docstring citing arXiv:2601.18753 NTK turned out to be 56 lines of regex. Another sleep-padded its wall-clock to escape our detector. The three-layer defense we shipped.
Lessons · 2026-05-22 · ~11 min
Three rigorous theory rounds approved the architecture. A single blind-spot audit pass found five fatal flaws, one of them outside the eight categories we'd thought to ask about.
Operations · 2026-05-22 · ~9 min
Our autonomous research loop produced an artifact claiming a complete evaluation of a 30-billion-parameter language model in 95 microseconds. What we found, and the seven-rule detector we shipped to stop it.
Methodology · 2026-05-22 · ~8 min
A self-improving system that reads from its own past outputs blurs the line between architectural capability and what it has memorized. The cleanest way to keep the line visible is to publish both numbers.
Theory · 2026-04-29 · ~10 min
A counterintuitive result from verifier-filtered self-distillation: the better your verifier, the less information it gives you. To sculpt a model toward truth, your verifier must make mistakes.
Operations · 2026-04-29 · ~7 min
639 experiments self-verified. 65 brace bugs auto-fixed. Zero false positives. What 26 days of running Carnot on its own development tells us about constraint-based code analysis in production.