Apache 2.0 · Open Source

Catch the mistakes your LLM confidently makes up.

Large language models sometimes produce answers that sound right but aren't. Carnot is an open-source tool that checks whether an answer is internally consistent — and suggests a fix when it isn't. Available on PyPI (pip install carnot-ebm) and mirrored on HuggingFace.

2,635 · Experiment runs
382 · Completed milestones
25,608 · Automated tests
0.9131 · Verifier AUROC (5-seed dual-condition)

Recent progress

Carnot is installable via PyPI and mirrored on HuggingFace

The verify-repair pipeline installs with pip install carnot-ebm, and model weights are mirrored at huggingface.co/Carnot-EBM. The verifier ensemble reaches 0.9131 AUROC on FoVer (5-seed dual-condition; architecture-only 0.8947). This figure was repinned from the v2 value of 0.9857 after a pre-submission adversarial audit; see "Why We Report Two AUROCs Now". The canonical source repository is github.com/Carnot-EBM/carnot-ebm.

LLMs predict. They don't check.

An LLM writes one word at a time, always picking the most likely next word. That's great for fluent language. It's not designed to ask "wait, does this answer add up?"

Ask an LLM for 47 + 28. It might say "47 + 28 = 76" with total confidence. The arithmetic is wrong, but nothing in the generation process noticed. Carnot is the second pair of eyes.

One-word-at-a-time

An LLM that's already committed to "47 + 28 = 7" can only pick whatever most likely comes next. It cannot go back and reconsider the whole answer.

Whole-answer check

Carnot reads the complete answer, pulls out the claims it makes ("47 + 28 = 76"), checks those claims, and tells you which ones don't hold up.

Extract → Check → Repair

Three steps. No model fine-tuning required — Carnot works with any LLM you can call (Claude, GPT, Qwen, Gemma, local Llama).

1. Extract the claims

Carnot reads the LLM's answer and pulls out the specific claims it's making: arithmetic equations, type assertions, code behaviours, cited facts. For code, this is straightforward parsing. For maths, we use patterns and a second small LLM pass. These become the constraints we'll check.
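As a rough illustration of the pattern-based part of extraction, the sketch below pulls simple integer equations out of an answer with a regular expression. The names ArithmeticClaim, CLAIM_PATTERN, and extract_arithmetic_claims are invented for this example and are not Carnot's API; the real extractor also covers code, types, and cited facts.

```python
import re
from typing import NamedTuple

class ArithmeticClaim(NamedTuple):
    """One 'left op right = stated' claim (hypothetical structure)."""
    left: int
    op: str
    right: int
    stated: int

# Matches simple integer claims such as "47 + 28 = 76".
CLAIM_PATTERN = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

def extract_arithmetic_claims(answer: str) -> list:
    """Return every 'a op b = c' claim found in the answer text."""
    return [
        ArithmeticClaim(int(a), op, int(b), int(c))
        for a, op, b, c in CLAIM_PATTERN.findall(answer)
    ]

claims = extract_arithmetic_claims("First, 47 + 28 = 76. Then 76 * 2 = 152.")
for claim in claims:
    print(claim)
```

Each extracted tuple is a constraint that a later step can check on its own, independent of how the surrounding sentence was worded.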

2. Check them

Each claim gets checked the right way for its kind: equations run through a formal solver, code runs through property-based tests, claims that can contradict each other get compared. Every check produces a single energy score — low means the claim is consistent, high means something is wrong.
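One way to picture the energy score, assuming claims arrive as (left, operator, right, stated result) tuples: a claim's energy is zero when it holds and grows with the size of the error. This is only a sketch. Plain Python arithmetic stands in for the formal solver here, and claim_energy and total_energy are hypothetical names.

```python
import operator

# Supported operators for this toy example.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def claim_energy(left, op, right, stated):
    """0.0 when the stated result is correct; otherwise the absolute error."""
    return float(abs(OPS[op](left, right) - stated))

def total_energy(claims):
    """Sum of per-claim energies; 0.0 means every claim is consistent."""
    return sum(claim_energy(*claim) for claim in claims)

claims = [(47, "+", 28, 76), (76, "*", 2, 152)]
print([claim_energy(*c) for c in claims])  # [1.0, 0.0]
print(total_energy(claims))                # 1.0
```

The first claim is off by one, so it alone contributes to the total; the second is consistent and contributes nothing.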

3. Repair

If the total energy is high, Carnot sends the specific violations back to the LLM as feedback and asks for a fix. The loop runs until the answer checks out or a budget is spent. No violations, no repair needed — the original answer is returned unchanged.
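The loop described above can be sketched as follows. Everything here is illustrative: find_violations is a toy checker that only knows this page's running example, fake_llm stands in for a real model call, and repair_loop is not the name of any Carnot function.

```python
def find_violations(answer):
    """Toy checker for this page's running example only."""
    return ["47 + 28 = 76 (correct: 75)"] if "47 + 28 = 76" in answer else []

def repair_loop(question, answer, ask_llm, budget=3):
    """Re-prompt with the specific violations until clean or out of budget."""
    for _ in range(budget):
        violations = find_violations(answer)
        if not violations:
            return answer, True  # nothing to fix: answer returned as-is
        prompt = f"{question}\nProblems found: {violations}\nPlease correct your answer."
        answer = ask_llm(prompt)
    # Budget spent: report whether the last attempt checks out.
    return answer, not find_violations(answer)

def fake_llm(prompt):
    """Stand-in for a real model call (assumption for this sketch)."""
    return "47 + 28 = 75"

print(repair_loop("What is 47 + 28?", "47 + 28 = 76", fake_llm))
# ('47 + 28 = 75', True)
```

Passing the model in as a plain callable keeps the loop independent of any particular LLM provider.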

Why “Energy-Based”?

The framework beneath all this comes from physics. An Energy-Based Model assigns a single number (energy) to a candidate answer, where low energy means valid and high energy means invalid. This gives a scoring rule that does not need per-question tuning, plus a gradient that the repair step can descend on. The same mathematics maps onto specialized hardware (FPGA Ising machines, thermodynamic samplers), which is why Carnot can scale beyond conventional CPUs.

Seven capabilities, one framework

Carnot ships with a routed verifier stack tuned for different kinds of LLM output. Pick one, or let the pipeline route automatically.

Code

Does the function actually do what it says on the tin? Carnot runs property-based tests generated from the function signature, plus any official tests you already have. When the LLM writes code that passes the official tests but fails on edge cases, we catch it. On the 164-problem HumanEval benchmark, 99.3% of wrong code was flagged, and repair raised the pass rate by 3.0 points.
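To make the "passes official tests, fails on edge cases" situation concrete, here is a minimal hand-rolled random test using only the standard library. It is a stand-in for a real property-based testing framework, not Carnot's test generator; llm_max and find_counterexample are hypothetical names.

```python
import random

def llm_max(xs):
    """Stand-in for LLM-written code: wrong when every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

def find_counterexample(fn, trials=200, seed=0):
    """Try random integer lists; return the first one that breaks a property."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        result = fn(xs)
        # Properties any correct max must satisfy:
        # the result is an element, and no element exceeds it.
        if result not in xs or any(result < x for x in xs):
            return xs
    return None

assert llm_max([1, 3, 2]) == 3       # the "official" test passes
print(find_counterexample(llm_max))  # a list whose elements are all negative
print(find_counterexample(max))      # None: the built-in satisfies both properties
```

The properties are stated without a reference implementation, which is what lets this style of test catch bugs the hand-written cases never exercised.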

Maths & arithmetic

Every equation in the answer gets extracted and checked by a formal solver (Z3). Wrong sum, bad algebra, inconsistent units — flagged.

Typed constraints

For structured outputs (JSON, function arguments, tool calls), Carnot checks type shapes and value bounds. This adds 4.9 points over the raw model on a typed-constraint benchmark, and raises completion from 4% to 12% on the CCTU constrained tool-use micro-benchmark.
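A small sketch of what checking type shapes and value bounds on a tool call might look like. The schema format, the book-a-hotel fields, and check_tool_call are all assumptions made for illustration; Carnot's own constraint format may differ.

```python
import json

# Hypothetical schema: expected Python type plus optional numeric bounds.
SCHEMA = {
    "city": {"type": str},
    "nights": {"type": int, "min": 1, "max": 14},
}

def check_tool_call(args, schema):
    """Return readable violations; an empty list means the call conforms."""
    violations = []
    for name, rule in schema.items():
        if name not in args:
            violations.append(f"{name}: missing")
            continue
        value = args[name]
        expected = rule["type"]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        wrong_type = not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        )
        if wrong_type:
            violations.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            violations.append(f"{name}: {value} is below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            violations.append(f"{name}: {value} is above maximum {rule['max']}")
    return violations

llm_output = '{"city": "Oslo", "nights": 30}'
print(check_tool_call(json.loads(llm_output), SCHEMA))
# ['nights: 30 is above maximum 14']
```

Returning specific messages rather than a bare pass/fail gives a repair step something concrete to send back to the model.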

Multi-step reasoning

For chain-of-thought answers, each step is checked against the ones that came before. A verified-fact memory catches the classic "confident but inconsistent" failure mode where individual steps look correct in isolation but contradict one another. The production verifier is an ensemble of complementary checks (symbolic, energy-based, and consistency-based) AND-composed so a candidate must pass all of them.
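The AND-composition idea, together with a very simple verified-fact memory, might be sketched like this. The two component checks are toys chosen for brevity (the production ensemble's symbolic, energy-based, and consistency-based checks are not shown), and all function names are hypothetical.

```python
import re

def format_check(steps):
    """Symbolic-style check: every step must look like 'name = integer'."""
    return all(re.fullmatch(r"\s*[a-z]\w*\s*=\s*-?\d+\s*", s) for s in steps)

def consistency_check(steps):
    """Verified-fact memory: fail if a later step contradicts an earlier value."""
    memory = {}
    for step in steps:
        name, _, value = step.partition("=")
        name, value = name.strip(), value.strip()
        # setdefault stores the first value seen and returns it on later lookups.
        if memory.setdefault(name, value) != value:
            return False
    return True

def and_compose(*checks):
    """Build a verifier that passes only when every component check passes."""
    def verifier(steps):
        failed = [check.__name__ for check in checks if not check(steps)]
        return (not failed, failed)
    return verifier

verify = and_compose(format_check, consistency_check)
print(verify(["x = 5", "y = 12"]))           # (True, [])
print(verify(["x = 5", "y = 12", "x = 7"]))  # (False, ['consistency_check'])
```

In the second call every step is well formed on its own; only the memory of the earlier "x = 5" exposes the contradiction.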

Research operations

Carnot is developed through an autonomous research loop: a planner generates experiment proposals, an inner agent runs each experiment, and an adversarial-verify pass catches fabricated or methodology-incomplete artifacts before they influence headline claims. Every experiment artifact is stored as a structured JSON record with reproducibility metadata.

APIs & portable memory

Carnot exposes its verifier and repair capabilities through multiple integration surfaces: a Python API, a command-line tool, an MCP server, and an HTTP REST endpoint. Verdicts are structured VerdictRecord objects. Streaming verification, candidate-level reranking, and portable SessionMemory packs (export, import, diff, merge) let callers integrate the framework without lock-in to a single surface.

Test-Time Compute (TTC) & PREM

A dynamic budget controller scales Test-Time Compute (TTC) based on Process-Reward Energy Model (PREM) variance. This provides intrinsic motivation for continuous self-learning.
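A variance-driven budget could, under some loud assumptions, look like the sketch below. It assumes the PREM yields one energy per reasoning step and that a wider spread should buy more attempts; the function name, the linear mapping, and the default constants are all invented for illustration and say nothing about Carnot's actual controller.

```python
from statistics import pvariance

def compute_budget(step_energies, min_budget=1, max_budget=8, gain=20.0):
    """Map the spread of per-step energies to a number of attempts (assumed rule)."""
    if len(step_energies) < 2:
        return min_budget  # not enough scores to measure spread
    spread = pvariance(step_energies)
    # More disagreement between steps means more compute, capped at max_budget.
    return min(max_budget, min_budget + round(gain * spread))

print(compute_budget([0.1, 0.1, 0.1]))       # 1
print(compute_budget([0.0, 1.0, 0.0, 1.0]))  # 6
```

Uniform scores leave the budget at its floor, while scores that swing between extremes push it up until the cap takes over.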

What we measured

Every number below is backed by a checked-in experiment artifact. Model benchmark rows use live GPU inference on public models; infrastructure, hardware, ensemble, synthetic-pilot, and adversarial-audit rows are labeled by provenance. Synthetic pilots are included only when the card says so; they are not model-generation headline claims.

Code — Verify-and-repair on HumanEval

+3.0 points on pass-rate

Typed constraints (structured outputs)

+4.9 points on compliance

Safety — Prompt-injection classifier

0.91 AUROC (publication gate)

Hardware — KV260 FPGA prototype

Ising sampler live on silicon

Training — Two-GPU parallel retrain

2.0× speedup, identical losses

Code repair — IterativeSelfRepair (HumanEval-50, execute-feedback-retry)

8% → 80% pass rate (+72pp)

Math reasoning — EstimationVerifier SVAMP AUC

0.90 AUC (vs 0.125 FoVer baseline)

Live benchmark — SOTA 35B (Qwen3.6-35B-A3B, dual RTX 3090)

HumanEval pass@1: 0% → 36% after Carnot correction

Math extraction — VeriCoT equation-style CoT fix

GSM8K extraction TP rate: 0.5 → 1.0

Adversarial audit — PRM-BiasBench-style attacks

k=5 ensemble catches 60/60 attacks

Cascade routing — HalluGuard v3

0.0pp accuracy delta with 4.4% cost savings

Tool use — CCTU constrained micro-benchmark

Completion rate: 4% → 12%

Carnot position paper (paper-v6)

An architectural framework for energy-based verification of LLM output. The current draft anchors its claims to checked-in experiment artifacts and labels every benchmark number with provenance. The arXiv submission is prepared but pending operator-initiated upload.

Install and verify in five lines

Install with pip, point it at an LLM response, and get back a list of specific violations — plus, if you want, a repaired answer. Works with any model you can call; the examples below use a small public Qwen checkpoint so you can run them on a single GPU.

from carnot.pipeline import VerifyRepairPipeline

# Verify any LLM output in 3 lines
pipeline = VerifyRepairPipeline()
result = pipeline.verify("What is 47 + 28?", "47 + 28 = 76")
print(result.verified)   # False
print(result.violations)  # [47 + 28 = 76 (correct: 75)]

# Auto-repair with LLM feedback loop
fixed = pipeline.verify_and_repair(
    "What is 47 + 28?", model="Qwen/Qwen3.5-0.8B"
)
print(fixed.final_response)  # "47 + 28 = 75"

→ Walk through the 30-minute tutorial to build a hallucination-resistant verify-and-repair function end-to-end.

From the blog

Longer-form notes from the Carnot project on energy-based verification, self-distillation theory, and what it has been like running the verifier stack on its own development.

Lessons · 2026-05-23 · ~11 min

Two Retractions and a Rescue

We paid for a hostile audit of our paper draft. Seven fatal findings. Three rescue measurements. Two retractions, one rescue. The narrower paper we ended up with is one we can actually defend.

Operations · 2026-05-22 · ~10 min

Regex in an NTK Costume

A verifier with a docstring citing arXiv:2601.18753 NTK turned out to be 56 lines of regex. Another sleep-padded its wall-clock to escape our detector. The three-layer defense we shipped.

Lessons · 2026-05-22 · ~11 min

Five FATAL Findings Three Deep Think Rounds Missed

Three rigorous theory rounds approved the architecture. A single blind-spot audit pass found five fatal flaws, one of them outside the eight categories we'd thought to ask about.

Operations · 2026-05-22 · ~9 min

Caught Cheating: 95 Microseconds on a 30B Model

Our autonomous research loop produced an artifact claiming a complete evaluation of a 30-billion-parameter language model in 95 microseconds. What we found, and the seven-rule detector we shipped to stop it.

Methodology · 2026-05-22 · ~8 min

Why We Report Two AUROCs Now

A self-improving system that reads from its own past outputs blurs the line between architectural capability and what it has memorized. The cleanest way to keep the line visible is to publish both numbers.

Theory · 2026-04-29 · ~10 min

The Verifier Accuracy Paradox

A counterintuitive result from verifier-filtered self-distillation: the better your verifier, the less information it gives you. To sculpt a model toward truth, your verifier must make mistakes.

Operations · 2026-04-29 · ~7 min

Carnot Dogfooding by the Numbers

639 experiments self-verified. 65 brace bugs auto-fixed. Zero false positives. What 26 days of running Carnot on its own development tells us about constraint-based code analysis in production.
