MIT-0 · Open Source

Catch the reasoning errors your LLM states with total confidence.

Large language models sometimes produce answers that sound right but aren't. Carnot is an open-source tool that checks whether an answer is internally consistent and suggests a fix when it isn't. It is available on PyPI (pip install carnot-ebm) and mirrored on Hugging Face.

Start the tutorial · Get Started · How it works

4,958

Experiment runs

382

Completed milestones

30,658

Automated tests

0.9131

Verifier AUROC — FoVer math step-errors (5-seed)

Recent progress

Carnot is installable via PyPI and mirrored on Hugging Face

The verify-repair pipeline ships on PyPI (pip install carnot-ebm), with model weights mirrored at huggingface.co/Carnot-EBM. The verifier ensemble reaches 0.9131 AUROC on FoVer (5-seed dual-condition; architecture-only 0.8947). This headline figure was repinned from the v2 value of 0.9857 after a pre-submission adversarial audit; see "Why We Report Two AUROCs Now". The canonical source repository is github.com/Carnot-EBM/carnot-ebm.

The Problem

LLMs predict. They don't check.

An LLM writes one word (strictly, one token) at a time, each chosen as a likely continuation of the text so far. That's great for fluent language, but the process is not designed to ask "wait, does this answer add up?"

Ask an LLM for 47 + 28. It might say "47 + 28 = 76" with total confidence. The arithmetic is wrong, but nothing in the generation process noticed. Carnot is the second pair of eyes.

One-word-at-a-time

An LLM that's already committed to "47 + 28 = 7" can only pick the next most-likely character — not go back and reconsider the whole answer.

Whole-answer check

Carnot reads the complete answer, pulls out the claims it makes ("47 + 28 = 76"), checks those claims, and tells you which ones don't hold up.

How it works

Extract → Check → Repair

Three steps. No model fine-tuning required — Carnot works with any LLM you can call (Claude, GPT, Qwen, Gemma, local Llama).

Extract the claims

Carnot reads the LLM's answer and pulls out the specific claims it's making: arithmetic equations, type assertions, code behaviours, cited facts. For code, this is straightforward parsing. For maths, we use patterns and a second small LLM pass. These become the constraints we'll check.
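As a rough illustration of the pattern-based part of extraction, the sketch below pulls simple two-operand integer claims out of free text with a regular expression. The names `ArithmeticClaim` and `extract_arithmetic_claims` are hypothetical and chosen for this example only; they are not Carnot's API, and the second LLM pass is not shown.

```python
import re
from dataclasses import dataclass

# Matches "a <op> b = c" with non-negative integers, e.g. "47 + 28 = 76".
_CLAIM_RE = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")


@dataclass(frozen=True)
class ArithmeticClaim:
    """One extracted claim of the form `left op right = stated`."""
    left: int
    op: str
    right: int
    stated: int


def extract_arithmetic_claims(text: str) -> list[ArithmeticClaim]:
    """Return every simple integer arithmetic claim found in `text`."""
    return [
        ArithmeticClaim(int(a), op, int(b), int(c))
        for a, op, b, c in _CLAIM_RE.findall(text)
    ]


# Two claims are extracted; nothing has been checked yet at this stage.
claims = extract_arithmetic_claims("First, 47 + 28 = 76. Then 76 - 6 = 70.")
```

Extraction only records what the answer asserts. Deciding whether each assertion is true is left to the checking step.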

Check them

Each claim gets checked the right way for its kind: equations run through a formal solver, code runs through property-based tests, claims that can contradict each other get compared. Every check produces a single energy score — low means the claim is consistent, high means something is wrong.
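One way to picture the energy score is a toy version for arithmetic claims. This is a minimal sketch under stated assumptions: energy is 0.0 for a consistent claim and 1.0 for a violated one, the total is a plain sum, and ordinary Python arithmetic stands in for the formal solver. The function names are hypothetical, not Carnot's API.

```python
import operator

# Assumption: plain integer arithmetic stands in for a formal solver.
_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def claim_energy(left: int, op: str, right: int, stated: int) -> float:
    """Return 0.0 if the stated result is correct, 1.0 otherwise."""
    return 0.0 if _OPS[op](left, right) == stated else 1.0


def total_energy(claims: list[tuple[int, str, int, int]]) -> float:
    """Sum per-claim energies; 0.0 means every claim checked out."""
    return sum((claim_energy(*claim) for claim in claims), 0.0)


good = total_energy([(47, "+", 28, 75)])                  # all claims hold
bad = total_energy([(47, "+", 28, 76), (76, "-", 6, 70)])  # first claim is wrong
```

A binary per-claim energy is the simplest possible choice; a graded score (for example, distance from the correct value) would fit the same interface.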

Repair

If the total energy is high, Carnot sends the specific violations back to the LLM as feedback and asks for a fix. The loop runs until the answer checks out or a budget is spent. No violations, no repair needed — the original answer is returned unchanged.
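The loop described above can be sketched in a few lines. Everything here is illustrative: `repair_loop`, `toy_verify`, and `toy_regenerate` are hypothetical names, the "LLM" is a stub that always returns the corrected answer, and the budget is assumed to be a fixed number of rounds.

```python
from typing import Callable

Verify = Callable[[str], list[str]]           # answer -> violation messages
Regenerate = Callable[[str, list[str]], str]  # (question, violations) -> new answer


def repair_loop(question: str, answer: str, verify: Verify,
                regenerate: Regenerate, max_rounds: int = 3) -> tuple[str, bool]:
    """Re-ask until `verify` reports no violations or the budget is spent."""
    for _ in range(max_rounds):
        violations = verify(answer)
        if not violations:
            return answer, True  # nothing to fix: return the answer as it stands
        answer = regenerate(question, violations)
    return answer, not verify(answer)


def toy_verify(answer: str) -> list[str]:
    """Stub verifier that only knows the one correct answer."""
    return [] if answer.strip() == "47 + 28 = 75" else ["47 + 28 should equal 75"]


def toy_regenerate(question: str, violations: list[str]) -> str:
    """Stub for an LLM call that receives the violations as feedback."""
    return "47 + 28 = 75"


final, ok = repair_loop("What is 47 + 28?", "47 + 28 = 76", toy_verify, toy_regenerate)
```

Passing the verifier and the generator in as callables keeps the loop independent of any particular model, which matches the "any LLM you can call" framing.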

Why “Energy-Based”?

The framework beneath all this comes from physics. An Energy-Based Model assigns a single number (energy) to a candidate answer, where low energy means valid and high energy means invalid. This gives a scoring rule that does not need per-question tuning, plus a gradient that the repair step can descend on. The same mathematics maps onto specialized hardware (FPGA Ising machines, thermodynamic samplers), which is why Carnot can scale beyond conventional CPUs.
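For readers new to the idea, the textbook Ising energy is a compact example of "one number per candidate". The sketch below implements the standard formula E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i in plain Python. It is not taken from Carnot's code, and how Carnot defines its own energy functions is not specified here.

```python
def ising_energy(spins: list[int], couplings: list[list[float]],
                 fields: list[float]) -> float:
    """Textbook Ising energy for spins in {-1, +1}."""
    n = len(spins)
    # Each unordered pair (i, j) with i < j is counted once.
    pair_term = sum(
        couplings[i][j] * spins[i] * spins[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    field_term = sum(h * s for h, s in zip(fields, spins))
    return -(pair_term + field_term)


# Two spins with a positive coupling "prefer" to agree.
J = [[0.0, 1.0], [1.0, 0.0]]
h = [0.0, 0.0]
aligned = ising_energy([1, 1], J, h)   # -1.0: low energy, constraint satisfied
opposed = ising_energy([1, -1], J, h)  # 1.0: high energy, constraint violated
```

Read the coupling as a constraint ("these two variables should agree") and the low-energy state is the one that satisfies it.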

What you can check

Seven capabilities, one framework

Carnot ships with a routed verifier stack tuned for different kinds of LLM output. Pick one, or let the pipeline route automatically.

Code

Does the function actually do what it says on the tin? Carnot runs property-based tests generated from the function signature, plus any official tests you already have. When the LLM writes code that passes the official tests but fails on edge cases, we catch it. Tested on the 164-problem HumanEval benchmark: 99.3% of wrong code is flagged, and repair raises the pass rate by 3 points.
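To make the "passes the official tests but fails on edge cases" scenario concrete, here is a minimal property-based check written with only the standard library. The buggy `llm_max` function and the `find_counterexample` helper are invented for this example; Carnot's own test generation is not shown and may work differently.

```python
import random


def llm_max(xs: list[int]) -> int:
    """Hypothetical LLM-written function with an edge-case bug."""
    best = 0  # bug: wrong whenever every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


def find_counterexample(fn, trials: int = 500, seed: int = 0):
    """Search random inputs for a violation of: fn(xs) is the largest element of xs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 5))]
        result = fn(xs)
        if result not in xs or any(x > result for x in xs):
            return xs  # first failing input found
    return None


official_ok = llm_max([3, 9, 2]) == 9          # the "official" test passes
counterexample = find_counterexample(llm_max)  # a failing input, or None
```

The property is stated without reference to a known-good implementation, which is what lets it be generated from a signature and a short description of intent.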

Maths & arithmetic

Every equation in the answer gets extracted and checked by a formal solver (Z3). Wrong sum, bad algebra, inconsistent units — flagged.

Typed constraints

For structured outputs (JSON, function arguments, tool calls), Carnot checks type shapes and value bounds. Measured results: +4.9 points over the raw model on a typed-constraint benchmark, and completion rising from 4% to 12% on the CCTU constrained tool-use micro-benchmark.
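A type-and-bounds check on a tool call might look like the sketch below. The schema format (a tuple of type, minimum, maximum per field), the `check_arguments` function, and the example tool arguments are all assumptions made for illustration, not Carnot's schema language.

```python
import json


def check_arguments(args: dict, schema: dict) -> list[str]:
    """Return human-readable violations of type and bound constraints."""
    violations = []
    for name, (expected_type, lo, hi) in schema.items():
        if name not in args:
            violations.append(f"{name}: missing")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # (this sketch has no boolean fields).
        if isinstance(value, bool) or not isinstance(value, expected_type):
            violations.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
            continue
        if lo is not None and value < lo:
            violations.append(f"{name}: {value} is below minimum {lo}")
        if hi is not None and value > hi:
            violations.append(f"{name}: {value} is above maximum {hi}")
    return violations


# Hypothetical tool: book a trip of 1 to 14 days.
SCHEMA = {"city": (str, None, None), "days": (int, 1, 14)}
call = json.loads('{"city": "Oslo", "days": 30}')
problems = check_arguments(call, SCHEMA)
```

Returning messages rather than a bare boolean matters for repair: the violation text is what gets sent back to the model as feedback.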

Multi-step reasoning

For chain-of-thought answers, each step is checked against the ones that came before. A verified-fact memory catches the classic "confident but inconsistent" failure mode where individual steps look correct in isolation but contradict one another. The production verifier is an ensemble of complementary checks (symbolic, energy-based, and consistency-based) AND-composed so a candidate must pass all of them.
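The two ideas in this paragraph, a memory of already-stated facts and AND-composition of checks, can be sketched together. This is a toy under stated assumptions: steps are strings of the form "name = value", and the helper names are hypothetical. The production ensemble combines symbolic, energy-based, and consistency-based checks that are not reproduced here.

```python
from typing import Callable

Check = Callable[[list[str]], bool]


def no_contradictions(steps: list[str]) -> bool:
    """Consistency check: a name may not be bound to two different values."""
    facts: dict[str, str] = {}  # the verified-fact memory
    for step in steps:
        name, _, value = step.partition("=")
        name, value = name.strip(), value.strip()
        # setdefault stores the first value seen and returns it on later visits.
        if facts.setdefault(name, value) != value:
            return False
    return True


def all_steps_nonempty(steps: list[str]) -> bool:
    """Trivial structural check, included to show composition."""
    return all(step.strip() for step in steps)


def and_compose(checks: list[Check]) -> Check:
    """A candidate passes only if every individual check passes."""
    return lambda steps: all(check(steps) for check in checks)


verifier = and_compose([all_steps_nonempty, no_contradictions])
consistent = verifier(["x = 5", "y = 7", "x = 5"])
inconsistent = verifier(["x = 5", "y = 7", "x = 6"])  # each step looks fine alone
```

AND-composition trades recall for precision: adding a check can only make the ensemble stricter, never more permissive.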

Research operations

Carnot is developed through an autonomous research loop: a planner generates experiment proposals, an inner agent runs each experiment, and an adversarial-verify pass catches fabricated or methodology-incomplete artifacts before they influence headline claims. Every experiment artifact is stored as a structured JSON record with reproducibility metadata.

APIs & portable memory

Carnot exposes its verifier and repair capabilities through multiple integration surfaces: a Python API, a command-line tool, an MCP server, and an HTTP REST endpoint. Verdicts are structured VerdictRecord objects. Streaming verification, candidate-level reranking, and portable SessionMemory packs (export, import, diff, merge) let callers integrate the framework without lock-in to a single surface.

Test-Time Compute (TTC) & PREM

Carnot includes a dynamic budget controller that scales Test-Time Compute (TTC) based on the variance of Process-Reward Energy Model (PREM) scores. The project frames this variance signal as a form of intrinsic motivation for continuous self-learning.
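A variance-driven budget rule could take the shape below. This is a sketch only: the direction of the mapping (more variance, more samples), the linear `scale` factor, and the clamping range are assumptions made for this example, and `sample_budget` is a hypothetical name rather than Carnot's controller.

```python
from statistics import pvariance


def sample_budget(energies: list[float], min_samples: int = 1,
                  max_samples: int = 16, scale: float = 40.0) -> int:
    """Map the variance of per-step energies to a clamped sample count."""
    if len(energies) < 2:
        return min_samples  # variance is not informative for a single score
    extra = round(scale * pvariance(energies))
    return max(min_samples, min(max_samples, min_samples + extra))


confident = sample_budget([0.10, 0.10, 0.10])    # scores agree: minimum budget
uncertain = sample_budget([0.0, 1.0, 0.0, 1.0])  # scores disagree: larger budget
```

Clamping to a fixed range keeps the worst-case cost bounded even when the scores are wildly inconsistent.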

Evidence

What we measured

Every number below is backed by a checked-in experiment artifact. Model benchmark rows use live GPU inference on public models; infrastructure, hardware, ensemble, synthetic-pilot, and adversarial-audit rows are labeled by provenance. Synthetic pilots are included only when the card says so; they are not model-generation headline claims.

Code — Verify-and-repair on HumanEval

+3.0 points on pass-rate

Typed constraints (structured outputs)

+4.9 points on compliance

Safety — Prompt-injection classifier

0.91 AUROC (publication gate)

Hardware — KV260 FPGA prototype

Ising sampler live on silicon

Training — Two-GPU parallel retrain

2.0× speedup, identical losses

Code repair — Ising-guided fuzzing on HumanEval-50

66% → 84% pass rate (+18pp)

Math reasoning — EstimationVerifier SVAMP AUC

0.90 AUC (vs 0.125 FoVer baseline)

Constrained decoding — CRANE vs rigid grammar, HumanEval-50

70% → 85% pass rate (+15pp)

Math extraction — VeriCoT equation-style CoT fix

GSM8K extraction TP rate: 0.5 → 1.0

Adversarial audit — PRM-BiasBench-style attacks

k=5 ensemble catches 60/60 attacks

Cascade routing — HalluGuard v3

0.0pp accuracy delta with 4.4% cost savings

Tool use — CCTU constrained micro-benchmark

Completion rate: 4% → 12%

Preprint

Carnot position paper (paper-v6)

An architectural framework for energy-based verification of LLM output. The current draft anchors its claims to checked-in experiment artifacts and labels every benchmark number with provenance. The headline FoVer verification AUROC (0.9131) has been independently re-computed from a clean continuous-integration checkout in a non-operator environment and landed within the published confidence interval — the central result is externally reproducible. The arXiv submission is prepared but pending operator-initiated upload.

Read paper (PDF) · LaTeX source · arXiv bundle

Try it

Install and verify in five lines

Install with pip, point it at an LLM response, and get back a list of specific violations — plus, if you want, a repaired answer. Works with any model you can call; the examples below use a small public Qwen checkpoint so you can run them on a single GPU.

from carnot.pipeline import VerifyRepairPipeline

# Verify any LLM output in 3 lines
pipeline = VerifyRepairPipeline()
result = pipeline.verify("What is 47 + 28?", "47 + 28 = 76")
print(result.verified)   # False
print(result.violations)  # [47 + 28 = 76 (correct: 75)]

# Auto-repair with LLM feedback loop
fixed = pipeline.verify_and_repair(
    "What is 47 + 28?", model="Qwen/Qwen3.5-0.8B"
)
print(fixed.final_response)  # "47 + 28 = 75"

// Rust crates: Ising energy and Langevin sampling
use carnot_ising::{IsingModel, IsingConfig};
use carnot_samplers::langevin::{LangevinSampler, LangevinConfig};

// Initialize Ising model
let config = IsingConfig::new(1024);
let model = IsingModel::new(config);

// Compute energy
// (`activations` is the state to score; it must be defined by the caller)
let energy = model.energy(&activations);
println!("Energy: {energy}");

// Sample via Langevin dynamics
let sampler = LangevinSampler::new(LangevinConfig::default());
let samples = sampler.sample(&model, 100);

→ Walk through the 30-minute tutorial to build a reasoning-error-catching verify-and-repair function end-to-end.