Carnot Dogfooding by the Numbers

2026-04-29 ~7 min read

For 26 days, every experiment that Carnot's research conductor generates has passed through Carnot's own constraint-extraction and Z3 arithmetic verification before commit. We just looked at the numbers from that loop. Here is what 639 self-verified experiments tell us about constraint-based code analysis in production.

The numbers

639

Experiments self-verified

Brace bugs auto-fixed

False-positive code violations

False-positive Z3 violations

Two numbers worth noting: the real bugs caught (65 brace-pattern fixes) and the false positives raised (zero). On 639 experiments. Across 26 days.

What dogfooding means here

Carnot is, among other things, an open-source framework for verifying LLM-generated code using constraint-based energy models — Z3 SMT for arithmetic claims, type-system checks for structural correctness, property-based fuzzing for semantic execution, and a few more layers we will write about elsewhere.

The unusual thing about Carnot is that it generates a lot of its own code. The project's autoresearch conductor (scripts/research_conductor.py) drives an LLM-based research loop: each milestone, the conductor proposes ~10-15 experiments, an agent (Sonnet by default, Opus for harder cases) writes each experiment as a Python script, the script runs, results are committed, and the cycle repeats.

This means the project produces a steady stream of LLM-generated Python and YAML that needs to actually work — not as a benchmark output, but as production code committed to main. Hallucinated function signatures, malformed YAML, broken arithmetic claims in docstrings: any of these breaks the next iteration of the loop.

So: we run Carnot's own verification stack on Carnot's own generated code, before commit. Eat your own cooking. The verification step lives at scripts/research_conductor.py:_dogfood_verify_generated_code() and runs after every successful experiment. The data we are about to discuss lives at ops/dogfood-memory.json.

What the pipeline checks

After each generated experiment commits, the conductor runs five checks in sequence. Each is cheap (a few hundred milliseconds) and persists state to the dogfood memory:

Brace validation on YAML prompts. The conductor uses templated YAML prompts with placeholders like {project_root} and {date}. If the LLM hallucinates a literal {some_other_thing} in a generated prompt, Python's .format() blows up at runtime. We pre-validate by attempting .format(project_root="/test", date="20260101"). If it raises, we auto-fix the offending braces by replacing {x} with (x).
CodeExtractor static analysis on new .py files. Walks the AST, extracts type signatures, return-value bounds, and call-site constraints. Compares against any constraints declared in docstrings.
Z3ArithmeticExtractor. Parses arithmetic claims in code comments and docstrings (e.g., "# this should be in [0, 1]"), encodes them as Z3 formulas, and proves the claim against the surrounding code. Zero false positives by construction — Z3 is a decision procedure, not a heuristic.
LLMConstraintExtractor. An LLM-based pass over docstrings for natural-language claims that don't reduce to Z3 arithmetic ("# this function is symmetric"). Lower precision than Z3 (1 false positive per ~91 in our tests), but catches things Z3 can't.
Constraint tracker update. Persists what we found to ops/dogfood-memory.json so future runs build on prior knowledge.

What 65 caught bugs look like

All 65 caught bugs were in category 1: brace-pattern errors in YAML-templated prompts. The pattern that recurs:

tasks:
  - id: exp1234-some-experiment
    prompt: |
      You are working on the Carnot framework.
      The goal is to compute {arbitrary_thing} from {input_data}.
      ...

The LLM, when generating a new experiment task, inserted natural-language placeholders like {arbitrary_thing} that look reasonable in context but break str.format() downstream when the conductor tries to substitute {project_root} and {date}.

The dogfooding pipeline catches each one before it can propagate. The auto-fix is conservative: replace each non-template brace with parentheses ({arbitrary_thing} → (arbitrary_thing)), so the prompt stays human-readable but no longer crashes .format().

Without this pipeline, every one of those 65 bugs would have shown up as a stillborn experiment in the next iteration — the conductor's .format() call would raise KeyError, the experiment would skip, and someone (us) would have to chase down the YAML manually. At one bug every ~10 experiments, that's roughly an interruption every 2-3 hours of conductor wall-clock.

The hidden cost of LLM-generated code is the long tail of low-grade bugs. Each one is trivial. The aggregate is operationally crippling. Constraint-based pre-flight verification is the right tool because the failures are repetitive and rule-based, not creative.

What zero false positives means

Across 639 experiments, the CodeExtractor, Z3ArithmeticExtractor, and LLMConstraintExtractor combined raised zero false positives that survived to a flagged commit. (We track these as code_violations, z3_violations, and llm_violations in the memory file. All three are at zero.)

This is partly by design. Z3 is a decision procedure, so its claims are formally provable; false positives there would be a bug in our encoding. The CodeExtractor checks types and bounds extracted from the code itself against assertions in the same code, so a violation is unambiguously real. The LLM extractor is the noisy one, but we tuned its threshold during early development to err toward false negatives (silent passes) rather than false positives (noisy interruptions).

The trade-off is that we know we miss some bugs — but we also know we don't generate operator interruptions. In a research loop where the conductor runs unattended overnight, that asymmetry is the right call.

What this isn't

We want to be careful not to overclaim. The 639/65/0/0 numbers say nothing about Carnot's verification capability on arbitrary code. Specifically:

The 639 experiments are mostly Carnot-generated, not arbitrary. They follow a template (scripts/experiment_template.py), have predictable structure, and exercise a narrow distribution of patterns. The dogfooding pipeline learned its checks from this distribution.
The Z3 arithmetic claims are usually simple. Things like this counter is non-negative or this index is in [0, n]. We have not yet verified anything as gnarly as a deep numerical proof.
The pipeline is not a CI guardrail. It runs after-the-fact on the conductor's own generated code, not on every PR. A separate pre-commit hook stack handles arbitrary-developer commits.
Operator-time bugs are still possible. The pipeline catches generated-code bugs. Operator-introduced bugs (this morning's verdict-reproducibility incident; an inverted AUROC implementation discovered yesterday) require different machinery.

Why we are publishing this

Foundation-model self-distillation discussions tend to focus on the largest, most visible parts: the model, the pretraining data, the alignment loop. Less visible — but operationally load-bearing — is the question of whether the system that generates code as part of training can verify its own output as a precondition for the next iteration.

Our forthcoming position paper argues that for verifier-filtered self-distillation to converge to truth, the verifier suite must satisfy three structural axioms (Logically Grounded Discrete Domains: discrete Boolean topology, epistemic orthogonality, absolute oracular grounding). The dogfooding pipeline is a sub-scale demonstration of the operational discipline this requires. 639 experiments verified at zero false-positive cost is what the discipline looks like in practice.

The operational rule: if your training loop generates code that the next iteration depends on, the same energy framework you apply to the model's outputs should apply to the loop's own internals. Eat your own cooking. The data above is what 26 days of doing that looks like.

What's next

Three directions we're tracking:

Recursive Phase 3 application. Once the multi-verifier rotation architecture lands (currently in change-proposal scope), apply it recursively to Carnot's own development loop — rotate the conductor through orthogonal verifier suites, measure $\Phi$ on Carnot's own output stream. The forthcoming position paper formalises this.
Reproducibility audit. Yesterday we observed a flagship verdict change between two reruns of the same experiment (different random seeds). The dogfooding pipeline doesn't currently catch verdict instability; a sibling proposal scopes that audit for the next milestone.
External adoption. The dogfood pipeline has been load-bearing for 26 days; we plan to package the Z3ArithmeticExtractor and CodeExtractor as a standalone library so other LLM-research projects can reuse the discipline.