Agentic Property-Based Testing

Get a coding agent to find bugs in your codebase by mining properties and testing them via Hypothesis.

Repository layout

src/                              All project code
├── orchestrators/                Pipeline entry points (python src/orchestrators/<name>.py …)
│   ├── run.py, formalize.py        Python pipeline (Stages 1 + 2)
│   └── run_c.py, formalize_c.py    C pipeline (Stages 1 + 2)
├── lib/                          Shared library modules (triviality_filter.py, c_oracle.py)
├── tools/                        Standalone CLIs (export_corpus, formalize_review, scoring,
│                                  setup_lean, dedupe_bug_reports, run_check_issues)
└── prompts/                      Claude Code agent slash-command prompts (.md)

inputs/                           Things you curate, pipelines consume
├── c-targets/                    C source under test + manifest.json
└── example_packages/             Python-pipeline package manifests

output/                           Things pipelines produce (mostly gitignored)
├── c-harnesses/<lib>/<call_id>/    Stage-1 C artifacts + Stage-2 oracle scripts
│                                     (.c/.sh/.py source-controlled; .so/_runner gitignored)
├── results/<pkg>/                  Python-pipeline bug reports, aux_files, logs
├── envs/                           virtualenvs (gitignored)
└── worker_*/, worker_c_*/          per-worker scratch (gitignored)

lean-project/                     Lean 4 workspace (Lake)
├── Spec/Properties/<pkg>/<id>/     Python-pipeline Lean translations
├── Spec/CProperties/<lib>/<id>/    C-pipeline Lean translations
└── *_formalization_results.jsonl   per-record outcomes (gitignored)

tests/                            Pytest test suite (+ conftest.py)
hypo-plugin/                      Claude Code plugin distribution
paper/                            Published-paper artifacts (frozen)
plan.md, README.md                Top-level docs

For the artifacts from the paper, including bug reports and rankings, see the paper directory. Note that the code that was used in the paper is slightly behind what is in the main folder. See paper/README.md for more details.

To see all the bugs our agent found, see our website.

To read the blog post on the Hypothesis website, see here.

Setup

Prerequisites

Python 3.10+ with pip on PATH.
uv for virtual environments and dependency installs (the orchestrators that build sandbox venvs still use venv/pip internally, but project-level installs use uv).
Claude Code — the agent is invoked as a Claude Code slash command, so the CLI must be on PATH. You also need either an active Claude Code subscription or an ANTHROPIC_API_KEY (recommended for larger sweeps and to reproduce the paper).
git for cloning. (Optional) elan is installed automatically by src/tools/setup_lean.py if you opt into the Lean pipeline; you do not need it preinstalled.

Install

git clone git@github.com:StanfordLeanClub/software-verification.git
cd software-verification

# Project venv + Python dependencies
uv venv .venv
uv pip install --python .venv/bin/python tqdm pytest hypothesis

# Install the /hypo slash command so Claude Code can find it.
# Either symlink/copy the prompt into ~/.claude/commands/ for global use,
# or into ./.claude/commands/ for project-local use.
mkdir -p ~/.claude/commands
cp src/prompts/hypo.md ~/.claude/commands/hypo.md

The orchestrators in src/orchestrators/ only use the Python standard library, so once Claude Code and python3/pip are on PATH they will run; the uv pip install above is required only for the tools under src/tools/ (scoring, corpus export, formalize review) and for the test suite under tests/.

Optional — Lean toolchain

Skip this unless you want to run the Stage-2 Python→Lean or C→Lean pipelines.

# Installs elan + the Mathlib oleans into lean-project/. Takes a few minutes.
python src/tools/setup_lean.py

Smoke test

# Run the agent against one stdlib function — should produce a bug report
# (or a clean run) under output/results/ in a few minutes.
claude "/hypo statistics.median"

To run the multi-package orchestrator instead, see Agent runner below.

Running the agent

The agent is a Claude Code command. You will need to have Claude Code installed to run it. You will need a subscription to Claude Code, or an API key (we recommend an API key if you are running it over a large number of packages, or to reproduce the paper).

The command is contained in src/prompts/hypo.md. You will need to place this file in the .claude/commands/ directory, which can either be in ~ or in whichever directory you are running the agent from. The agent can then be invoked with /hypo <target>.

You will need pytest, hypothesis, and the package you are testing installed.

The agent takes one argument, which is the target to test. This can be a file, a function, or a module. If no argument is given, it will test the entire codebase, i.e., the current working directory. You can pass whichever other arguments that Claude Code supports, like the model, permissions, etc.

Example usage:

claude "/hypo numpy"
claude "/hypo statistics.median" --model opus

You can also just start Claude Code, and then invoke the agent.

Agent runner

The src/orchestrators/run.py script is a wrapper around the agent to test multiple packages, in parallel. It is what was used in the paper. This script does not require any other requirements beyond the standard library (of course, you still need to have Claude Code installed). You need python3 and pip to be in your PATH.

Note that the runner operates at the module level.

The only required argument is the path to a json file containing the packages to test, and which modules to test within each package. It looks like:

{
    "pathlib": {
        "type": "stdlib",
        "modules": ["pathlib"]
    },
    "numpy": {
        "type": "pypi",
        "modules": ["numpy"]
    }
}

The keys in the json file are the package names, either the standard library name or the PyPI name. For standard library packages, specify "stdlib", and for PyPI packages, specify "pypi". This is important so the runner knows how to set up the virtual environment.

The runner takes two optional arguments:

--max-workers: the number of parallel workers to use. Default is 20.
--model: the model to use. Default is "opus".
--preinstall-workers: the number of parallel workers to use for setting up the virtual environments. Default is 10.

The runner will output all bug reports in the results/ directory.

Example usage:

python src/orchestrators/run.py packages.json

In the inputs/example_packages/ directory, there are some example package json files to test:

packages_mini.json: a mini set of modules to test (this took 6 minutes to run, with default settings)
packages_10k.json: top 10,000 pypi packages, with the main module and all submodules one level deep

The packages tested in the paper are in the paper/ directory.

How the agent works

The runner sets up virtual environments, with venv, for each package. Standard library packages just use the same virtual environment, and PyPI packages get their own virtual environment. The runner will also install pytest and hypothesis in each virtual environment. It does this in parallel, which is controllable; see the CLI arguments below.

It then then sets up directories, up to a specified number of maximum workers (see CLI arguments below), which is a "sandbox" for the agent to run in. It only has permission to edit files within this sandbox. Each worker directory also contains .claude/commands/hypo.md, so that the agent can run. The runner parallelizes across modules.

Note that the runner also checks if the module has already been tested, and skips it if so. So, you can easily resume a run by just running the runner again.

Security

The runner calls the agent with restricted permissions. It only has permission to read/write/edit files in the sandbox in which it is called, and it also has read permission to the virtual environment, so that it can read the source code of the package. Furthermore, it can only write/edit .py and .md files. The only bash commands it can run are python and pytest. Note that because of how the virtual environments are set up, the Python command will be python. Lastly, it also has access to the Todo and WebFetch tools.

You should still be careful with the runner, because running arbitrary code is dangerous!

Outputs

In the results/ directory, there will be a directory named after the package. Each of these will have the following structure:

bug_reports/
- <all bug reports written by the agent>
logs/
- claude_call_$id.json <the log of the Claude Code call corresponding to this id>
aux_files/
- $id
  - <all other files written by Claude Code during the Claude Code call corresponding to this id, e.g., Python files>
call_mappings.jsonl, with the following format:
- call_id: the Claude Code call id
- module: the module tested
- timestamp: date executed
- bug_reports: the filename of any bug reports in the bug_reports directory written by this Claude Code call
- aux_files_dir: the directory containing all files written by the agent during the Claude Code call corresponding to this id

Python → Lean translation + theorem corpus

A post-processing pass takes the properties /hypo already tested and produces, for each, a real Lean 4 def (not an opaque axiom) behaviourally equivalent to the Python original, plus one or more theorem statements about that def. The pipeline is designed as the autoformalisation half of a two-project plan whose other half is automated proving — every theorem here ends in := by sorry, and the deliverable is a (Python source, Lean def, theorem statement) corpus for downstream prover training.

What this IS: an equivalence-checked corpus. For every clean record, the Lean def has been run against ~200 Hypothesis-generated inputs and agrees with Python on all of them.

What this is NOT (full inventory in lean-project/README.md):

Behavioural equivalence on a finite sample, not a formal proof of Python-Lean equivalence.
Proofs are all sorry. Proving them is downstream work.
Functions involving IO, mutation, opaque library types, or non-JSON-serializable input domains become documented holes rather than being silently mis-translated.
Shipping verified Lean as the runtime (FFI) is deferred to future work.

Setting up Lean

# Project venv (uses uv):
uv venv .venv
uv pip install --python .venv/bin/python tqdm pytest hypothesis

# One-time Lean toolchain install (elan + Mathlib oleans, ~few minutes):
python src/tools/setup_lean.py

The pipeline

# Stage 1: existing PBT (finds bugs, leaves Hypothesis tests in aux_files/)
python src/orchestrators/run.py inputs/example_packages/packages_mini.json

# Stage 2: rigorous Python -> Lean translation + theorem proposal
python src/orchestrators/formalize.py results/                       # all packages
python src/orchestrators/formalize.py results/ --package json        # one package
python src/orchestrators/formalize.py results/ --max-workers 10 --model sonnet

# Stage 3 (optional): sample-review for theorem-statement correctness
python src/tools/formalize_review.py sample --n 20           # pick unreviewed
python src/tools/formalize_review.py show <call_id> <prop>   # display one
python src/tools/formalize_review.py mark <call_id> <prop> \
    --verdict correct --notes "..."                # record verdict
python src/tools/formalize_review.py stats                   # coverage + agreement

# Stage 4: flatten to a single training JSONL
python src/tools/export_corpus.py --output corpus_v1.jsonl   # clean records only

Outcomes are recorded to lean-project/formalization_results.jsonl with explicit status codes for every failure mode (translation_hole, non_json_input_domain, python_non_deterministic, translation_disagreement, all_theorems_hole, agent_unclear).

Spec.lean's import block is regenerated at the end of every run so cd lean-project && lake build builds every formalised property.

Per-record artifact layout

lean-project/Spec/Properties/<Pkg>/<CallId>/
├── <Fn>.lean                       Lean def + theorems (all sorry)
├── <Fn>_Runner.lean                JSON-IPC harness for the oracle
└── translation_hole_<Fn>.md        only if partial-translation gaps exist

output/results/<pkg>/aux_files/<call_id>/
├── translation_check_<fn>.py       Hypothesis-driven Python<->Lean oracle
├── translation_disagreement_<fn>.md   only on oracle failure
└── bug_report_<fn>_<ts>_<hash>.md  only if agent suspects a Python bug

A record is clean iff lake_build_ok && translation_check_ok && some non-trivial theorem. Downstream training (export_corpus.py --filter clean) consumes only these.

Design rationale

plan.md — current implementation plan (C-pipeline port)
lean-project/README.md — translation conventions + hole inventory

C → Lean translation + theorem corpus

A sister pipeline targets C functions instead of Python. Architecture is identical to the Python side; the difference is at the leaves:

Stage 1 (run_c.py + /hypo_c) reads the manifest at inputs/c-targets/manifest.json, dispatches one agent per (library, fn_name), and produces a fuzz-style harness (<fn>_runner_main.c), a build script (build.sh), a Hypothesis bootstrap (pbt_<fn>.py), and a fuzztest template (harness_<fn>.cc) under c-harnesses/<lib>/<call_id>/.
Stage 2 (formalize_c.py + /formalize_c) compiles the C harness, dispatches the agent to translate C → Lean, then independently verifies via lake build, the triviality filter, and a C↔Lean differential oracle (translation_check_<fn>.py).
Outcomes recorded to lean-project/c_formalization_results.jsonl; Lean files under lean-project/Spec/CProperties/<lib>/<call_id>/.

The pipeline ships with demo/gcd as a worked example end-to-end, plus a starter manifest of 9 candidate functions across 4 libraries. See plan.md section H for the full verification suite.

The C pipeline

# Stage 1: dispatch /hypo_c per (library, fn) in the manifest
python src/orchestrators/run_c.py inputs/c-targets/manifest.json --library demo --fn gcd
python src/orchestrators/run_c.py inputs/c-targets/manifest.json --max-workers 4    # full sweep

# Stage 2: translate C -> Lean, run the oracle
python src/orchestrators/formalize_c.py inputs/c-targets/manifest.json --library demo --fn gcd

# Both pipelines share the corpus exporter and review CLI.

Status enums for the C pipeline include c_non_deterministic, ub_in_strategy_domain, and build_failure in place of the Python pipeline's python_non_deterministic and non_json_input_domain. The language-independent statuses (formalized, translation_disagreement, translation_hole, all_theorems_hole, agent_unclear) are shared.

Ranking the bug reports

To score the bug reports, you can run python src/tools/scoring.py results/. This uses the rubric contained in that file, and passes it to Claude (not Claude Code, just the Claude API). This script outputs a CSV file containing the scores for each bug report, as well as the reasoning.

It takes the following arguments:

--retry-failures: if set, it will retry the bug reports that failed to score. This requires the CSV file to already exist, as it checks for failed scores in the CSV file.
reports_dir: the directory containing the bug reports to score. Default is "results/".
--max-workers: the number of parallel workers to use. Default is 20.
--model: the model to use. Default is "claude-opus-4-1" (note model names are different when using the Claude API directly)
--csv-path: the path to the CSV file to write the results to. Default is "scoring_results.csv".

Example usage:

python src/tools/scoring.py results/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agentic Property-Based Testing

Repository layout

Setup

Prerequisites

Install

Optional — Lean toolchain

Smoke test

Running the agent

Agent runner

How the agent works

Security

Outputs

Python → Lean translation + theorem corpus

Setting up Lean

The pipeline

Per-record artifact layout

Design rationale

C → Lean translation + theorem corpus

The C pipeline

Ranking the bug reports

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.claude-plugin		.claude-plugin
hypo-plugin		hypo-plugin
inputs		inputs
lean-project		lean-project
output/c-harnesses/demo/b9c08d6c		output/c-harnesses/demo/b9c08d6c
paper		paper
src		src
tests		tests
.gitignore		.gitignore
HANDOFF.md		HANDOFF.md
README.md		README.md
plan.md		plan.md

Folders and files

Latest commit

History

Repository files navigation

Agentic Property-Based Testing

Repository layout

Setup

Prerequisites

Install

Optional — Lean toolchain

Smoke test

Running the agent

Agent runner

How the agent works

Security

Outputs

Python → Lean translation + theorem corpus

Setting up Lean

The pipeline

Per-record artifact layout

Design rationale

C → Lean translation + theorem corpus

The C pipeline

Ranking the bug reports

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages