Benchmark AI Agents on
Property-Based Bug Discovery
PBT-Bench evaluates whether AI agents can discover hidden semantic bugs in Python libraries using property-based testing with Hypothesis — guided solely by official API documentation.
A different kind of benchmark
Most code benchmarks ask AI to fix a known bug or pass visible test gaps. PBT-Bench asks a harder question: can an agent discover bugs that no one told it about?
Bugs Invisible to Inspection
Every injected bug survives a 10-minute code review and passes all existing unit tests. Discovery requires systematic semantic reasoning, not grepping the source.
Documentation-Driven
Agents receive official API documentation as their primary oracle. Bugs hide in the gap between what the docs promise and what the implementation delivers.
Automated F→P Evaluation
The Fail-to-Pass criterion requires zero human judgment: a test must FAIL on the buggy library and PASS on the fixed version. Fully reproducible by anyone.
Hidden Ground Truth
Reference property tests are never shown to agents. Any test function that independently achieves F→P for a bug counts as a discovery.
How It Works
Bug Injection
Semantic bugs are injected into real Python libraries via unified diff patches. The bugs are designed to be undetectable through code inspection and invisible to all existing unit tests — they only surface through systematic property testing.
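For intuition, here is what a one-line semantic-bug patch looks like in the unified diff format the benchmark uses for injection. The `clamp` function and the off-by-one bug are illustrative inventions, not taken from the benchmark's problem set; the diff is generated with the standard library's `difflib`:

```python
import difflib

# Hypothetical example: a one-line semantic bug expressed as a unified diff.
# The fixed version clamps inclusively; the buggy version silently
# excludes the upper bound.
fixed = [
    "def clamp(x, lo, hi):\n",
    "    return max(lo, min(x, hi))\n",
]
buggy = [
    "def clamp(x, lo, hi):\n",
    "    return max(lo, min(x, hi - 1))\n",  # the injected bug
]

patch = "".join(difflib.unified_diff(fixed, buggy, "a/clamp.py", "b/clamp.py"))
print(patch)
```

A patch like this changes behavior only at the boundary (`clamp(10, 0, 10)` returns 9 instead of 10), which is exactly the kind of divergence a boundary-hitting property test surfaces and a spot-check misses.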
Documentation as Oracle
The agent receives official API documentation for the target library. Its task is to infer semantic invariants from the docs: contracts the library promises to uphold. Source code is accessible but intentionally insufficient.
Property Test Design
The agent writes Hypothesis @given property tests checking semantic contracts — roundtrip consistency, commutativity, idempotency, algebraic laws, and more. Crafting focused input strategies from documentation is the critical challenge.
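A minimal property test in this style might look like the following. The target (the standard library's `json` module) and the roundtrip property are illustrative choices, not drawn from the benchmark's problem set:

```python
import json
from hypothesis import given, strategies as st

# Roundtrip property: whatever json.dumps serializes, json.loads must
# recover unchanged. The strategy generates JSON-object-shaped inputs
# (string keys, integer values).
@given(st.dictionaries(st.text(), st.integers()))
def test_json_roundtrip(d):
    assert json.loads(json.dumps(d)) == d

# Calling a @given-decorated function with no arguments runs the
# Hypothesis engine over many generated inputs.
test_json_roundtrip()
```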
F→P Evaluation
Each test function is evaluated independently against each (buggy_lib, fixed_lib) pair. A test "finds" a bug when it FAILs on the buggy version and PASSes on the fixed version. Bug recall is computed per problem.
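The criterion reduces to a few lines of logic. This is a sketch of the idea, not the benchmark's actual evaluation code, and the field names are illustrative:

```python
# F→P criterion: a test discovers a bug iff it fails on the buggy
# library and passes on the fixed one. Any other combination means the
# test is either too weak (P/P) or broken on both versions (F/F).
def finds_bug(result_on_buggy: str, result_on_fixed: str) -> bool:
    return result_on_buggy == "FAIL" and result_on_fixed == "PASS"

def bug_recall(outcomes) -> float:
    # outcomes: one (result_on_buggy, result_on_fixed) pair per injected
    # bug, taking the best-performing test for each bug.
    found = sum(finds_bug(b, f) for b, f in outcomes)
    return found / len(outcomes)

assert finds_bug("FAIL", "PASS")      # genuine discovery
assert not finds_bug("FAIL", "FAIL")  # test broken on both versions
assert not finds_bug("PASS", "PASS")  # test misses the bug
```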
Two Evaluation Tracks
Unconstrained Testing
The agent may use any approach: unit tests, code inspection, manual exploration, or arbitrary test generation. This measures what's achievable with current best practices — no Hypothesis required.
Property-Based Testing
The agent must use the Hypothesis @given decorator with custom input strategies. The key skill is deriving precise strategies from API documentation rather than defaulting to random inputs.
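To illustrate what "doc-derived" means here, suppose (hypothetically — this library and threshold are invented for the example) the docs state that payloads longer than 64 bytes are split into multiple frames. Default `st.binary()` hits that edge only occasionally; a strategy pinned to the documented boundary exercises it on every example:

```python
from hypothesis import given, strategies as st

default_payloads = st.binary()                            # unfocused
boundary_payloads = st.binary(min_size=63, max_size=66)   # straddles the documented 64-byte edge

@given(boundary_payloads)
def test_framing_roundtrip(payload):
    # Illustrative property at the threshold: splitting into 64-byte
    # frames and rejoining must reproduce the original payload.
    frames = [payload[i:i + 64] for i in range(0, len(payload), 64)]
    assert b"".join(frames) == payload

test_framing_roundtrip()
```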
Problem Set
The current evaluated set spans 28 Python libraries across diverse domains. Each problem contains injected semantic bugs organized into four difficulty levels. Results are reported for Claude Sonnet 4.6 and GLM-5 across both the Baseline and PBT tracks.
Date & Time
Math & Scientific
Type Systems
Parsing & Documents
Data Structures
Config & Graphs
Encoding & Algorithms
Difficulty Levels
| Level | Description | Example trigger |
|---|---|---|
| L1 | Default Hypothesis strategies are sufficient to find the bug | Integer boundary overflow (uint16 sign error at [32768, 65535]) |
| L2 | A non-default, documentation-derived strategy is required | Size thresholds, specific parameter combinations |
| L3 | Cross-function state sequences or algebraic property chains | FSM callback ordering, precision thresholds, protocol conformance |
| L4 | External specification or engineering knowledge required | DER/OID encoding standards, RFC compliance checks |
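The L1 example from the table can be made concrete with a small stdlib-only sketch (the functions are invented for illustration, not from the benchmark): a sign error in which two raw bytes are unpacked as a *signed* 16-bit integer instead of an unsigned one. The bug is invisible below 32768, so default integer strategies find it quickly but a handful of hand-picked small inputs never would:

```python
import struct

# Hypothetical L1-style bug: reading two big-endian bytes as a uint16.
def read_uint16_buggy(data: bytes) -> int:
    return struct.unpack(">h", data)[0]   # signed short — the bug

def read_uint16_fixed(data: bytes) -> int:
    return struct.unpack(">H", data)[0]   # unsigned short — correct

# The two agree everywhere below the boundary...
assert read_uint16_buggy(b"\x00\x2a") == read_uint16_fixed(b"\x00\x2a") == 42
# ...and diverge for every value in [32768, 65535].
assert read_uint16_fixed(b"\x80\x00") == 32768
assert read_uint16_buggy(b"\x80\x00") == -32768
```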
Get Started
Requirements
- Python 3.12
- uv package manager
- Docker (for isolated evaluation environments)
- An LLM API key — Anthropic, OpenRouter, or compatible
1 — Clone and install dependencies
# Clone the repository
git clone https://github.com/ElliotXinqiWang/pbt-bench.git
cd pbt-bench
# Create a Python 3.12 virtual environment
uv venv .venv --python 3.12
# Install the evaluation framework (vendored OpenHands SDK)
uv pip install \
vendor/software-agent-sdk/openhands-sdk \
vendor/software-agent-sdk/openhands-tools \
vendor/software-agent-sdk/openhands-workspace \
vendor/software-agent-sdk/openhands-agent-server
2 — Configure your LLM
// eval/llm_config.json
{
"model": "anthropic/claude-sonnet-4-6",
"api_key": "your-api-key-here"
}
OpenRouter and local model configs are also supported — see llm_configs/ in the repo for examples.
3 — Run the benchmark
# Edit MODE, MAX_WORKERS, N_LIMIT at the top of run_eval.sh
bash run_eval.sh
# Or run a single track directly:
.venv/bin/python3 eval/run_pbt.py eval/llm_config.json \
--n-limit 5 --max-iterations 40 --note my_first_run
Output
Results are written to eval_outputs/ as streaming JSONL — crash-safe and
resumable. Each run includes per-problem agent traces, test files written by the agent,
and a summary of F→P counts per bug.
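Because each line of a streaming JSONL file is a self-contained JSON object, a crashed run loses at most its final partial line and results can be tallied with a few lines of stdlib code. The record fields below are illustrative, not the benchmark's actual schema:

```python
import io
import json

# Hypothetical two-record result stream (real runs read from eval_outputs/).
stream = io.StringIO(
    '{"problem": "lib-a", "bug_id": 1, "f_to_p": true}\n'
    '{"problem": "lib-a", "bug_id": 2, "f_to_p": false}\n'
)

# Parse line by line, skipping blanks — each line is one JSON object.
records = [json.loads(line) for line in stream if line.strip()]
found = sum(r["f_to_p"] for r in records)
print(f"{found}/{len(records)} bugs discovered")
```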