Benchmarks

This page summarizes the benchmark utilities for the SCM stack.

Domain benchmarks (DOSE / RF / IK)

Run the shipped, auditable benchmark domains via the first-class harness:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu

Reproduction landing page: 21_experiments.md.

Each run writes a self-contained directory under results/benchmarks/<domain>/ containing:

- manifest.json
- provenance.json (versioned schema)
- aggregated/summary.json + aggregated/summary.md
- aggregated/paired_stats.json + aggregated/paired_stats.md
- CLAIM_AUDIT.md
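The artifact layout above can be checked programmatically. This is a minimal sketch, assuming only the file names listed in this doc; the helper name `missing_artifacts` is illustrative, not part of the real harness.

```python
from pathlib import Path

# Required artifacts per run directory, as documented above.
REQUIRED = [
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
    "aggregated/summary.md",
    "aggregated/paired_stats.json",
    "aggregated/paired_stats.md",
    "CLAIM_AUDIT.md",
]

def missing_artifacts(run_dir: str) -> list[str]:
    """Return the required artifact paths that are absent from a run directory."""
    root = Path(run_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]
```

A CI step could fail the build whenever `missing_artifacts(...)` returns a non-empty list.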

Operating points in benchmarks

The DOSE domain exposes a clear operating-point trade-off:

- safety_first: prioritize a low false in-range rate on censored samples (false_in_range_rate_on_censored)
- accuracy_first: prioritize a low false censored rate on in-range samples (false_censored_rate_on_in_range)
- direction_aware: use the strict-gate + direction-head pattern to make bottom outputs actionable

Use a τ_infer sweep over cached |Q| values (or a held-out calibration set) to set the deployed thresholds; see docs/06_inference_deployment.md.
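The sweep can be sketched in a few lines. This is an illustrative implementation, assuming the convention that samples with |Q| below τ are gated to bottom (censored); the function names, the gating direction, and the target rate are assumptions, not the harness's actual API.

```python
def sweep_tau(q_abs, censored, taus):
    """For each candidate tau, treat samples with |Q| < tau as gated to bottom
    and report the false in-range rate on truly censored samples."""
    rows = []
    n_cens = sum(censored)
    for tau in taus:
        # Censored samples that the gate lets through as in-range.
        false_in_range = sum(1 for q, c in zip(q_abs, censored) if c and q >= tau)
        rows.append((tau, false_in_range / max(n_cens, 1)))
    return rows

def pick_tau(rows, target):
    """Smallest tau whose false in-range rate meets the target (safety_first)."""
    for tau, fir in rows:
        if fir <= target:
            return tau
    return None
```

An accuracy_first variant would instead sweep against the false censored rate on in-range samples.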

Parity Runner (Coverage vs Accuracy)

Runs a small synthetic regression task and writes a coverage/accuracy curve:

python benchmarks/parity_runner.py --output benchmark_results

Outputs:

- benchmark_results/parity_report.json
- benchmark_results/coverage_accuracy.png
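A coverage/accuracy curve of this kind can be computed as below. This is a generic sketch of the idea, not the parity runner's code: the confidence-score convention and the use of mean absolute error are assumptions.

```python
def coverage_accuracy_curve(abs_err, score, thresholds):
    """At each threshold, keep samples whose confidence score clears it and
    report (threshold, coverage, mean abs error on the kept subset)."""
    curve = []
    n = len(abs_err)
    for t in thresholds:
        kept = [e for e, s in zip(abs_err, score) if s >= t]
        coverage = len(kept) / n
        mae = sum(kept) / len(kept) if kept else float("nan")
        curve.append((t, coverage, mae))
    return curve
```

Raising the threshold trades coverage for accuracy: fewer samples are answered, but the retained ones are the model's most confident.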

Performance Suite

Runs SCM microbenchmarks (arithmetic, layer forward/backward, inference-gap thresholds, etc.) and saves a timestamped JSON artifact:

python benchmarks/run_benchmarks.py --suite all --output benchmark_results

Output: benchmark_results/benchmarks_<timestamp>.json
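The artifact can be post-processed with plain JSON tooling. The sketch below assumes a `{"results": [{"name": ..., "times_s": [...]}]}` layout, which is purely illustrative; check the real artifact for its actual schema before relying on any field names.

```python
import json

def summarize_timings(artifact_path: str) -> dict:
    """Reduce per-benchmark timing samples to a mean in seconds.
    The schema assumed here is hypothetical -- inspect a real artifact first."""
    with open(artifact_path) as f:
        data = json.load(f)
    return {r["name"]: sum(r["times_s"]) / len(r["times_s"])
            for r in data["results"]}
```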

CI Regression Gate

Validates a benchmark artifact directory (or a single JSON file) and fails fast on malformed entries or obviously slow results:

python scripts/ci/benchmark_gate.py benchmark_results
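The core of such a gate is small. This is a minimal sketch, not the script's actual logic: the entry fields (`name`, `mean_s`) and the slowness budget are assumed for illustration.

```python
def gate(entries: list[dict], max_mean_s: float = 1.0) -> list[str]:
    """Return failure messages for malformed or obviously slow entries.
    An empty list means the gate passes."""
    failures = []
    for e in entries:
        name = e.get("name")
        mean = e.get("mean_s")
        if not name or not isinstance(mean, (int, float)):
            failures.append(f"malformed entry: {e!r}")
        elif mean > max_mean_s:
            failures.append(f"{name}: mean {mean:.3f}s exceeds {max_mean_s:.3f}s")
    return failures
```

A CI wrapper would exit non-zero when `gate(...)` returns any failures, so regressions surface immediately.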

Baseline Artifact (Optional)

Copies the newest benchmark JSON in a results directory to benchmarks/baseline.json:

python scripts/update_benchmark_baseline.py --src benchmark_results
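The "newest JSON" selection can be sketched with the standard library. This is an assumed implementation, keyed on file modification time; the real script may select by the timestamp embedded in the filename instead.

```python
import shutil
from pathlib import Path

def update_baseline(src_dir: str, dest: str = "benchmarks/baseline.json") -> Path:
    """Copy the most recently modified benchmarks_*.json in src_dir to dest."""
    candidates = sorted(Path(src_dir).glob("benchmarks_*.json"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no benchmark JSON found in {src_dir}")
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(candidates[-1], dest_path)
    return dest_path
```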