# Benchmarks
This page summarizes the benchmark utilities for the SCM stack.
## Domain benchmarks (DOSE / RF / IK)

Run the shipped, auditable benchmark domains via the first-class harness:

```shell
python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu
```
Reproduction landing page: 21_experiments.md.
Each run writes a self-contained directory under `results/benchmarks/<domain>/` with:

- `manifest.json`
- `provenance.json` (versioned schema)
- `aggregated/summary.json` + `aggregated/summary.md`
- `aggregated/paired_stats.json` + `aggregated/paired_stats.md`
- `CLAIM_AUDIT.md`
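A quick way to sanity-check a run directory is to confirm the artifact list above is present. A minimal sketch, assuming only the layout listed above (the helper name is illustrative, not part of the package):

```python
from pathlib import Path

# Artifact layout taken from the list above.
EXPECTED_ARTIFACTS = [
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
    "aggregated/summary.md",
    "aggregated/paired_stats.json",
    "aggregated/paired_stats.md",
    "CLAIM_AUDIT.md",
]

def missing_artifacts(run_dir):
    """Return the expected artifact paths absent from a benchmark run directory."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).exists()]
```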
## Operating points in benchmarks
The DOSE domain exposes a clear operating-point trade-off:

- `safety_first`: prioritize a low false in-range rate on censored examples (`false_in_range_rate_on_censored`)
- `accuracy_first`: prioritize a low false censored rate on in-range examples (`false_censored_rate_on_in_range`)
- `direction_aware`: use the strict-gate + direction-head pattern to make bottom outputs actionable
Use a τ_infer sweep on cached |Q| (or a calibration set) to set the actual thresholds; see docs/06_inference_deployment.md.
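Such a sweep can be sketched as follows, assuming cached |Q| magnitudes and a boolean censored label per example; the function name, and the convention that |Q| below the threshold signals a censored output, are illustrative assumptions rather than the library's API:

```python
def sweep_tau(abs_q, is_censored, taus):
    """For each candidate threshold tau, predict 'censored' when |Q| < tau and
    report the two operating-point error rates described above."""
    rows = []
    n_cens = sum(is_censored)
    n_in = len(is_censored) - n_cens
    for tau in taus:
        pred_cens = [q < tau for q in abs_q]
        # Censored examples the model wrongly treats as in-range.
        false_in_range_on_censored = sum(
            (not p) and c for p, c in zip(pred_cens, is_censored)
        ) / max(n_cens, 1)
        # In-range examples the model wrongly treats as censored.
        false_censored_on_in_range = sum(
            p and (not c) for p, c in zip(pred_cens, is_censored)
        ) / max(n_in, 1)
        rows.append((tau, false_in_range_on_censored, false_censored_on_in_range))
    return rows
```

Picking the operating point then amounts to choosing the row whose error pair matches the priority (`safety_first` vs `accuracy_first`).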
## Parity Runner (Coverage vs Accuracy)

Runs a small synthetic regression task and writes a coverage/accuracy curve:

```shell
python benchmarks/parity_runner.py --output benchmark_results
```
Outputs:

- `benchmark_results/parity_report.json`
- `benchmark_results/coverage_accuracy.png`
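At its core, a coverage/accuracy curve is a threshold sweep over an abstain score. A minimal sketch of that computation, assuming per-example correctness flags and confidence scores (hypothetical inputs; the runner's actual internals are not shown here):

```python
def coverage_accuracy_curve(correct, scores, thresholds):
    """At each threshold, keep predictions with score >= t.
    Coverage is the kept fraction; accuracy is the mean correctness among kept."""
    points = []
    n = len(correct)
    for t in thresholds:
        kept = [c for c, s in zip(correct, scores) if s >= t]
        coverage = len(kept) / n
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        points.append((t, coverage, accuracy))
    return points
```

Raising the threshold trades coverage for accuracy, which is exactly the curve the runner plots.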
## Performance Suite

Runs SCM microbenchmarks (arithmetic, layer forward/backward, inference-gap thresholds, etc.) and saves a timestamped JSON artifact:

```shell
python benchmarks/run_benchmarks.py --suite all --output benchmark_results
```

Output: `benchmark_results/benchmarks_<timestamp>.json`
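The timestamped-artifact pattern can be sketched as below; only the `benchmarks_<timestamp>.json` naming follows this page, while the payload schema and helper names are illustrative assumptions:

```python
import json
import time
import timeit
from pathlib import Path

def time_op(fn, repeat=5, number=1000):
    """Best-of-repeat wall-clock time per call, in seconds."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

def save_benchmarks(results, out_dir="benchmark_results"):
    """Write a timing dict to <out_dir>/benchmarks_<timestamp>.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"benchmarks_{time.strftime('%Y%m%d_%H%M%S')}.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```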
## CI Regression Gate

Validates a benchmark artifact directory/JSON and fails fast on malformed entries or obviously slow results:

```shell
python scripts/ci/benchmark_gate.py benchmark_results
```
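In spirit, a gate like this walks the artifacts, rejects malformed JSON, and flags timings over a cap. A hedged sketch, not the script's actual logic (the `{name: seconds}` entry shape and the cap value are assumptions):

```python
import json
from pathlib import Path

def gate(artifact_dir, max_seconds=1.0):
    """Return a list of problems found in benchmarks_*.json artifacts.
    An empty list means the gate passes."""
    problems = []
    for path in sorted(Path(artifact_dir).glob("benchmarks_*.json")):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            problems.append(f"{path.name}: malformed JSON")
            continue
        for name, seconds in data.items():
            if not isinstance(seconds, (int, float)):
                problems.append(f"{path.name}:{name}: non-numeric timing")
            elif seconds > max_seconds:
                problems.append(f"{path.name}:{name}: {seconds}s exceeds {max_seconds}s")
    return problems
```

A CI wrapper would exit nonzero whenever the returned list is non-empty.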
## Baseline Artifact (Optional)

Copies the newest benchmark JSON in a results directory to `benchmarks/baseline.json`:

```shell
python scripts/update_benchmark_baseline.py --src benchmark_results
```
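The copy step reduces to "newest matching file wins". A minimal sketch under that assumption (the helper name and the `benchmarks_*.json` glob are illustrative, chosen to match the naming above):

```python
import shutil
from pathlib import Path

def update_baseline(src_dir, baseline="benchmarks/baseline.json"):
    """Copy the newest benchmarks_*.json under src_dir to the baseline path."""
    candidates = sorted(Path(src_dir).glob("benchmarks_*.json"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no benchmarks_*.json under {src_dir}")
    dest = Path(baseline)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(candidates[-1], dest)  # newest by modification time
    return dest
```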