Experiments & Reproduction

This page front-loads the current, supported reproduction entry points (benchmarks and the reference deployment) and keeps older paper-era workflows in a clearly labeled archive section.

Current reproduction entry points (recommended)

1) Benchmark harness (DOSE / RF / IK)

Run the shipped, auditable benchmark domains via the first-class harness:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks rf --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks ik --mode smoke --device cpu --seeds 1

Expected (qualitative) runtime:

- smoke: minutes on CPU (machine-dependent).
- paper: longer; intended for scheduled or manual runs.

Expected outputs:

- A run directory under results/benchmarks/<domain>/run_<timestamp>_<sha>/ containing:
  - manifest.json
  - provenance.json
  - aggregated/summary.json + aggregated/summary.md
  - aggregated/paired_stats.json + aggregated/paired_stats.md
  - CLAIM_AUDIT.md
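As a quick sanity check after a run, the expected layout above can be verified programmatically. This is a minimal sketch, not part of the harness; `missing_artifacts` is a hypothetical helper name:

```python
from pathlib import Path

# Expected artifacts in a benchmark run directory, taken from the list above.
# Paths are relative to the run directory.
EXPECTED_ARTIFACTS = [
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
    "aggregated/summary.md",
    "aggregated/paired_stats.json",
    "aggregated/paired_stats.md",
    "CLAIM_AUDIT.md",
]

def missing_artifacts(run_dir: str) -> list:
    """Return the expected artifacts that are absent from run_dir."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

For example, `missing_artifacts("results/benchmarks/dose/run_<timestamp>_<sha>")` returning an empty list indicates a complete run directory.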

2) Reference deployment (robotics RR IK)

End-to-end reference path (train → bundle → strict inference → fallback → operator report):

python scripts/reference_robotics_deployment.py --device cpu --epochs 1 --n-samples 2000

Expected outputs:

- A run directory under results/reference_deploy_robotics/ containing at least:
  - bundle/model.onnx
  - bundle/metadata.json
  - VALIDATION_REPORT.md
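A downstream consumer of the bundle can confirm these artifacts before loading the model. This is a hedged sketch: `load_bundle_metadata` is a hypothetical helper, and no particular metadata schema is assumed:

```python
import json
from pathlib import Path

def load_bundle_metadata(run_dir: str) -> dict:
    """Confirm the deployment artifacts listed above exist, then parse
    bundle/metadata.json. The metadata schema itself is not assumed here."""
    root = Path(run_dir)
    for rel in ("bundle/model.onnx", "bundle/metadata.json", "VALIDATION_REPORT.md"):
        if not (root / rel).is_file():
            raise FileNotFoundError(f"missing expected artifact: {rel}")
    return json.loads((root / "bundle" / "metadata.json").read_text())
```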

Comparing runs across commits (provenance-first)

Each benchmark run writes a provenance.json that records the git commit SHA, package versions, arguments, dataset identifiers, and metric definitions for that run.
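Because two runs' provenance files pin down everything that could differ between them, a generic diff is often enough to explain a metric change. A minimal sketch (the exact key names inside provenance.json are not assumed, so the comparison is over whatever top-level fields the harness records):

```python
import json

def provenance_diff(path_a: str, path_b: str) -> dict:
    """Report top-level fields that differ between two provenance.json files,
    mapping each differing key to its (run_a, run_b) value pair."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {k: (a.get(k), b.get(k)) for k in set(a) | set(b) if a.get(k) != b.get(k)}
```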

To compare a new run against a prior baseline, pass the baseline run's aggregated/summary.json:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1 \
  --baseline-summary results/benchmarks/dose/<baseline_run>/aggregated/summary.json

With a baseline supplied, aggregated/paired_stats.json and aggregated/paired_stats.md additionally report deltas against that baseline.
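If you want to compute deltas yourself from two summary.json files (for example, across commits the harness was never pointed at), the comparison can be sketched generically. The summary schema is not specified here, so this hypothetical helper walks arbitrary nested dicts rather than assuming particular metric names:

```python
def metric_deltas(new: dict, baseline: dict, prefix: str = "") -> dict:
    """Recursively compute new - baseline for every numeric leaf present
    in both summaries, keyed by a dotted path such as 'aggregate.rmse'."""
    deltas = {}
    for key, val in new.items():
        path = f"{prefix}{key}"
        base = baseline.get(key)
        if isinstance(val, dict) and isinstance(base, dict):
            deltas.update(metric_deltas(val, base, prefix=path + "."))
        elif isinstance(val, (int, float)) and isinstance(base, (int, float)):
            deltas[path] = val - base
    return deltas
```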

Archived paper-era workflows

The following scripts remain in the repository for historical reference but are not the primary reproduction path:

- scripts/run_paper_260116.sh
- scripts/EXPERIMENTAL_SUITE_README.md
- scripts/paper_vps/ (older claim-audit and paper-specific post-processing)

Prefer the benchmark harness and reference deployment above for current, auditable runs.