Experiments & Reproduction¶
This page lists the current, supported reproduction entry points first (the benchmark harness and the reference deployment), and keeps older paper-era workflows in a clearly labeled archive section at the end.
Current reproduction entry points (recommended)¶
1) Benchmark harness (DOSE / RF / IK)¶
Run the shipped, auditable benchmark domains via the first-class harness:
```bash
python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks rf   --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks ik   --mode smoke --device cpu --seeds 1
```
Expected (qualitative) runtime:
- smoke: minutes on CPU (depends on machine).
- paper: longer, intended for scheduled/manual runs.
Expected outputs:
- A run directory under results/benchmarks/<domain>/run_<timestamp>_<sha>/ containing:
  - manifest.json
  - provenance.json
  - aggregated/summary.json + aggregated/summary.md
  - aggregated/paired_stats.json + aggregated/paired_stats.md
  - CLAIM_AUDIT.md
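Given the layout above, a small helper can locate the newest run directory and load its aggregated metrics. This is a minimal sketch (the helpers `latest_run` and `load_summary` are hypothetical, not shipped with the harness), relying only on the lexicographic-sorts-chronologically property of the `run_<timestamp>_<sha>` naming:

```python
import json
from pathlib import Path

def latest_run(domain_root):
    """Return the newest run_<timestamp>_<sha> directory under a domain root.

    The timestamp prefix in the directory name makes a plain lexicographic
    sort chronological, so the last entry is the most recent run.
    """
    runs = sorted(p for p in Path(domain_root).glob("run_*") if p.is_dir())
    return runs[-1] if runs else None

def load_summary(run_dir):
    """Parse aggregated/summary.json from a benchmark run directory."""
    return json.loads((Path(run_dir) / "aggregated" / "summary.json").read_text())

# Illustrative usage (paths depend on your local runs):
# run = latest_run("results/benchmarks/dose")
# metrics = load_summary(run)
```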
2) Reference deployment (robotics RR IK)¶
End-to-end reference path (train → bundle → strict inference → fallback → operator report):
```bash
python scripts/reference_robotics_deployment.py --device cpu --epochs 1 --n-samples 2000
```
Expected outputs:
- A run directory under results/reference_deploy_robotics/ containing at least:
  - bundle/model.onnx
  - bundle/metadata.json
  - VALIDATION_REPORT.md
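A quick completeness check against the expected-outputs list above can be scripted in a few lines. This is a hypothetical sketch (the `check_deploy_run` helper is not part of the repository), useful before shipping or archiving a run:

```python
from pathlib import Path

# Artifacts the reference deployment is expected to produce (per the list above).
REQUIRED = ("bundle/model.onnx", "bundle/metadata.json", "VALIDATION_REPORT.md")

def check_deploy_run(run_dir):
    """Return the required artifacts missing from a deployment run directory.

    An empty list means the run directory looks complete.
    """
    run_dir = Path(run_dir)
    return [rel for rel in REQUIRED if not (run_dir / rel).exists()]
```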
Comparing runs across commits (provenance-first)¶
Each benchmark run writes a provenance.json that records the git commit SHA, package versions, arguments,
dataset identifiers, and metric definitions for that run.
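Because each run carries its own provenance record, two runs can be diffed field-by-field before comparing metrics. The sketch below assumes provenance.json parses to a flat JSON object (the `provenance_diff` helper is hypothetical, not part of the harness):

```python
def provenance_diff(old, new):
    """Diff two parsed provenance.json dicts at the top level.

    Returns {key: (old_value, new_value)} for every field whose value
    differs between the runs; a field absent from one side shows as None.
    """
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}
```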
To compare a new run against a prior baseline, pass the baseline run's aggregated/summary.json:
```bash
python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1 \
  --baseline-summary results/benchmarks/dose/<baseline_run>/aggregated/summary.json
```
With a baseline supplied, aggregated/paired_stats.json and aggregated/paired_stats.md additionally report deltas against that baseline.
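In simplified form, a baseline delta is just new-minus-baseline for each shared numeric metric. The sketch below illustrates the idea only (the `baseline_deltas` helper and the flat metric layout are assumptions; the harness's actual paired-stats schema may differ):

```python
def baseline_deltas(new_summary, baseline_summary):
    """Compute new - baseline for metrics present in both summaries.

    Only keys whose values are numeric on both sides are compared;
    non-numeric fields (names, paths, etc.) and unshared keys are skipped.
    """
    deltas = {}
    for key, new_val in new_summary.items():
        base_val = baseline_summary.get(key)
        if isinstance(new_val, (int, float)) and isinstance(base_val, (int, float)):
            deltas[key] = new_val - base_val
    return deltas
```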
Archive / prior paper workflow (not recommended)¶
The following scripts remain in-repo for historical reference, but are not the primary reproduction path:
- scripts/run_paper_260116.sh
- scripts/EXPERIMENTAL_SUITE_README.md
- scripts/paper_vps/ (older claim-audit and paper-specific post-processing)
Prefer the benchmark harness and reference deployment above for current, auditable runs.