Experiments & Reproduction
This page front-loads the current, supported reproduction entry points (benchmarks + reference deployment), and keeps older paper-era workflows in a clearly labeled archive section.
For the shortest published-docs rerun checklist, start at 31_reproduce_the_paper.md.
The frozen replay contract for the current paper-facing runs lives in artifacts/paper_2026/.
It records the exact commands, config snapshots, expected output paths, tolerance bands,
SHA256 inputs, and pinned CPU/GPU container recipes for the supported reruns below.
Native CPU reruns use requirements-paper.txt / requirements-bench.txt; GPU
reruns should use the pinned container recipe.
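The SHA256 inputs recorded in the bundle can be checked before a rerun. The sketch below is a minimal standard-library illustration of that check: it hashes files and compares them against an expected mapping. The mapping format (relative path to hex digest) and the helper names `sha256_of` and `find_mismatches` are assumptions made for this example; the actual record layout inside artifacts/paper_2026/ is not described here, so the loading step would need to be adapted to it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root):
    """List relative paths that are missing or whose digest differs.

    `expected` is a hypothetical {relative_path: hex_digest} mapping.
    """
    mismatches = []
    for rel_path, want in sorted(expected.items()):
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != want.lower():
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `find_mismatches` means every listed input is present and byte-identical to the recorded digest.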
Top-level shortcuts mirror the frozen paper bundle:
make reproduce-paper
make reproduce-dose
make reproduce-rf
make reproduce-ik
make reproduce-reference-robotics
Override the default output root or device when needed, for example
make reproduce-dose REPRO_ROOT=results/tmp_repro REPRO_DEVICE=cpu.
If you need an isolated rerun environment, build one of the pinned container recipes
listed in artifacts/paper_2026/manifest.json and use the repo checkout as the
mounted /workspace volume inside that container.
Current reproduction entry points (recommended)
1) Scientific claim benchmark harness (DOSE / RF / IK)
Run the shipped, auditable benchmark domains via the first-class harness:
python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks rf --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks ik --mode smoke --device cpu --seeds 1
Resume an interrupted paper-mode run with the same --out-root:
python -m zeroproofml.benchmarks dose --mode paper --device cpu \
--out-root results/benchmarks/dose/<run_dir> --resume
For explicit existing-directory reuse, use --skip-complete-seeds to fill only
missing canonical seed results, or --force-rerun to rerun the requested seeds
inside the same --out-root.
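When the harness is driven from a scheduler or a Python script, it can help to build the command line programmatically. The following is only a sketch: `harness_command` is a hypothetical helper, the run directory name `my_run` is a placeholder, and the helper uses only the flags shown above (`--mode`, `--device`, `--out-root`, `--resume`).

```python
import shlex
import sys


def harness_command(domain, mode="smoke", device="cpu", out_root=None, resume=False):
    """Build the argv list for one benchmark-harness invocation."""
    argv = [sys.executable, "-m", "zeroproofml.benchmarks", domain,
            "--mode", mode, "--device", device]
    if out_root is not None:
        argv += ["--out-root", str(out_root)]
    if resume:
        argv.append("--resume")
    return argv


# "my_run" is a placeholder for an existing run directory name.
cmd = harness_command("dose", mode="paper",
                      out_root="results/benchmarks/dose/my_run", resume=True)
print(shlex.join(cmd))
# To launch it for real: subprocess.run(cmd, check=True)
```

Passing an argv list (rather than a shell string) to `subprocess.run` avoids quoting problems when the output root contains spaces.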
Add --html-report when you also want a browser-friendly RUN_REPORT.html
alongside the default RUN_REPORT.md.
To rebuild RUN_REPORT.md or add RUN_REPORT.html from an existing run
directory, use:
python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report
The report command also writes figures/metric_summary.svg,
figures/per_seed_metric_distributions.svg, and, when paired baseline stats
exist, figures/baseline_delta.svg; for DOSE runs with
aggregated/dose_diagnostics.json, it additionally writes
figures/dose_figure_pack/ with threshold sweeps, macro-F1 vs finite-MAE
trade-offs, censor-direction confusion plots, assay-limit edge cases, and
operating-point bottom-provenance histograms when provenance split metrics are
available. Optional HTML reports embed those figures. Regenerated reports use
artifact-relative paths and refresh aggregated/summary.md from
aggregated/summary.json, so copied run directories produce the same report
content. When fault_rate / semantic_bottom_rate provenance metrics are
present, the standard report adds a fault-vs-semantic bottom breakdown.
Pass --baseline-run-dir <previous_run_dir> when the paired stats should be
recomputed against another benchmark run. The report CLI can also summarize
deployment bundles with python -m zeroproofml.report bundle <bundle_dir> and
JSONL training logs with python -m zeroproofml.report training-log <metrics.jsonl>.
Those modes write VALIDATION_REPORT.summary.svg and
<stem>_REPORT_metrics.svg, respectively. If Plotly is installed via
zeroproofml[interactive], the training-log path also writes
<stem>_REPORT.html with interactive metric traces. Bundle reports also
discover parent benchmark summaries and DOSE operating-point calibration
sidecars when the bundle lives inside a benchmark artifact directory.
Expected (qualitative) runtime:
- smoke: minutes on CPU (depends on machine).
- paper: longer, intended for scheduled/manual runs.
In GitLab CI, merge request pipelines keep these claim-benchmark:* jobs in smoke
mode; paper mode is exposed only through scheduled runs and manual jobs on the
default branch. That same scheduled/manual lane also runs a curated ONNX export
compatibility matrix across pinned Torch/ONNX/onnxruntime stacks, covering
tests/inference/test_export_bundle.py and
tests/inference/test_export_compatibility.py to catch bundle export/runtime drift.
Expected outputs:
- A run directory under results/benchmarks/<domain>/run_<timestamp>_<sha>/ containing:
- seed_*/per_seed_result.json (canonical per-seed raw result payloads)
- seed_*/rf_signal_traces.json (RF only; representative per-seed figure-regeneration traces)
- seed_*/rf_qualitative_figure_pack/ (RF only; targeted washout / invented-peak SVGs plus manifest/README)
- seed_*/dose_diagnostics.json (DOSE only)
- figures/dose_figure_pack/ (DOSE report regeneration only; domain-specific SVG pack plus manifest/README)
- manifest.json (versioned schema)
- provenance.json (versioned schema)
- resume_state.json (paper-mode resume state + attempt history cache)
- RUN_REPORT.md (RUN_REPORT.html too when --html-report is set)
- aggregated/summary.json + aggregated/summary.md
- aggregated/benchmark_metrics.jsonl
- aggregated/dose_operating_points.json + aggregated/dose_operating_points.md (DOSE only)
- aggregated/dose_pareto_front.json + aggregated/dose_pareto_front.md (DOSE only)
- aggregated/dose_diagnostics.json (DOSE only)
- aggregated/dirhead_only_summary.json (DOSE --mode frozen-dirhead only)
- aggregated/paired_stats.json + aggregated/paired_stats.md
- CLAIM_AUDIT.md
- figures/metric_summary.svg, figures/per_seed_metric_distributions.svg, and optional figures/baseline_delta.svg
For the full artifact glossary and schema marker map, see 40_benchmark_artifact_reference.md.
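A quick structural check of a finished run directory can be sketched with the standard library alone. The helper below is illustrative, not part of zeroproofml: `audit_run_dir` and `CORE_ARTIFACTS` are hypothetical names, and the example assumes that the domain-independent files from the list above are written by every run regardless of mode.

```python
from pathlib import Path

# Domain-independent files named in the expected-outputs list.
# ASSUMPTION: every run writes all of these, whatever the domain or mode.
CORE_ARTIFACTS = (
    "manifest.json",
    "provenance.json",
    "RUN_REPORT.md",
    "CLAIM_AUDIT.md",
    "aggregated/summary.json",
    "aggregated/summary.md",
    "aggregated/benchmark_metrics.jsonl",
)


def audit_run_dir(run_dir):
    """Return (missing core artifacts, seed dirs lacking per_seed_result.json)."""
    run_dir = Path(run_dir)
    missing = [rel for rel in CORE_ARTIFACTS if not (run_dir / rel).is_file()]
    incomplete = sorted(
        seed.name
        for seed in run_dir.glob("seed_*")
        if seed.is_dir() and not (seed / "per_seed_result.json").is_file()
    )
    return missing, incomplete
```

Seed directories reported as incomplete are the ones a later `--skip-complete-seeds` rerun would be expected to fill in.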
For DOSE runs, the diagnostics JSONs capture the richer artifact pack used to
inspect operating points: confusion matrices, tau_infer sweep curves,
denominator |Q| histograms, borderline threshold examples, and
censored-subset direction-quality summaries. Report regeneration turns those
diagnostics into a standardized DOSE figure pack under figures/dose_figure_pack/.
The harness also writes dose_pareto_front.{json,md} automatically so the
safety/coverage/direction trade-off summary is generated from the same aggregate
metrics as the operating-point report.
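For readers unfamiliar with Pareto fronts, the idea can be sketched as a generic non-dominated filter over operating points. This is not the harness's implementation: the metric names, values, and the helpers `dominates` and `pareto_front` are hypothetical, and the criteria actually used for dose_pareto_front.json may differ.

```python
def dominates(a, b, maximize):
    """True if a is at least as good as b on every metric and better on one."""
    at_least = all(a[k] >= b[k] if up else a[k] <= b[k] for k, up in maximize.items())
    strictly = any(a[k] > b[k] if up else a[k] < b[k] for k, up in maximize.items())
    return at_least and strictly


def pareto_front(points, maximize):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p, maximize) for q in points)]


# Hypothetical operating points; names and numbers are illustrative only.
points = [
    {"tau": 0.1, "unsafe_accept_rate": 0.08, "coverage": 0.99},
    {"tau": 0.3, "unsafe_accept_rate": 0.02, "coverage": 0.95},
    {"tau": 0.5, "unsafe_accept_rate": 0.02, "coverage": 0.90},
    {"tau": 0.7, "unsafe_accept_rate": 0.00, "coverage": 0.70},
]
# Lower unsafe-accept rate is better (False), higher coverage is better (True).
front = pareto_front(points, {"unsafe_accept_rate": False, "coverage": True})
print([p["tau"] for p in front])  # [0.1, 0.3, 0.7]
```

The point at tau 0.5 drops out because tau 0.3 matches its unsafe-accept rate while offering higher coverage.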
For RF runs, rf_signal_traces.json saves deterministic in-band plus extrapolation
frequency-response traces, and rf_frequency_response.svg turns those saved
traces into a committed Bode-style artifact so later plotting does not need to
retrain or reconstruct ad hoc trace selections; the saved figure also overlays
peak annotations, strict-trigger regions, and shared-denominator minima for the
traced models.
Each seed now also writes rf_qualitative_figure_pack/, which extracts saved
baseline failure cases into a small pack of washout and invented-peak SVGs for
paper writing and debugging.
2) Reference deployment (robotics RR IK)
End-to-end reference path (train → bundle → strict inference → fallback → operator report):
python scripts/reference_robotics_deployment.py --device cpu --epochs 1 --n-samples 2000
Expected outputs:
- A run directory under results/reference_deploy_robotics/ containing at least:
- output_contract.json (versioned reference-deployment artifact contract)
- inference_summary.json (basic strict rates plus hybrid_path_metrics, fallback_family_benchmark, and SCM-only vs unconstrained vs hybrid deltas; when diagnostic provenance is available it also records provenance_routing_comparison, provenance_routing_materiality, and provenance_benchmark_evidence)
- bundle/model.onnx
- bundle/metadata.json
- bundle/VALIDATION_REPORT.md
- bundle/VALIDATION_REPORT.summary.json
The same flow is also importable from Python via
zeroproofml.reference_robotics_deployment.run_reference_robotics_deployment(...),
and completed run directories can be reloaded with
load_reference_robotics_deployment_artifacts(...) when CI or downstream tooling
needs structured artifact objects instead of manually inspecting the directory
layout. load_reference_robotics_deployment_artifacts(...) now validates the
versioned output_contract.json when present and falls back to the legacy fixed
layout for older runs.
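The contract-first, legacy-fallback behavior described above can be pictured with a small standalone sketch. It is not the loader's code: the `"artifacts"` key is a hypothetical contract schema invented for this example, the helper names are illustrative, and the legacy list simply mirrors the expected-outputs list for this section.

```python
import json
from pathlib import Path

# Fixed layout taken from the expected-outputs list for the reference deployment.
LEGACY_LAYOUT = (
    "inference_summary.json",
    "bundle/model.onnx",
    "bundle/metadata.json",
    "bundle/VALIDATION_REPORT.md",
    "bundle/VALIDATION_REPORT.summary.json",
)


def expected_artifacts(run_dir):
    """Use the contract's file list when present, else the legacy layout."""
    contract_path = Path(run_dir) / "output_contract.json"
    if contract_path.is_file():
        contract = json.loads(contract_path.read_text(encoding="utf-8"))
        # ASSUMPTION: hypothetical "artifacts" key holding relative paths;
        # the real contract schema is not described on this page.
        return list(contract["artifacts"])
    return list(LEGACY_LAYOUT)


def missing_artifacts(run_dir):
    """Return the expected relative paths that are absent from run_dir."""
    return [rel for rel in expected_artifacts(run_dir)
            if not (Path(run_dir) / rel).is_file()]
```

For real runs, prefer load_reference_robotics_deployment_artifacts(...), which performs the actual validation.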
3) Synthetic trajectory-evaluation dataset (robotics RR IK)
Generate a stratified trajectory dataset for rollout or control-loop evaluation:
python scripts/generate_reference_robotics_trajectory_data.py \
--n-trajectories 48 --steps-per-trajectory 16
Expected outputs:
- results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset.json
- results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset_stress_tests/*.json
The generated JSON is stratified over four workspace quadrants and three
|det(J)| proximity bins (near, mid, far). Each trajectory stores
state-transition records with the current joint state, target workspace
waypoint, reference DLS action, next joint state, and per-step singularity
diagnostics. The generator also materializes per-stratum stress-test subset
JSONs, one for each (workspace_region, |det(J)| bin) pair, so targeted
rollout checks can be run without writing custom dataset filters.
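The stratification can be illustrated with the textbook planar two-link (RR) arm, for which |det(J)| = l1 * l2 * |sin(theta2)|. The sketch below is an assumption-laden illustration, not the generator's code: the bin edges, unit link lengths, quadrant labels, and the reading of "near" as "close to a singularity" (small |det(J)|) are all choices made for this example.

```python
import math


def rr_forward(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link (RR) arm."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def rr_abs_det_j(theta2, l1=1.0, l2=1.0):
    """|det(J)| of the planar RR position Jacobian: l1 * l2 * |sin(theta2)|."""
    return abs(l1 * l2 * math.sin(theta2))


def stratum(theta1, theta2, edges=(0.1, 0.5)):
    """Return (quadrant, bin); the edges are illustrative, not the generator's."""
    x, y = rr_forward(theta1, theta2)
    quadrant = ("Q1" if y >= 0 else "Q4") if x >= 0 else ("Q2" if y >= 0 else "Q3")
    det = rr_abs_det_j(theta2)
    det_bin = "near" if det < edges[0] else "mid" if det < edges[1] else "far"
    return quadrant, det_bin


print(stratum(2.0, 0.05))  # ('Q2', 'near'): almost fully stretched arm
```

Four quadrants times three bins gives the twelve (workspace_region, |det(J)| bin) pairs that the per-stratum stress-test files correspond to.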
To evaluate a policy in closed loop against that dataset, use the importable trajectory evaluator:
from zeroproofml.reference_robotics_trajectory_eval import (
    evaluate_reference_robotics_trajectory_policy,
    make_reference_robotics_dls_policy,
)
summary = evaluate_reference_robotics_trajectory_policy(
    "results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset.json",
    make_reference_robotics_dls_policy(damping=0.05),
)
print(summary["aggregate"]["mean_tracking_error"])
print(summary["by_start_singularity_bin"]["near"]["fallback_rate"])
print(summary["aggregate"]["joint_limit_violation_rate"])
Custom policies receive the current RR state, target waypoint, closed-loop
command, and reference DLS action for each step, so batch point predictors can
be replayed as trajectory-level controllers instead of being measured only on
independent points. The returned summary also includes
fallback_rate_by_step_index traces plus aggregate/per-bin
joint_limit_violation_*, chattering_event_*, and
policy_latency_budget_violation_* fields. If a policy tags fallback kinds
with fault or semantic, those traces also expose provenance-aware splits.
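As background for the reference DLS action, here is a textbook damped-least-squares step for the planar RR arm, dq = J^T (J J^T + damping^2 I)^-1 e. It is a generic sketch under stated assumptions (unit link lengths, 2x2 position Jacobian, hypothetical helper names) and is not claimed to match the internals of make_reference_robotics_dls_policy.

```python
import math


def rr_jacobian(t1, t2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm at joint angles (t1, t2)."""
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def dls_step(jac, error, damping=0.05):
    """Return dq = J^T (J J^T + damping^2 I)^-1 error for a 2x2 Jacobian."""
    (a, b), (c, d) = jac
    lam2 = damping * damping
    # M = J J^T + damping^2 I (symmetric 2x2).
    m00 = a * a + b * b + lam2
    m01 = a * c + b * d
    m11 = c * c + d * d + lam2
    det = m00 * m11 - m01 * m01  # strictly positive whenever damping != 0
    # y = M^-1 error, using the closed-form 2x2 inverse.
    y0 = (m11 * error[0] - m01 * error[1]) / det
    y1 = (-m01 * error[0] + m00 * error[1]) / det
    # dq = J^T y
    return [a * y0 + c * y1, b * y0 + d * y1]


# At the stretched-out singularity (t2 = 0) the step stays finite.
print(dls_step(rr_jacobian(0.0, 0.0), (0.0, 1.0)))
```

The damping term trades a small tracking bias away from singularities for bounded joint steps near them, which is why the |det(J)| bins matter when reading fallback and tracking-error rates.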
For workspace plots, zeroproofml.utils.viz.plot_workspace_rate_heatmaps(...)
bins 2D workspace points into bottom/gap/fallback rate heatmaps and colors
bottom/fallback panels by fault-vs-semantic provenance when those diagnostics
are available. plot_2d_mask_map(...) and plot_3d_mask_map(...) scatter raw
bottom/gap decisions over 2D or 3D sample coordinates,
plot_route_to_solver_overlay(...) provides the corresponding route-vs-reject
scatter overlay for fallback decisions, plot_fallback_route_timeline(...)
plots per-batch route/reject rates with provenance tags,
plot_monitoring_batch_summary(...) plots exported monitor batch summaries, and
plot_detj_stratified_metrics(...) turns per-step fallback masks or tracking
errors into |det(J)| bucket summaries that line up with the dataset's
near / mid / far strata.
Regenerating an IK benchmark report also writes
figures/robotics_figure_pack/ from the saved RR IK seed artifacts when they
are present, covering workspace heatmaps, |det(J)| error/fallback plots,
route-to-analytic-solver maps, and fallback timelines.
Downstream pipeline simulator (composability)
For multi-step composability experiments, use the importable downstream pipeline simulator:
from pathlib import Path
from zeroproofml.downstream_pipeline import (
    DownstreamPipelineReferenceSample,
    DownstreamPipelineSample,
    build_downstream_pipeline_simulator,
    compare_downstream_pipeline_strategies,
    write_downstream_pipeline_report,
    write_downstream_pipeline_visualization_pack,
)
simulator = build_downstream_pipeline_simulator(
    "5-step",
    drop_reject_flag_probability=0.05,
    bad_downstream_behaviors=("json_roundtrip", "aggregate_mean"),
    default_fill_value=0.0,
    roundtrip_digits=4,
)
result = simulator.simulate(
    [
        DownstreamPipelineSample(
            decoded=(float("nan"),),
            reject_flag=True,
            provenance="semantic",
            sample_id="strict_bottom",
        ),
    ]
)
print(result.stage_summaries[-1].unsafe_finite_accept_rate)
print(result.stage_summaries[-1].downstream_decision_accuracy_rate)
comparison = compare_downstream_pipeline_strategies(
    [
        DownstreamPipelineReferenceSample(
            decoded=(-3.0,),
            should_reject=True,
            provenance="semantic",
            direction_label="below",
            sample_id="censored_low",
        ),
    ],
    simulator,
)
print(
    comparison.by_strategy()["strict SCM + direction head"]
    .stage_summaries[-1]
    .direction_label_fidelity_rate
)
report_md, report_html = write_downstream_pipeline_report(
    Path("artifacts/composability"),
    result=comparison,
    include_html=True,
)
figure_pack = write_downstream_pipeline_visualization_pack(
    Path("artifacts/composability"),
    result=comparison,
)
print(report_md, report_html)
print(figure_pack["stage_transition_diagram"])
The built-in 1-step, 3-step, and 5-step variants provide a small
experimental harness for studying whether NaN-carried bottoms, reject flags, and
provenance labels survive downstream handoffs. compare_downstream_pipeline_strategies(...)
now runs the same simulator across the built-in scalar-only baseline,
abstention/uncertainty baseline, strict SCM, and strict SCM + direction-head
encodings. Each stage summary records finite-payload rates, reject-flag
fidelity, propagated reject/censor/singularity fidelity, downstream decision
accuracy, a simple corruption-calibration proxy derived from payload/flag
consistency, provenance survival, direction-label fidelity for direction-head
comparisons, and unsafe-finite-accept rates relative to the reference reject
set. Stage configs now also expose built-in "bad downstream behavior" toggles
for nan_to_num, missing-flag drops, clipping, default fills, JSON/CSV
round-trips, and aggregate_mean reductions. write_downstream_pipeline_report(...)
turns either a single simulation or a multi-strategy comparison into
DOWNSTREAM_PIPELINE_REPORT.md, DOWNSTREAM_PIPELINE_REPORT.json, and optional
DOWNSTREAM_PIPELINE_REPORT.html artifacts. The JSON sidecar stores
machine-readable per-stage sample snapshots for later analysis, while the human
reports keep the stage-by-stage loss summaries that explicitly split fault vs
semantic provenance when those labels are present.
build_downstream_pipeline_stage_summary_rows(...) and
write_downstream_pipeline_stage_summaries(...) provide flat per-stage JSON/CSV
tables for downstream comparisons. write_downstream_pipeline_visualization_pack(...)
writes composability_figure_pack/ with stage_transition_diagram.svg,
failure_propagation_diagram.svg, corruption_sensitivity_curves.svg, a
manifest, and the same stage-summary JSON/CSV payloads so composability failures
are inspectable without opening a notebook.
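To see why behaviors such as nan_to_num and missing-flag drops are called "bad", consider the toy model below. It is a standalone sketch, not the simulator: `Sample` is a hypothetical stand-in for the library's sample types, and the definition of the unsafe-finite-accept rate used here (share of reference rejects that arrive finite and unflagged) is an assumption made for illustration.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:  # hypothetical stand-in, not DownstreamPipelineSample
    value: float
    reject_flag: bool


def nan_to_num_stage(sample, fill=0.0):
    """A 'bad' stage: replace NaN payloads with a finite fill value."""
    return replace(sample, value=fill) if math.isnan(sample.value) else sample


def drop_flag_stage(sample):
    """A 'bad' stage: forget the reject flag during a handoff."""
    return replace(sample, reject_flag=False)


def unsafe_finite_accept_rate(samples, should_reject):
    """Share of reference rejects that arrive finite and unflagged."""
    rejects = [s for s, r in zip(samples, should_reject) if r]
    unsafe = [s for s in rejects if math.isfinite(s.value) and not s.reject_flag]
    return len(unsafe) / len(rejects) if rejects else 0.0


inputs = [Sample(float("nan"), True), Sample(1.5, False)]
truth = [True, False]  # the first sample should be rejected downstream
after_one = [nan_to_num_stage(s) for s in inputs]
after_two = [drop_flag_stage(s) for s in after_one]
print(unsafe_finite_accept_rate(inputs, truth))     # 0.0
print(unsafe_finite_accept_rate(after_one, truth))  # 0.0 (flag still set)
print(unsafe_finite_accept_rate(after_two, truth))  # 1.0
```

Neither stage is unsafe alone in this toy case; the rejected sample only becomes an apparently valid 0.0 once both the NaN payload and the flag have been lost, which is the kind of multi-step failure the 3-step and 5-step variants are meant to expose.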
4) examples/robotics/ inventory (status map)
No path under examples/robotics/ is currently a stable example. The
supported robotics entry points for current work are the benchmark harness,
scripts/reference_robotics_deployment.py, and the importable
zeroproofml.reference_robotics_* helpers.
| Path | Label | Why it is labeled this way |
|---|---|---|
| examples/robotics/__init__.py | archive | Package marker for a mostly legacy example folder; not a supported import surface by itself. |
| examples/robotics/demo_rr_ik.py | archive | Paper-era RR demo built on the legacy zeroproof/TR stack and kept only as a historical walkthrough. |
| examples/robotics/rr_ik_quick.py | archive | Convenience wrapper around the legacy RR trainer; useful for ad hoc reruns, but outside the current supported reproduction contract. |
| examples/robotics/rr_ik_train.py | archive | Large paper-era RR training script used by older ablation helpers, not by the current benchmark or reference-deployment APIs. |
| examples/robotics/rr_ik_dataset.py | benchmark input | Legacy wrapper that still delegates to the maintained RR IK benchmark dataset generator in zeroproofml.benchmarks.domains.rr_ik_dataset. |
| examples/robotics/rrr_ik_dataset.py | experimental example | Optional 3R dataset generator used for extra evidence paths, but not promoted to the supported benchmark/reference robotics surface. |
| examples/robotics/rrr_ik_train.py | experimental example | Optional 3R training/eval script for the same extra-evidence path; experimental and not covered by the stable benchmark contract. |
| examples/robotics/ik6r_dataset.py | experimental example | Optional 6R dataset generator used by legacy paper-compatible runners, with no current first-class benchmark/API promotion. |
| examples/robotics/ik6r_train.py | experimental example | Optional 6R training/eval script paired with the 6R generator; kept available for exploratory evidence rather than stable support. |
Decision: keep examples/robotics/rrr_ik_* and examples/robotics/ik6r_*
as experimental examples, not first-class experimental integrations. They
remain available for exploratory or paper-compatible reruns, but promotion is
deferred until they have a maintained zeroproofml.* import surface, a
versioned artifact contract, and targeted CI coverage comparable to the RR
benchmark/reference robotics paths.
Canonical naming for the remaining robotics example families now follows the
script stem: rr_ik_* for the benchmark-input RR wrapper, rrr_ik_* for the
3R example, and ik6r_* for the 6R example. In practice that means the 3R
example now defaults to data/rrr_ik_dataset.json and
results/robotics/rrr_ik/rrr_ik_results.json, while the 6R example keeps
data/ik6r_dataset.json and results/robotics/ik6r/ik6r_results.json.
The experimental 3R/6R dataset generators also accept the RR-style
--bucket-edges and --no-ensure_buckets_nonzero flags, while retaining the
older underscore spellings for compatibility with legacy scripts.
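Accepting a hyphenated flag alongside a legacy underscore spelling is straightforward with argparse, since one argument can carry several option strings. The sketch below only illustrates that pattern: the underscore alias `--bucket_edges`, the space-separated float format, and `build_parser` are assumptions for this example, not a copy of the generators' actual parsers.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="flag-alias sketch")
    # Both spellings store into the same destination attribute.
    # ASSUMPTION: "--bucket_edges" stands in for the legacy spelling.
    parser.add_argument(
        "--bucket-edges", "--bucket_edges",
        dest="bucket_edges", type=float, nargs="+", default=None,
    )
    return parser


new_style = build_parser().parse_args(["--bucket-edges", "0.1", "0.5"])
old_style = build_parser().parse_args(["--bucket_edges", "0.1", "0.5"])
print(new_style.bucket_edges == old_style.bucket_edges)  # True
```

Keeping both spellings on a single `add_argument` call means the help text and default stay in one place while legacy scripts keep working.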
Comparing runs across commits (provenance-first)
Each benchmark run writes a versioned manifest.json plus a versioned provenance.json that record
the git commit SHA, package versions, arguments, dataset identifiers/fingerprints, metric definitions,
OS / CPU / GPU model details, Python plus backend package versions (torch, onnx,
onnxruntime, and installed optional backends), SHA256 hashes for saved checkpoints,
discovered bundle directories, post-processed summaries,
and whether the git worktree was already dirty when the run began. If a paper-mode run is resumed,
resume_state.json preserves the original startup record and provenance.json["resume"] records
both the original and resumed attempts.
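When two runs disagree, a field-by-field diff of their provenance records is often the fastest first check. The following generic sketch flattens nested dictionaries and reports differing keys. The example payloads, key names, and version strings are hypothetical; real provenance.json files may be structured differently.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_records(old, new):
    """Map each differing dotted key to its (old, new) values."""
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Hypothetical payloads; real provenance.json keys may differ.
old = {"git": {"sha": "abc123", "dirty": False}, "packages": {"torch": "2.3.0"}}
new = {"git": {"sha": "def456", "dirty": False},
       "packages": {"torch": "2.3.0", "onnx": "1.16.0"}}
print(diff_records(old, new))
```

With real runs, each record would be loaded via `json.loads(Path(run_dir, "provenance.json").read_text())` before being passed to `diff_records`.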
To compare a new run against a prior baseline, pass the baseline run's aggregated/summary.json:
python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1 \
--baseline-summary results/benchmarks/dose/<baseline_run>/aggregated/summary.json
This causes aggregated/paired_stats.json / aggregated/paired_stats.md to include baseline deltas.
Regenerating the report after that also writes
figures/baseline_delta.svg from those paired stats.
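Conceptually, a paired baseline delta compares the same seed across two runs. The sketch below shows that idea in its simplest form, assuming one scalar metric per seed. The helper names and the input format are hypothetical, and the statistics actually written to paired_stats.json may be richer than a mean and standard deviation.

```python
from statistics import mean, stdev


def paired_deltas(new_by_seed, baseline_by_seed):
    """Per-seed new-minus-baseline differences over the seeds both runs share."""
    shared = sorted(new_by_seed.keys() & baseline_by_seed.keys())
    return {seed: new_by_seed[seed] - baseline_by_seed[seed] for seed in shared}


def summarize(deltas):
    """Summarize paired differences; seeds present in only one run are ignored."""
    values = list(deltas.values())
    return {
        "n_pairs": len(values),
        "mean_delta": mean(values) if values else None,
        "stdev_delta": stdev(values) if len(values) > 1 else None,
    }


# Illustrative metric values keyed by seed.
new_run = {0: 0.5, 1: 0.75, 2: 1.0}
baseline = {0: 0.25, 1: 0.25, 3: 0.0}
print(summarize(paired_deltas(new_run, baseline)))
```

Pairing by seed removes seed-to-seed variance from the comparison, which is why unmatched seeds (2 and 3 here) are dropped rather than averaged in.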
For Python workflows that need one or more baselines without hand-written
summary loaders, use the run-loading API:
from zeroproofml.benchmarks import compare_benchmark_runs, load_benchmark_run
run = load_benchmark_run("results/benchmarks/dose/<new_run>")
comparison = compare_benchmark_runs(
    run,
    [
        "results/benchmarks/dose/<baseline_run_a>",
        "results/benchmarks/dose/<baseline_run_b>",
    ],
)
print(comparison.to_dict())
Archive / prior paper workflow (not recommended)
The following scripts remain in-repo for historical reference, but are not the primary reproduction path:
- scripts/run_paper_260116.sh
- scripts/EXPERIMENTAL_SUITE_README.md
- scripts/paper_vps/ (older claim-audit and paper-specific post-processing)
Prefer the benchmark harness and reference deployment above for current, auditable runs.