Experiments & Reproduction

This page front-loads the current, supported reproduction entry points (benchmarks + reference deployment), and keeps older paper-era workflows in a clearly labeled archive section.

For the shortest published-docs rerun checklist, start at 31_reproduce_the_paper.md.

The frozen replay contract for the current paper-facing runs lives in artifacts/paper_2026/. It records the exact commands, config snapshots, expected output paths, tolerance bands, SHA256 inputs, and pinned CPU/GPU container recipes for the supported reruns below. Native CPU reruns use requirements-paper.txt / requirements-bench.txt; GPU reruns should use the pinned container recipe.

Top-level shortcuts mirror the frozen paper bundle:

make reproduce-paper
make reproduce-dose
make reproduce-rf
make reproduce-ik
make reproduce-reference-robotics

Override the default output root or device when needed, for example make reproduce-dose REPRO_ROOT=results/tmp_repro REPRO_DEVICE=cpu.
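When the shortcuts are driven from a script (for example, to loop over several targets), the override syntax above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name make_reproduce_command is hypothetical, and only the REPRO_ROOT and REPRO_DEVICE variables named above are used.

```python
import shlex


def make_reproduce_command(target, repro_root=None, repro_device=None):
    """Build the argv for one `make reproduce-*` shortcut (hypothetical helper)."""
    argv = ["make", target]
    # Variables given on the make command line override Makefile defaults.
    if repro_root is not None:
        argv.append(f"REPRO_ROOT={repro_root}")
    if repro_device is not None:
        argv.append(f"REPRO_DEVICE={repro_device}")
    return argv


cmd = make_reproduce_command("reproduce-dose", "results/tmp_repro", "cpu")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root; the sketch only prints the command so it has no side effects.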

If you need an isolated rerun environment, build one of the pinned container recipes listed in artifacts/paper_2026/manifest.json and use the repo checkout as the mounted /workspace volume inside that container.

Current reproduction entry points (recommended)

1) Scientific claim benchmark harness (DOSE / RF / IK)

Run the shipped, auditable benchmark domains via the first-class harness:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks rf --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks ik --mode smoke --device cpu --seeds 1

Resume an interrupted paper-mode run with the same --out-root:

python -m zeroproofml.benchmarks dose --mode paper --device cpu \
  --out-root results/benchmarks/dose/<run_dir> --resume

For explicit existing-directory reuse, use --skip-complete-seeds to fill only missing canonical seed results, or --force-rerun to rerun the requested seeds inside the same --out-root.
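Before choosing between --resume, --skip-complete-seeds, and --force-rerun, it can help to see which seed directories already hold a canonical result. The sketch below inspects a run directory using the seed_*/per_seed_result.json layout listed under "Expected outputs". Treating the presence of that file as "complete" is an assumption made for illustration; the harness may apply stricter checks. The helper name split_seed_dirs is hypothetical.

```python
from pathlib import Path


def split_seed_dirs(run_dir):
    """Return (complete, incomplete) seed directory names under run_dir.

    Assumption: a seed counts as complete when per_seed_result.json exists.
    """
    complete, incomplete = [], []
    for seed_dir in sorted(Path(run_dir).glob("seed_*")):
        if not seed_dir.is_dir():
            continue
        has_result = (seed_dir / "per_seed_result.json").is_file()
        (complete if has_result else incomplete).append(seed_dir.name)
    return complete, incomplete
```

Calling split_seed_dirs("results/benchmarks/dose/<run_dir>") on an interrupted run would list the seeds that still lack a result payload.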

Add --html-report when you also want a browser-friendly RUN_REPORT.html alongside the default RUN_REPORT.md.

To rebuild RUN_REPORT.md or add RUN_REPORT.html from an existing run directory, use:

python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report

The report command also writes figures/metric_summary.svg, figures/per_seed_metric_distributions.svg, and, when paired baseline stats exist, figures/baseline_delta.svg. For DOSE runs with aggregated/dose_diagnostics.json, it additionally writes figures/dose_figure_pack/ with threshold sweeps, macro-F1 vs finite-MAE trade-offs, censor-direction confusion plots, assay-limit edge cases, and, when provenance split metrics are available, operating-point bottom-provenance histograms. Optional HTML reports embed those figures. Regenerated reports use artifact-relative paths and refresh aggregated/summary.md from aggregated/summary.json, so copied run directories produce the same report content. When fault_rate / semantic_bottom_rate provenance metrics are present, the standard report adds a fault-vs-semantic bottom breakdown.

Pass --baseline-run-dir <previous_run_dir> when the paired stats should be recomputed against another benchmark run. The report CLI can also summarize deployment bundles with python -m zeroproofml.report bundle <bundle_dir> and JSONL training logs with python -m zeroproofml.report training-log <metrics.jsonl>. Those modes write VALIDATION_REPORT.summary.svg and <stem>_REPORT_metrics.svg, respectively. If Plotly is installed via zeroproofml[interactive], the training-log path also writes <stem>_REPORT.html with interactive metric traces. Bundle reports also discover parent benchmark summaries and DOSE operating-point calibration sidecars when the bundle lives inside a benchmark artifact directory.
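For a quick look at a JSONL metrics log without generating a report, the file can be scanned directly. This is a minimal sketch that assumes one JSON object per line; the field names in the example records (step, loss) are hypothetical, and the actual training-log schema is not specified here.

```python
import json


def summarize_jsonl_metrics(lines):
    """Collect first/last/min/max/count for each numeric field in JSONL records."""
    summary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for key, value in record.items():
            # Skip booleans (a subclass of int) and non-numeric values.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            stats = summary.setdefault(
                key,
                {"first": value, "last": value, "min": value, "max": value, "count": 0},
            )
            stats["last"] = value
            stats["min"] = min(stats["min"], value)
            stats["max"] = max(stats["max"], value)
            stats["count"] += 1
    return summary


# Hypothetical records; real logs may use different field names.
example = ['{"step": 0, "loss": 1.5}', '', '{"step": 1, "loss": 0.75, "tag": "eval"}']
print(summarize_jsonl_metrics(example)["loss"])
```

An open file handle can be passed in place of the list, since iterating over a text file yields lines.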

Expected (qualitative) runtime:

- smoke: minutes on CPU (depends on machine).
- paper: longer, intended for scheduled/manual runs.

In GitLab CI, merge request pipelines keep these claim-benchmark:* jobs in smoke mode; paper mode is exposed only through scheduled runs and manual jobs on the default branch. That same scheduled/manual lane also runs a curated ONNX export compatibility matrix across pinned Torch/ONNX/onnxruntime stacks, covering tests/inference/test_export_bundle.py and tests/inference/test_export_compatibility.py to catch bundle export/runtime drift.

Expected outputs:

- A run directory under results/benchmarks/<domain>/run_<timestamp>_<sha>/ containing:
  - seed_*/per_seed_result.json (canonical per-seed raw result payloads)
  - seed_*/rf_signal_traces.json (RF only; representative per-seed figure-regeneration traces)
  - seed_*/rf_qualitative_figure_pack/ (RF only; targeted washout / invented-peak SVGs plus manifest/README)
  - seed_*/dose_diagnostics.json (DOSE only)
  - figures/dose_figure_pack/ (DOSE report regeneration only; domain-specific SVG pack plus manifest/README)
  - manifest.json (versioned schema)
  - provenance.json (versioned schema)
  - resume_state.json (paper-mode resume state + attempt history cache)
  - RUN_REPORT.md (RUN_REPORT.html too when --html-report is set)
  - aggregated/summary.json + aggregated/summary.md
  - aggregated/benchmark_metrics.jsonl
  - aggregated/dose_operating_points.json + aggregated/dose_operating_points.md (DOSE only)
  - aggregated/dose_pareto_front.json + aggregated/dose_pareto_front.md (DOSE only)
  - aggregated/dose_diagnostics.json (DOSE only)
  - aggregated/dirhead_only_summary.json (DOSE --mode frozen-dirhead only)
  - aggregated/paired_stats.json + aggregated/paired_stats.md
  - CLAIM_AUDIT.md
  - figures/metric_summary.svg, figures/per_seed_metric_distributions.svg, and optional figures/baseline_delta.svg
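A finished run directory can be sanity-checked against this list with a few lines of standard-library Python. The sketch below is illustrative only: the helper name missing_artifacts is hypothetical, and the chosen subset (entries listed above without a domain- or mode-specific qualifier) is an assumption about which files are unconditional.

```python
from pathlib import Path

# Assumption: these entries are written for every domain and mode.
# Domain-specific (RF/DOSE), paper-mode, and optional artifacts are left out.
EXPECTED_ARTIFACTS = (
    "manifest.json",
    "provenance.json",
    "RUN_REPORT.md",
    "CLAIM_AUDIT.md",
    "aggregated/summary.json",
    "aggregated/summary.md",
    "aggregated/benchmark_metrics.jsonl",
)


def missing_artifacts(run_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected relative paths that are not files under run_dir."""
    root = Path(run_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty return value means every listed file is present; it says nothing about whether the contents are valid.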

For the full artifact glossary and schema marker map, see 40_benchmark_artifact_reference.md.

For DOSE runs, the diagnostics JSONs capture the richer artifact pack used to inspect operating points: confusion matrices, tau_infer sweep curves, denominator |Q| histograms, borderline threshold examples, and censored-subset direction-quality summaries. Report regeneration turns those diagnostics into a standardized DOSE figure pack under figures/dose_figure_pack/. The harness also writes dose_pareto_front.{json,md} automatically so the safety/coverage/direction trade-off summary is generated from the same aggregate metrics as the operating-point report.

For RF runs, rf_signal_traces.json saves deterministic in-band plus extrapolation frequency-response traces, and rf_frequency_response.svg turns those saved traces into a committed Bode-style artifact so later plotting does not need to retrain or reconstruct ad hoc trace selections; the saved figure also overlays peak annotations, strict-trigger regions, and shared-denominator minima for the traced models. Each seed now also writes rf_qualitative_figure_pack/, which extracts saved baseline failure cases into a small pack of washout and invented-peak SVGs for paper writing and debugging.

2) Reference deployment (robotics RR IK)

End-to-end reference path (train → bundle → strict inference → fallback → operator report):

python scripts/reference_robotics_deployment.py --device cpu --epochs 1 --n-samples 2000

Expected outputs:

- A run directory under results/reference_deploy_robotics/ containing at least:
  - output_contract.json (versioned reference-deployment artifact contract)
  - inference_summary.json (basic strict rates plus hybrid_path_metrics, fallback_family_benchmark, and SCM-only vs unconstrained vs hybrid deltas; when diagnostic provenance is available it also records provenance_routing_comparison, provenance_routing_materiality, and provenance_benchmark_evidence)
  - bundle/model.onnx
  - bundle/metadata.json
  - bundle/VALIDATION_REPORT.md
  - bundle/VALIDATION_REPORT.summary.json

The same flow is also importable from Python via zeroproofml.reference_robotics_deployment.run_reference_robotics_deployment(...), and completed run directories can be reloaded with load_reference_robotics_deployment_artifacts(...) when CI or downstream tooling needs structured artifact objects instead of manually inspecting the directory layout. load_reference_robotics_deployment_artifacts(...) now validates the versioned output_contract.json when present and falls back to the legacy fixed layout for older runs.

3) Synthetic trajectory-evaluation dataset (robotics RR IK)

Generate a stratified trajectory dataset for rollout or control-loop evaluation:

python scripts/generate_reference_robotics_trajectory_data.py \
  --n-trajectories 48 --steps-per-trajectory 16

Expected outputs:

- results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset.json
- results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset_stress_tests/*.json

The generated JSON is stratified over four workspace quadrants and three |det(J)| proximity bins (near, mid, far). Each trajectory stores state-transition records with the current joint state, target workspace waypoint, reference DLS action, next joint state, and per-step singularity diagnostics. The generator also materializes per-stratum stress-test subset JSONs, one for each (workspace_region, |det(J)| bin) pair, so targeted rollout checks can be run without writing custom dataset filters.
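To make the stratification concrete: for a planar RR arm with link lengths l1 and l2, det(J) = l1 * l2 * sin(q2), so |det(J)| depends only on the elbow angle. The sketch below assigns a joint configuration to one of the 4 x 3 = 12 strata. It is an illustration under stated assumptions, not the generator's code: the bin edges (0.1, 0.5), the unit link lengths, the quadrant labels Q1..Q4, the sign-based quadrant rule, and the reading of "near" as "close to a singularity" are all hypothetical.

```python
import math
from itertools import product

QUADRANTS = ("Q1", "Q2", "Q3", "Q4")  # hypothetical labels
DETJ_BINS = ("near", "mid", "far")


def rr_forward(q1, q2, l1=1.0, l2=1.0):
    """End-effector position of a planar RR arm."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def rr_abs_detj(q2, l1=1.0, l2=1.0):
    """|det(J)| of a planar RR arm: l1 * l2 * |sin(q2)|."""
    return abs(l1 * l2 * math.sin(q2))


def detj_bin(abs_detj, edges=(0.1, 0.5)):
    """Map |det(J)| to a proximity bin; the edges are assumed values."""
    if abs_detj < edges[0]:
        return "near"
    if abs_detj < edges[1]:
        return "mid"
    return "far"


def workspace_quadrant(x, y):
    """Quadrant by the signs of the end-effector coordinates (assumed rule)."""
    if x >= 0 and y >= 0:
        return "Q1"
    if x < 0 and y >= 0:
        return "Q2"
    if x < 0:
        return "Q3"
    return "Q4"


def stratum(q1, q2):
    """Return the (workspace_region, |det(J)| bin) pair for one configuration."""
    x, y = rr_forward(q1, q2)
    return workspace_quadrant(x, y), detj_bin(rr_abs_detj(q2))


all_strata = list(product(QUADRANTS, DETJ_BINS))
print(len(all_strata), stratum(0.0, math.pi / 2), stratum(math.pi, 0.01))
```

Each element of all_strata corresponds to one (workspace_region, |det(J)| bin) pair of the kind the per-stratum stress-test subsets are keyed on.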

To evaluate a policy in closed loop against that dataset, use the importable trajectory evaluator:

from zeroproofml.reference_robotics_trajectory_eval import (
    evaluate_reference_robotics_trajectory_policy,
    make_reference_robotics_dls_policy,
)

summary = evaluate_reference_robotics_trajectory_policy(
    "results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset.json",
    make_reference_robotics_dls_policy(damping=0.05),
)
print(summary["aggregate"]["mean_tracking_error"])
print(summary["by_start_singularity_bin"]["near"]["fallback_rate"])
print(summary["aggregate"]["joint_limit_violation_rate"])

Custom policies receive the current RR state, target waypoint, closed-loop command, and reference DLS action for each step, so batch point predictors can be replayed as trajectory-level controllers instead of being measured only on independent points. The returned summary also includes fallback_rate_by_step_index traces plus aggregate/per-bin joint_limit_violation_*, chattering_event_*, and policy_latency_budget_violation_* fields. If a policy tags fallback kinds with fault or semantic, those traces also expose provenance-aware splits.

For workspace plots, zeroproofml.utils.viz.plot_workspace_rate_heatmaps(...) bins 2D workspace points into bottom/gap/fallback rate heatmaps and colors bottom/fallback panels by fault-vs-semantic provenance when those diagnostics are available. The related plotting helpers are:

- plot_2d_mask_map(...) and plot_3d_mask_map(...) scatter raw bottom/gap decisions over 2D or 3D sample coordinates.
- plot_route_to_solver_overlay(...) provides the corresponding route-vs-reject scatter overlay for fallback decisions.
- plot_fallback_route_timeline(...) plots per-batch route/reject rates with provenance tags.
- plot_monitoring_batch_summary(...) plots exported monitor batch summaries.
- plot_detj_stratified_metrics(...) turns per-step fallback masks or tracking errors into |det(J)| bucket summaries that line up with the dataset's near / mid / far strata.

Regenerating an IK benchmark report also writes figures/robotics_figure_pack/ from the saved RR IK seed artifacts when they are present, covering workspace heatmaps, |det(J)| error/fallback plots, route-to-analytic-solver maps, and fallback timelines.
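For readers unfamiliar with the DLS (damped least squares) action used as the reference, the textbook update for a planar RR arm is dq = J^T (J J^T + lambda^2 I)^-1 e, where e is the workspace error. The sketch below implements that formula in pure Python. It is not the implementation behind make_reference_robotics_dls_policy; in particular, whether the library's damping argument enters as lambda or lambda^2 is an assumption, as are the unit link lengths.

```python
import math


def rr_jacobian(q1, q2, l1=1.0, l2=1.0):
    """2x2 position Jacobian of a planar RR arm."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def dls_step(jac, error, damping=0.05):
    """Textbook DLS: dq = J^T (J J^T + damping^2 I)^-1 error (assumed form)."""
    (a, b), (c, d) = jac
    # A = J J^T + damping^2 I, a symmetric 2x2 matrix.
    a11 = a * a + b * b + damping ** 2
    a12 = a * c + b * d
    a22 = c * c + d * d + damping ** 2
    det = a11 * a22 - a12 * a12
    # y = A^-1 error via the closed-form 2x2 inverse.
    y1 = (a22 * error[0] - a12 * error[1]) / det
    y2 = (-a12 * error[0] + a11 * error[1]) / det
    # dq = J^T y
    return [a * y1 + c * y2, b * y1 + d * y2]


# At q2 = 0 the arm is fully stretched and J is singular; damping keeps A invertible.
print(dls_step(rr_jacobian(0.0, 0.0), [0.1, 0.0], damping=0.05))
```

With damping=0 and a non-singular Jacobian the step reduces to the exact inverse solution; with positive damping it stays finite at singular configurations at the cost of some tracking accuracy.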

Downstream pipeline simulator (composability)

For multi-step composability experiments, use the importable downstream pipeline simulator:

from pathlib import Path

from zeroproofml.downstream_pipeline import (
    DownstreamPipelineReferenceSample,
    DownstreamPipelineSample,
    build_downstream_pipeline_simulator,
    compare_downstream_pipeline_strategies,
    write_downstream_pipeline_report,
    write_downstream_pipeline_visualization_pack,
)

simulator = build_downstream_pipeline_simulator(
    "5-step",
    drop_reject_flag_probability=0.05,
    bad_downstream_behaviors=("json_roundtrip", "aggregate_mean"),
    default_fill_value=0.0,
    roundtrip_digits=4,
)
result = simulator.simulate(
    [
        DownstreamPipelineSample(
            decoded=(float("nan"),),
            reject_flag=True,
            provenance="semantic",
            sample_id="strict_bottom",
        ),
    ]
)
print(result.stage_summaries[-1].unsafe_finite_accept_rate)
print(result.stage_summaries[-1].downstream_decision_accuracy_rate)

comparison = compare_downstream_pipeline_strategies(
    [
        DownstreamPipelineReferenceSample(
            decoded=(-3.0,),
            should_reject=True,
            provenance="semantic",
            direction_label="below",
            sample_id="censored_low",
        ),
    ],
    simulator,
)
print(comparison.by_strategy()["strict SCM + direction head"].stage_summaries[-1].direction_label_fidelity_rate)

report_md, report_html = write_downstream_pipeline_report(
    Path("artifacts/composability"),
    result=comparison,
    include_html=True,
)
figure_pack = write_downstream_pipeline_visualization_pack(
    Path("artifacts/composability"),
    result=comparison,
)
print(report_md, report_html)
print(figure_pack["stage_transition_diagram"])

The built-in 1-step, 3-step, and 5-step variants provide a small experimental harness for studying whether NaN-carried bottoms, reject flags, and provenance labels survive downstream handoffs. compare_downstream_pipeline_strategies(...) now runs the same simulator across the built-in scalar-only baseline, abstention/uncertainty baseline, strict SCM, and strict SCM + direction-head encodings.

Each stage summary records finite-payload rates, reject-flag fidelity, propagated reject/censor/singularity fidelity, downstream decision accuracy, a simple corruption-calibration proxy derived from payload/flag consistency, provenance survival, direction-label fidelity for direction-head comparisons, and unsafe-finite-accept rates relative to the reference reject set. Stage configs now also expose built-in "bad downstream behavior" toggles for nan_to_num, missing-flag drops, clipping, default fills, JSON/CSV round-trips, and aggregate_mean reductions.

write_downstream_pipeline_report(...) turns either a single simulation or a multi-strategy comparison into DOWNSTREAM_PIPELINE_REPORT.md, DOWNSTREAM_PIPELINE_REPORT.json, and optional DOWNSTREAM_PIPELINE_REPORT.html artifacts. The JSON sidecar stores machine-readable per-stage sample snapshots for later analysis, while the human reports keep the stage-by-stage loss summaries that explicitly split fault vs semantic provenance when those labels are present. build_downstream_pipeline_stage_summary_rows(...) and write_downstream_pipeline_stage_summaries(...) provide flat per-stage JSON/CSV tables for downstream comparisons. write_downstream_pipeline_visualization_pack(...) writes composability_figure_pack/ with stage_transition_diagram.svg, failure_propagation_diagram.svg, corruption_sensitivity_curves.svg, a manifest, and the same stage-summary JSON/CSV payloads so composability failures are inspectable without opening a notebook.
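The failure mode these toggles probe can be shown in a few lines without the simulator. The sketch below uses a hypothetical Sample dataclass (a stand-in, not DownstreamPipelineSample) and hand-written stage functions. It shows that a nan_to_num-style fill alone is not yet an unsafe finite accept while the reject flag survives, but becomes one once a later stage also drops the flag. The definition of "unsafe finite accept" used here (finite payload, no reject flag, reference says reject) is an assumption based on the description above.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for a pipeline sample (not the library type)."""
    decoded: float
    reject_flag: bool


def nan_to_num_stage(sample, fill=0.0):
    """Bad behavior: replace a NaN payload with a finite fill value."""
    if math.isnan(sample.decoded):
        return replace(sample, decoded=fill)
    return sample


def drop_flag_stage(sample):
    """Bad behavior: lose the reject flag at a handoff."""
    return replace(sample, reject_flag=False)


def is_unsafe_finite_accept(sample, should_reject):
    """Assumed definition: finite payload, no flag, but the reference rejects."""
    return should_reject and math.isfinite(sample.decoded) and not sample.reject_flag


strict_bottom = Sample(decoded=float("nan"), reject_flag=True)
after_fill = nan_to_num_stage(strict_bottom)
after_drop = drop_flag_stage(after_fill)
print(
    is_unsafe_finite_accept(strict_bottom, True),
    is_unsafe_finite_accept(after_fill, True),
    is_unsafe_finite_accept(after_drop, True),
)  # prints: False False True
```

The same check applied to a flag drop without the fill stays False, because the NaN payload still carries the bottom. That redundancy between payload and flag is what multi-stage variants are meant to stress.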

4) examples/robotics/ inventory (status map)

No path under examples/robotics/ is currently a stable example. The supported robotics entry points for current work are the benchmark harness, scripts/reference_robotics_deployment.py, and the importable zeroproofml.reference_robotics_* helpers.

| Path | Label | Why it is labeled this way |
| --- | --- | --- |
| examples/robotics/__init__.py | archive | Package marker for a mostly legacy example folder; not a supported import surface by itself. |
| examples/robotics/demo_rr_ik.py | archive | Paper-era RR demo built on the legacy zeroproof/TR stack and kept only as a historical walkthrough. |
| examples/robotics/rr_ik_quick.py | archive | Convenience wrapper around the legacy RR trainer; useful for ad hoc reruns, but outside the current supported reproduction contract. |
| examples/robotics/rr_ik_train.py | archive | Large paper-era RR training script used by older ablation helpers, not by the current benchmark or reference-deployment APIs. |
| examples/robotics/rr_ik_dataset.py | benchmark input | Legacy wrapper that still delegates to the maintained RR IK benchmark dataset generator in zeroproofml.benchmarks.domains.rr_ik_dataset. |
| examples/robotics/rrr_ik_dataset.py | experimental example | Optional 3R dataset generator used for extra evidence paths, but not promoted to the supported benchmark/reference robotics surface. |
| examples/robotics/rrr_ik_train.py | experimental example | Optional 3R training/eval script for the same extra-evidence path; experimental and not covered by the stable benchmark contract. |
| examples/robotics/ik6r_dataset.py | experimental example | Optional 6R dataset generator used by legacy paper-compatible runners, with no current first-class benchmark/API promotion. |
| examples/robotics/ik6r_train.py | experimental example | Optional 6R training/eval script paired with the 6R generator; kept available for exploratory evidence rather than stable support. |

Decision: keep examples/robotics/rrr_ik_* and examples/robotics/ik6r_* as experimental examples, not first-class experimental integrations. They remain available for exploratory or paper-compatible reruns, but promotion is deferred until they have a maintained zeroproofml.* import surface, a versioned artifact contract, and targeted CI coverage comparable to the RR benchmark/reference robotics paths.

Canonical naming for the remaining robotics example families now follows the script stem: rr_ik_* for the benchmark-input RR wrapper, rrr_ik_* for the 3R example, and ik6r_* for the 6R example. In practice that means the 3R example now defaults to data/rrr_ik_dataset.json and results/robotics/rrr_ik/rrr_ik_results.json, while the 6R example keeps data/ik6r_dataset.json and results/robotics/ik6r/ik6r_results.json. The experimental 3R/6R dataset generators also accept the RR-style --bucket-edges and --no-ensure_buckets_nonzero flags, while retaining the older underscore spellings for compatibility with legacy scripts.
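Accepting both a hyphenated and an underscore spelling for the same option is a standard argparse pattern: several option strings can share one destination. The sketch below illustrates the idea for --bucket-edges only. It is not the parser used by the example scripts, and the float-list value type is an assumption.

```python
import argparse


def build_parser():
    """Alias sketch: one destination, two accepted spellings (not the repo's parser)."""
    parser = argparse.ArgumentParser(description="flag-alias illustration")
    parser.add_argument(
        "--bucket-edges",   # RR-style spelling
        "--bucket_edges",   # legacy underscore spelling kept for compatibility
        dest="bucket_edges",
        type=float,         # assumed value type
        nargs="+",
        default=None,
    )
    return parser


new_style = build_parser().parse_args(["--bucket-edges", "0.1", "0.5"])
old_style = build_parser().parse_args(["--bucket_edges", "0.1", "0.5"])
print(new_style.bucket_edges == old_style.bucket_edges)
```

Because both spellings write to the same dest, downstream code reads args.bucket_edges regardless of which form a legacy script passes.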

Comparing runs across commits (provenance-first)

Each benchmark run writes a versioned manifest.json plus a versioned provenance.json that record the git commit SHA, package versions, arguments, dataset identifiers/fingerprints, metric definitions, OS / CPU / GPU model details, Python plus backend package versions (torch, onnx, onnxruntime, and installed optional backends), SHA256 hashes for saved checkpoints, discovered bundle directories, post-processed summaries, and whether the git worktree was already dirty when the run began. If a paper-mode run is resumed, resume_state.json preserves the original startup record and provenance.json["resume"] records both the original and resumed attempts.
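To confirm independently that a checkpoint or summary file still matches a recorded SHA256 value, the digest can be recomputed with the standard library. This is a minimal sketch; the helper name sha256_file is hypothetical, and where exactly each hash is stored inside provenance.json is not specified here, so the comparison against the recorded value is left to the reader.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing sha256_file(path) against the recorded hex string with == is sufficient for integrity checking of local artifacts.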

To compare a new run against a prior baseline, pass the baseline run's aggregated/summary.json:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1 \
  --baseline-summary results/benchmarks/dose/<baseline_run>/aggregated/summary.json

This causes aggregated/paired_stats.json / aggregated/paired_stats.md to include baseline deltas. Regenerating the report after that also writes figures/baseline_delta.svg from those paired stats. For Python workflows that need one or more baselines without hand-written summary loaders, use the run-loading API:

from zeroproofml.benchmarks import compare_benchmark_runs, load_benchmark_run

run = load_benchmark_run("results/benchmarks/dose/<new_run>")
comparison = compare_benchmark_runs(
    run,
    [
        "results/benchmarks/dose/<baseline_run_a>",
        "results/benchmarks/dose/<baseline_run_b>",
    ],
)
print(comparison.to_dict())
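As intuition for what a paired baseline delta means, the sketch below pairs one metric by seed across two runs and averages the per-seed differences. It is a generic illustration only: the input shape (a mapping from seed to metric value) and the helper name paired_deltas are hypothetical, and the statistics actually stored in aggregated/paired_stats.json may differ.

```python
from statistics import mean


def paired_deltas(new_by_seed, baseline_by_seed):
    """Return per-seed (new - baseline) deltas over shared seeds and their mean."""
    shared = sorted(set(new_by_seed) & set(baseline_by_seed))
    if not shared:
        raise ValueError("no shared seeds to pair")
    deltas = {seed: new_by_seed[seed] - baseline_by_seed[seed] for seed in shared}
    return deltas, mean(deltas.values())


# Hypothetical metric values keyed by seed; seed 2 exists only in the baseline.
deltas, mean_delta = paired_deltas({0: 0.5, 1: 0.75}, {0: 0.25, 1: 0.25, 2: 0.9})
print(deltas, mean_delta)
```

Pairing by seed removes seed-to-seed variance from the comparison, which is why unpaired seeds are dropped and not averaged in.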

Archive (paper-era workflows)

The following scripts remain in-repo for historical reference, but are not the primary reproduction path:

  • scripts/run_paper_260116.sh
  • scripts/EXPERIMENTAL_SUITE_README.md
  • scripts/paper_vps/ (older claim-audit and paper-specific post-processing)

Prefer the benchmark harness and reference deployment above for current, auditable runs.