Benchmarks¶
This page summarizes the two benchmark families in the SCM stack.
Use these names consistently:
- Scientific claim benchmarks: invoked as python -m zeroproofml.benchmarks ...,
run in GitLab CI under job names prefixed with claim-benchmark:*, and write
auditable run directories to results/benchmarks/<domain>/....
- Performance microbenchmark suite: invoked as python perf/run_benchmarks.py or
python perf/parity_runner.py, and writes JSON/plot artifacts to
benchmark_results/.
Scientific claim benchmarks (DOSE / RF / IK)¶
Run the shipped, auditable benchmark domains via the first-class harness:
python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu
If a paper-mode run is interrupted, resume it against the same output directory:
python -m zeroproofml.benchmarks dose --mode paper --device cpu \
--out-root results/benchmarks/dose/<run_dir> --resume
Use --skip-complete-seeds when you want to reuse an existing output directory
and run only seeds that do not yet have canonical per_seed_result.json
artifacts. Use --force-rerun to rerun the requested seeds in an existing
--out-root.
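As a rough illustration of the seed-skipping idea, the sketch below lists which requested seeds still lack a per_seed_result.json under the seed_{n} layout. The helper name seeds_missing_result is hypothetical, and it checks only for file existence; the harness's own notion of a "canonical" artifact may be stricter (for example, schema validation).

```python
import tempfile
from pathlib import Path


def seeds_missing_result(run_dir, requested_seeds):
    """Return the requested seeds that lack seed_<n>/per_seed_result.json.

    Hypothetical helper: existence check only, no schema validation.
    """
    run_dir = Path(run_dir)
    return [
        seed
        for seed in requested_seeds
        if not (run_dir / f"seed_{seed}" / "per_seed_result.json").is_file()
    ]


# Demo on a throwaway directory: seed 0 is complete, seed 1 has a
# directory but no result file, and seed 2 has no directory at all.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "seed_0").mkdir()
    (root / "seed_0" / "per_seed_result.json").write_text("{}")
    (root / "seed_1").mkdir()
    missing = seeds_missing_result(root, [0, 1, 2])

print(missing)
```

A check like this can be handy before deciding between --skip-complete-seeds and --force-rerun.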
Add --html-report when you also want a browser-friendly RUN_REPORT.html
beside the default Markdown run report.
Regenerate the report later from the run artifacts without rerunning the benchmark:
python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report
This refreshes RUN_REPORT.md, writes figures/metric_summary.svg plus
figures/per_seed_metric_distributions.svg, and embeds those SVGs in
RUN_REPORT.html when --html-report is set. If paired baseline stats are
present or recomputed, it also writes figures/baseline_delta.svg from
aggregated/paired_stats.json. Regenerated reports use artifact-relative paths
and refresh aggregated/summary.md from aggregated/summary.json, so copying a
run directory does not change the report content. When fault_rate /
semantic_bottom_rate provenance metrics are present, the standard report adds
a fault-vs-semantic bottom breakdown.
Use --baseline-run-dir <previous_run_dir> to recompute paired baseline
comparisons from another completed run. The same report CLI also accepts
deployment bundle directories (python -m zeroproofml.report bundle <bundle_dir>)
and JSONL training logs (python -m zeroproofml.report training-log <metrics.jsonl>),
writing VALIDATION_REPORT.summary.svg or <stem>_REPORT_metrics.svg beside
those reports. If Plotly is installed via zeroproofml[interactive], the
training-log path also writes <stem>_REPORT.html with interactive metric
traces. Bundle reports also discover parent benchmark summaries and DOSE
operating-point calibration sidecars when the bundle lives inside a benchmark
artifact directory.
Reproduction landing page: 21_experiments.md.
Each run writes a self-contained directory under results/benchmarks/<domain>/.
Smoke and paper modes both use seed_{n} subdirectories so downstream tooling sees one layout:
- seed_*/per_seed_result.json (versioned per-seed raw schema; legacy domain-script filenames are normalized into this path by the harness)
- seed_*/rf_signal_traces.json (RF only; representative per-seed frequency-response traces saved for figure regeneration)
- seed_*/rf_frequency_response.svg (RF only; first-class Bode-style magnitude/phase figure regenerated from the saved trace sidecar, including peak annotations, strict-trigger bands, and denominator-minimum guides)
- seed_*/rf_qualitative_figure_pack/ (RF only; targeted washout / invented-peak SVGs plus a small manifest/README for saved baseline failure cases)
- seed_*/dose_diagnostics.json (DOSE only; per-seed confusion matrices, threshold sweeps, |Q| histograms, borderline examples, and censored-subset direction diagnostics)
- manifest.json (versioned schema)
- provenance.json (versioned schema)
- resume_state.json (paper-mode resume state + attempt history cache)
- RUN_REPORT.md (RUN_REPORT.html too when --html-report is set)
- aggregated/summary.json + aggregated/summary.md
- aggregated/benchmark_metrics.jsonl (versioned JSONL metric log emitted by the benchmark harness)
- aggregated/dose_operating_points.json + aggregated/dose_operating_points.md (DOSE only)
- aggregated/dose_pareto_front.json + aggregated/dose_pareto_front.md (DOSE only)
- aggregated/dose_diagnostics.json (DOSE only; aggregated confusion matrices, sweep curves, histograms, borderline examples, and direction-quality summaries)
- aggregated/dirhead_only_summary.json (DOSE --mode frozen-dirhead only)
- aggregated/paired_stats.json + aggregated/paired_stats.md
- CLAIM_AUDIT.md
- figures/metric_summary.svg
- figures/per_seed_metric_distributions.svg
- figures/baseline_delta.svg (only when paired baseline deltas are present)
provenance.json records per-seed dataset fingerprints for both generated configs and file-backed datasets,
and manifest.json mirrors those fingerprints for quick auditing.
The RF synthetic-resonator path stamps explicit dataset_name, dataset_version,
and dataset_generator fields into data_config, so RF run fingerprints freeze the
dataset-generation contract instead of relying only on the benchmark suite name.
The same data_config block carries the canonical RF split contract:
train and validation both sample moderate-Q filters with Q <= q_max_train,
test is the full held-out split over Q <= q_max_test, and extrapolation
is the standardized high-Q subset of test where Q > q_max_train.
RF per-seed results also stamp artifact_naming, so checkpoint filenames and the
legacy-vs-canonical seed-result JSON names are machine-readable instead of
re-derived ad hoc. That metadata also points at the rf_signal_traces.json
sidecar and the generated rf_frequency_response.svg plot artifact used for
figure regeneration and debugging.
The RF figure overlays peak annotations, model-specific strict-trigger regions,
and shared-denominator minima directly on the saved response traces. Each seed
also writes an rf_qualitative_figure_pack/ directory with targeted washout /
invented-peak examples selected from the saved trace set.
The RF per-seed schema models the runner's model_meta, per-model train_log, and
wall_time_s fields directly, while keeping the shared outer protocol /
data_config / train_config / runs layout aligned with the other benchmark domains.
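The RF split contract described above can be sketched as a small eligibility check. This is an illustration only: rf_split_eligibility is a hypothetical name, the thresholds in the demo are made up, and the sketch assumes q_max_train <= q_max_test (implied by extrapolation being a subset of test).

```python
def rf_split_eligibility(q, q_max_train, q_max_test):
    """Return the split names whose Q range contains this filter's Q.

    Hypothetical helper; assumes q_max_train <= q_max_test.
    """
    if q_max_train > q_max_test:
        raise ValueError("expected q_max_train <= q_max_test")
    splits = set()
    if q <= q_max_train:
        # Train and validation both sample moderate-Q filters.
        splits |= {"train", "validation"}
    if q <= q_max_test:
        # Test covers the full held-out range.
        splits.add("test")
        if q > q_max_train:
            # Extrapolation is the high-Q subset of test.
            splits.add("extrapolation")
    return splits


# Illustrative thresholds only; real values come from data_config.
moderate = rf_split_eligibility(5.0, q_max_train=10.0, q_max_test=50.0)
high = rf_split_eligibility(20.0, q_max_train=10.0, q_max_test=50.0)
out_of_range = rf_split_eligibility(60.0, q_max_train=10.0, q_max_test=50.0)
print(sorted(moderate), sorted(high), sorted(out_of_range))
```

Note that a moderate-Q filter is in range for train, validation, and test alike; which split it actually lands in is decided by the dataset generator, not by Q alone.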
RF summaries now also use a canonical metric set for the resonator benchmark:
peak_retention_yield, false_peak_hallucination_rate, in_band_mse,
extrapolation_mse, coverage_rate, strict_trigger_rate, and
denominator_min_abs_* sweep statistics. The runner still records the older
peak_success_rate / success_rate keys for compatibility, but post-processed
benchmark summaries normalize onto the standardized names. The per-seed
test_peak_clipping payload now also carries shared-denominator axis
diagnostics for models that expose Q(jw): minimum-frequency offset relative
to w0, the fraction of the sampled frequency axis that stays near the minimum,
and how often the minimum falls on a sweep edge.
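To make the three shared-denominator axis diagnostics concrete, here is a minimal sketch computed from one sampled |Q(jw)| sweep. The function name, the returned key names, and the near_factor tolerance that defines "near the minimum" are all assumptions for illustration; the runner's actual definitions may differ.

```python
def denominator_axis_diagnostics(freqs, q_abs, w0, near_factor=1.1):
    """Summarize where |Q(jw)| reaches its minimum along one sweep.

    Hypothetical helper: "near the minimum" is taken to mean
    |Q| <= near_factor * min|Q|, which is an assumed tolerance.
    """
    n = len(q_abs)
    i_min = min(range(n), key=lambda i: q_abs[i])
    q_min = q_abs[i_min]
    near = sum(1 for value in q_abs if value <= near_factor * q_min)
    return {
        "min_freq_offset": freqs[i_min] - w0,  # offset relative to w0
        "near_min_fraction": near / n,         # share of the axis near the minimum
        "min_on_edge": i_min in (0, n - 1),    # minimum sits on a sweep edge
    }


freqs = [1.0, 2.0, 3.0, 4.0, 5.0]
centered = denominator_axis_diagnostics(freqs, [5.0, 2.0, 1.0, 2.0, 5.0], w0=3.0)
edge = denominator_axis_diagnostics(freqs, [1.0, 2.0, 3.0, 4.0, 5.0], w0=3.0)

# "How often the minimum falls on a sweep edge" is then a mean over sweeps.
edge_rate = sum(d["min_on_edge"] for d in (centered, edge)) / 2
print(centered, edge, edge_rate)
```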
Both provenance.json and manifest.json also record OS / CPU / GPU model details, Python and backend package versions
(torch, onnx, onnxruntime, plus installed optional backends such as jax / jaxlib),
SHA256 hashes for saved checkpoints, discovered bundle directories,
the post-processed benchmark summary files, and whether the git worktree was already dirty when
the run began. Resumed paper-mode runs keep the original startup metadata in resume_state.json
and embed the full attempt history under provenance.json["resume"].
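If you want to spot-check a recorded checkpoint hash by hand, a streaming SHA256 over the file is enough. The sketch below uses only the standard library; sha256_of_file is a hypothetical helper, and where the recorded hash lives inside provenance.json is not assumed here (see the schema reference for the actual keys).

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large checkpoints are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a saved checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.bin"
    checkpoint.write_bytes(b"abc")
    file_hash = sha256_of_file(checkpoint)

print(file_hash)
```

Compare the printed digest with the value recorded for that checkpoint in the run's provenance data.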
You can verify a completed run directory from Python:
from zeroproofml.benchmarks import compare_benchmark_runs, load_benchmark_run, validate_run_dir

# Validate the run directory; the returned manifest exposes its schema version.
manifest = validate_run_dir("results/benchmarks/dose/run_20260407_150800_abcd123")
print(manifest["schema_version"])

# Load the run and compare it against one or more baseline run directories.
run = load_benchmark_run("results/benchmarks/dose/run_20260407_150800_abcd123")
comparison = compare_benchmark_runs(
    run,
    ["results/benchmarks/dose/run_20260401_120000_baseline"],
)
print(comparison.baselines[0].diffs)
Older benchmark artifacts without the current schema markers, or with an older schema version, now fail fast with a compatibility error. Migration loaders are intentionally deferred until there are external consumers that need them.
Artifact glossary and schema reference: 40_benchmark_artifact_reference.md.
RF-specific legacy analysis wrappers under scripts/rf/ still execute for compatibility,
but they now delegate to importable zeroproofml.benchmarks.domains.rf_* modules and emit
DeprecationWarning. Prefer the Python modules when composing RF benchmark or plotting flows.
Operating points in benchmarks¶
The DOSE domain exposes a clear operating-point trade-off:
- safety_first: prioritize low false in-range on censored (false_in_range_rate_on_censored)
- accuracy_first: prioritize low false censored on in-range (false_censored_rate_on_in_range)
- direction_aware: use the strict-gate + direction-head pattern to make bottom outputs actionable
Completed DOSE runs now back those names with a dedicated
aggregated/dose_operating_points.{json,md} report. It records the chosen
model, the run's tau_infer / tau_train values, and the aggregate
FP_cens/FN_in/direction-F1/accept data used to justify each preset. It also
records the deterministic calibration/evaluation split provenance per seed
(source split, split seed, generator seed, and sample-id reference) so the
operating-point evidence is auditable. The report evaluates an optional
balanced candidate too, but only promotes it when the balanced winner is
genuinely distinct; otherwise the benchmark keeps the three existing preset
names as the canonical public set. When experimental
fault_rate / semantic_bottom_rate splits are available, the calibration
logic also records a provenance-weighted bottom-cost term
fault_rate + 0.5 * semantic_bottom_rate, so semantic bottoms count less than
hard faults when comparing tau_infer operating points.
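The provenance-weighted bottom-cost term is simple enough to state directly. The sketch below implements fault_rate + 0.5 * semantic_bottom_rate and uses it to rank two made-up operating points; the function name, the semantic_weight parameter, and the example rates are illustrative assumptions, not values from a real run.

```python
def provenance_weighted_bottom_cost(fault_rate, semantic_bottom_rate, semantic_weight=0.5):
    """Bottom cost in which semantic bottoms count less than hard faults."""
    return fault_rate + semantic_weight * semantic_bottom_rate


# Two hypothetical tau_infer candidates with the same total bottom_rate (0.12).
mostly_semantic = provenance_weighted_bottom_cost(fault_rate=0.02, semantic_bottom_rate=0.10)
mostly_faults = provenance_weighted_bottom_cost(fault_rate=0.10, semantic_bottom_rate=0.02)

# The candidate whose bottoms are mostly semantic is the cheaper one.
cheaper = "mostly_semantic" if mostly_semantic < mostly_faults else "mostly_faults"
print(mostly_semantic, mostly_faults, cheaper)
```

With equal total bottom rates, the weighting prefers the operating point whose bottoms come mostly from semantic causes rather than hard faults.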
The same run also writes aggregated/dose_pareto_front.{json,md}, which marks
nondominated models for safety-vs-coverage, safety-vs-accuracy, and
direction-vs-regression trade-offs. Use that artifact for DOSE trade-off tables
instead of hand-curated notebook frontiers.
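For readers unfamiliar with the term, "nondominated" means no other model is at least as good on every axis and strictly better on one. The sketch below shows that test for a safety-vs-coverage style trade-off. The helper name, the model names, the numbers, and the choice of false_in_range_rate_on_censored (lower is better) against accept_rate (higher is better) as the two axes are assumptions for illustration; the artifact's actual axis definitions are in the schema reference.

```python
def nondominated(models, minimize=(), maximize=()):
    """Return names of models that no other model dominates.

    models maps a model name to a dict of metric values.
    """
    def dominates(a, b):
        no_worse = all(a[m] <= b[m] for m in minimize) and all(a[m] >= b[m] for m in maximize)
        strictly_better = any(a[m] < b[m] for m in minimize) or any(a[m] > b[m] for m in maximize)
        return no_worse and strictly_better

    return [
        name
        for name, metrics in models.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in models.items()
            if other_name != name
        )
    ]


# Hypothetical models and values, not results from a real run.
models = {
    "strict": {"false_in_range_rate_on_censored": 0.01, "accept_rate": 0.80},
    "relaxed": {"false_in_range_rate_on_censored": 0.05, "accept_rate": 0.95},
    "weak": {"false_in_range_rate_on_censored": 0.06, "accept_rate": 0.90},
}
front = nondominated(
    models,
    minimize=["false_in_range_rate_on_censored"],
    maximize=["accept_rate"],
)
print(front)
```

Here "weak" drops out because "relaxed" is both safer and has higher coverage, while "strict" and "relaxed" trade one axis against the other.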
The frozen direction-head follow-up is available as a first-class DOSE benchmark mode:
python -m zeroproofml.benchmarks dose --mode frozen-dirhead --device cpu
That mode runs the strict-SCM base model with saved checkpoints, trains the
frozen censored-direction head per seed, and writes
aggregated/dirhead_only_summary.json beside the standard benchmark artifacts.
Completed DOSE runs also emit seed_*/dose_diagnostics.json plus
aggregated/dose_diagnostics.json. These artifacts expose the per-seed and
cross-seed confusion matrices, tau_infer sweep curves, |Q| histograms,
borderline near-threshold examples, and censored-subset direction-quality
breakdowns that explain why a chosen operating point behaves the way it does.
Regenerating a report for a DOSE run with these diagnostics writes
figures/dose_figure_pack/: threshold-sweep curves, macro-F1 vs finite-MAE
trade-off plots, censored-direction confusion plots, assay-limit edge-case
tables, and provenance-split bottom histograms for backed operating points when
fault_rate / semantic_bottom_rate are available.
Regenerating a report for an IK run writes
figures/robotics_figure_pack/ when the saved seed artifacts contain the RR IK
dataset/test diagnostics: workspace heatmaps, |det(J)|-stratified
error/fallback plots, route-to-analytic-solver maps, and fallback timelines.
The experimental visualization helpers include plot_confusion_matrix(...)
and plot_categorical_reliability(...) for categorical or direction-head
diagnostics derived from those saved predictions.
Structured Benchmark Logs¶
The benchmark harness writes aggregated/benchmark_metrics.jsonl using the same
versioned metric-log schema as JsonlLogger. It contains per-seed records
(phase="benchmark_seed"), aggregate records (phase="benchmark_summary"), and
baseline-delta records (phase="benchmark_delta" when paired stats exist).
You can inspect that log with the standard report command:
python -m zeroproofml.report training-log results/benchmarks/dose/<run_dir>/aggregated/benchmark_metrics.jsonl
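For ad hoc inspection outside the report command, the log can also be read with the standard library. The sketch below groups records by phase; it assumes each JSONL record carries a top-level "phase" field, which is an assumption about the schema, and the sample lines are made up rather than taken from a real run.

```python
import json
from collections import defaultdict


def records_by_phase(lines):
    """Group JSONL records by their "phase" field (assumed to be top-level)."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        grouped[record.get("phase", "<missing>")].append(record)
    return dict(grouped)


# Made-up sample lines standing in for aggregated/benchmark_metrics.jsonl.
sample_lines = [
    '{"phase": "benchmark_seed", "step": 1, "macro_f1": 0.91}',
    '{"phase": "benchmark_seed", "step": 2, "macro_f1": 0.89}',
    "",
    '{"phase": "benchmark_summary", "macro_f1": 0.90}',
]
grouped = records_by_phase(sample_lines)
counts = {phase: len(records) for phase, records in grouped.items()}
print(counts)
```

To run it on a real log, pass an open file handle instead of sample_lines.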
When adding or debugging a benchmark domain, use the canonical logging helpers:
from zeroproofml.utils.logging import JsonlLogger, TensorBoardLogger, metric_log_record

jsonl_logger = JsonlLogger("results/benchmarks/dose/run_x/aggregated/custom_metrics.jsonl")
jsonl_logger(
    metric_log_record(
        {"macro_f1": 0.91, "bottom_rate": 0.08},
        phase="benchmark_seed",
        step=1,
        context={"domain": "dose", "model": "strict", "seed": 1},
        record_type="benchmark_seed",
    )
)

# Experimental local-dashboard mirror. Keep JSONL as the release-facing artifact.
tensorboard_logger = TensorBoardLogger("results/benchmarks/dose/run_x/tensorboard")
tensorboard_logger({"step": 1, "macro_f1": 0.91, "bottom_rate": 0.08})
To set the actual thresholds, sweep tau_infer over cached |Q| values (or a calibration set); see docs/06_inference_deployment.md.
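A threshold sweep over cached |Q| values can be as small as the sketch below, which reports the bottom rate under the strict |Q| < tau gate for each candidate threshold. The function name and the sample values are hypothetical; how to pick a threshold from the resulting curve is a deployment decision covered in the inference guide, not something this sketch decides.

```python
def bottom_rate_sweep(q_abs, taus):
    """Return (tau, bottom_rate) pairs under the strict |Q| < tau gate."""
    n = len(q_abs)
    return [(tau, sum(1 for q in q_abs if q < tau) / n) for tau in taus]


# Made-up cached |Q| magnitudes and candidate thresholds.
cached_q_abs = [0.001, 0.01, 0.1, 1.0]
sweep = bottom_rate_sweep(cached_q_abs, taus=[0.005, 0.05, 0.5])
for tau, rate in sweep:
    print(f"tau={tau:g} bottom_rate={rate:.2f} accept_rate={1 - rate:.2f}")
```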
Canonical DOSE benchmark metrics:
| Benchmark key | Definition |
|---|---|
| macro_f1 | 3-way macro-F1 over below / in-range / above classes. |
| finite_mae | Conditional finite MAE on truly in-range inputs, averaged only over accepted finite predictions. |
| false_in_range_rate_on_censored | Fraction of truly censored inputs predicted as in-range, conditional on the truly censored subset. |
| false_censored_rate_on_in_range | Fraction of truly in-range inputs predicted as censored, conditional on the truly in-range subset. |
| direction_only_macro_f1_on_censored | Optional direction-only macro-F1 over below / above on truly censored inputs. |
| accept_rate | Coverage / accept rate. In benchmark outputs this is the fraction of inputs that are not bottomed, so accept_rate = 1 - bottom_rate. |
| bottom_rate | Fraction of inputs that bottom under the strict \|Q\| < tau_infer gate. |
| fault_rate / semantic_bottom_rate | Optional stable fault/semantic provenance splits of bottom_rate, both measured over all inputs. |
| gap_rate | Optional monitor-only fraction with tau_infer <= \|Q\| < tau_train, measured over all inputs. |
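The rate definitions above translate into a few lines of counting. The sketch below is illustrative only: the function names, the label strings "below" / "in_range" / "above", and the sample data are assumptions, and it does not model how the harness labels bottomed predictions.

```python
CENSORED = {"below", "above"}  # assumed label strings


def _fraction(hits, total):
    """Return hits / total, or NaN when the conditioning subset is empty."""
    return hits / total if total else float("nan")


def gate_rates(q_abs, tau_infer, tau_train):
    """bottom_rate, accept_rate, and gap_rate, all measured over all inputs."""
    n = len(q_abs)
    bottom_rate = _fraction(sum(1 for q in q_abs if q < tau_infer), n)
    gap_rate = _fraction(sum(1 for q in q_abs if tau_infer <= q < tau_train), n)
    return {
        "bottom_rate": bottom_rate,
        "accept_rate": 1.0 - bottom_rate,
        "gap_rate": gap_rate,
    }


def censoring_error_rates(true_labels, pred_labels):
    """Both rates are conditional on the relevant truly-labelled subset."""
    pairs = list(zip(true_labels, pred_labels))
    on_censored = [pred for true, pred in pairs if true in CENSORED]
    on_in_range = [pred for true, pred in pairs if true == "in_range"]
    return {
        "false_in_range_rate_on_censored": _fraction(
            sum(pred == "in_range" for pred in on_censored), len(on_censored)
        ),
        "false_censored_rate_on_in_range": _fraction(
            sum(pred in CENSORED for pred in on_in_range), len(on_in_range)
        ),
    }


# Made-up inputs for illustration.
gate = gate_rates([0.001, 0.02, 0.2, 2.0], tau_infer=0.01, tau_train=0.1)
errors = censoring_error_rates(
    ["below", "above", "in_range", "in_range"],
    ["in_range", "above", "in_range", "below"],
)
print(gate, errors)
```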
Parity Runner (Coverage vs Accuracy)¶
Runs a small synthetic regression task and writes a coverage/accuracy curve:
python perf/parity_runner.py --output benchmark_results
Outputs:
- benchmark_results/parity_report.json
- benchmark_results/coverage_accuracy.png
Performance microbenchmark suite¶
Runs SCM microbenchmarks (arithmetic, layer forward/backward, inference-gap thresholds, etc.) and saves a timestamped JSON artifact:
python perf/run_benchmarks.py --suite all --output benchmark_results
Output: benchmark_results/benchmarks_<timestamp>.json
Performance-suite CI regression gate¶
Validates either scientific benchmark run directories or performance-suite artifact JSONs. Pass the family explicitly in CI so schema mistakes fail with a clear error instead of treating a scientific run as a microbenchmark payload:
python scripts/ci/benchmark_gate.py --family scientific results/benchmarks/dose/<run_dir>
python scripts/ci/benchmark_gate.py --family performance benchmark_results
Baseline Artifact (Optional)¶
Copies the newest performance-suite JSON in a results directory to perf/baseline.json:
python scripts/update_benchmark_baseline.py --src benchmark_results
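Conceptually, the baseline update is "find the newest suite JSON, copy it over the baseline". The sketch below shows one way to do that with the standard library. It is not the script's implementation: copy_newest_benchmark_json is a hypothetical helper, and choosing "newest" by file modification time (rather than by the timestamp in the filename) is an assumption.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_newest_benchmark_json(src_dir, dest):
    """Copy the most recently modified benchmarks_*.json to dest."""
    candidates = list(Path(src_dir).glob("benchmarks_*.json"))
    if not candidates:
        raise FileNotFoundError(f"no benchmarks_*.json files in {src_dir}")
    newest = max(candidates, key=lambda path: path.stat().st_mtime)
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(newest, dest)
    return newest


# Demo in a throwaway directory with two artifacts and explicit mtimes.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "benchmark_results"
    src.mkdir()
    for name, mtime in [("benchmarks_a.json", 1_000), ("benchmarks_b.json", 2_000)]:
        path = src / name
        path.write_text('{"name": "%s"}' % name)
        os.utime(path, (mtime, mtime))
    baseline = Path(tmp) / "perf" / "baseline.json"
    chosen = copy_newest_benchmark_json(src, baseline).name
    copied = baseline.read_text()

print(chosen)
```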