Benchmarks

This page summarizes the two benchmark families in the SCM stack.

Use these names consistently:

- Scientific claim benchmarks: invoked with python -m zeroproofml.benchmarks ..., run under GitLab CI jobs whose names are prefixed with claim-benchmark:*, and write auditable run directories to results/benchmarks/<domain>/....
- Performance microbenchmark suite: invoked with python perf/run_benchmarks.py / python perf/parity_runner.py, and writes JSON/plot artifacts to benchmark_results/.

Scientific claim benchmarks (DOSE / RF / IK)

Run the shipped, auditable benchmark domains via the first-class harness:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu

If a paper-mode run is interrupted, resume it against the same output directory:

python -m zeroproofml.benchmarks dose --mode paper --device cpu \
  --out-root results/benchmarks/dose/<run_dir> --resume

Use --skip-complete-seeds when you want to reuse an existing output directory and run only seeds that do not yet have canonical per_seed_result.json artifacts. Use --force-rerun to rerun the requested seeds in an existing --out-root.
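To make the --skip-complete-seeds rule concrete, here is a minimal sketch of the selection it describes: a seed counts as complete when its seed_{n}/per_seed_result.json file exists. The helper name pending_seeds is hypothetical and is not part of the zeroproofml API; the harness may apply additional checks (for example schema validation) that this sketch omits.

```python
import json
import tempfile
from pathlib import Path


def pending_seeds(out_root, requested_seeds):
    """Return the requested seeds that lack a seed_<n>/per_seed_result.json file.

    Hypothetical helper for illustration only; not the harness implementation.
    """
    root = Path(out_root)
    return [
        seed
        for seed in requested_seeds
        if not (root / f"seed_{seed}" / "per_seed_result.json").is_file()
    ]


# Demo on a throwaway directory: seed 0 is complete, seeds 1 and 2 are not.
with tempfile.TemporaryDirectory() as tmp:
    done = Path(tmp) / "seed_0"
    done.mkdir()
    (done / "per_seed_result.json").write_text(json.dumps({"seed": 0}))
    (Path(tmp) / "seed_1").mkdir()  # directory exists but has no result file
    remaining = pending_seeds(tmp, [0, 1, 2])

print(remaining)
```

Under this reading, --force-rerun would simply skip the filter and run every requested seed.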

Add --html-report when you also want a browser-friendly RUN_REPORT.html beside the default Markdown run report.

Regenerate the report later from the run artifacts without rerunning the benchmark:

python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report

This refreshes RUN_REPORT.md, writes figures/metric_summary.svg plus figures/per_seed_metric_distributions.svg, and embeds those SVGs in RUN_REPORT.html when --html-report is set. If paired baseline stats are present or recomputed, it also writes figures/baseline_delta.svg from aggregated/paired_stats.json. Regenerated reports use artifact-relative paths and refresh aggregated/summary.md from aggregated/summary.json, so copying a run directory does not change the report content. When fault_rate / semantic_bottom_rate provenance metrics are present, the standard report adds a fault-vs-semantic bottom breakdown.

Use --baseline-run-dir <previous_run_dir> to recompute paired baseline comparisons from another completed run. The same report CLI also accepts deployment bundle directories (python -m zeroproofml.report bundle <bundle_dir>) and JSONL training logs (python -m zeroproofml.report training-log <metrics.jsonl>), writing VALIDATION_REPORT.summary.svg or <stem>_REPORT_metrics.svg beside those reports. If Plotly is installed via zeroproofml[interactive], the training-log path also writes <stem>_REPORT.html with interactive metric traces. Bundle reports also discover parent benchmark summaries and DOSE operating-point calibration sidecars when the bundle lives inside a benchmark artifact directory.

Reproduction landing page: 21_experiments.md.

Each run writes a self-contained directory under results/benchmarks/<domain>/. Smoke and paper modes both use seed_{n} subdirectories so downstream tooling sees one layout:

- seed_*/per_seed_result.json (versioned per-seed raw schema; legacy domain-script filenames are normalized into this path by the harness)
- seed_*/rf_signal_traces.json (RF only; representative per-seed frequency-response traces saved for figure regeneration)
- seed_*/rf_frequency_response.svg (RF only; first-class Bode-style magnitude/phase figure regenerated from the saved trace sidecar, including peak annotations, strict-trigger bands, and denominator-minimum guides)
- seed_*/rf_qualitative_figure_pack/ (RF only; targeted washout / invented-peak SVGs plus a small manifest/README for saved baseline failure cases)
- seed_*/dose_diagnostics.json (DOSE only; per-seed confusion matrices, threshold sweeps, |Q| histograms, borderline examples, and censored-subset direction diagnostics)
- manifest.json (versioned schema)
- provenance.json (versioned schema)
- resume_state.json (paper-mode resume state + attempt history cache)
- RUN_REPORT.md (RUN_REPORT.html too when --html-report is set)
- aggregated/summary.json + aggregated/summary.md
- aggregated/benchmark_metrics.jsonl (versioned JSONL metric log emitted by the benchmark harness)
- aggregated/dose_operating_points.json + aggregated/dose_operating_points.md (DOSE only)
- aggregated/dose_pareto_front.json + aggregated/dose_pareto_front.md (DOSE only)
- aggregated/dose_diagnostics.json (DOSE only; aggregated confusion matrices, sweep curves, histograms, borderline examples, and direction-quality summaries)
- aggregated/dirhead_only_summary.json (DOSE --mode frozen-dirhead only)
- aggregated/paired_stats.json + aggregated/paired_stats.md
- CLAIM_AUDIT.md
- figures/metric_summary.svg
- figures/per_seed_metric_distributions.svg
- figures/baseline_delta.svg (only when paired baseline deltas are present)

provenance.json records per-seed dataset fingerprints for both generated configs and file-backed datasets, and manifest.json mirrors those fingerprints for quick auditing.

The RF synthetic-resonator path stamps explicit dataset_name, dataset_version, and dataset_generator fields into data_config, so RF run fingerprints freeze the dataset-generation contract instead of relying only on the benchmark suite name. The same data_config block carries the canonical RF split contract: train and validation both sample moderate-Q filters with Q <= q_max_train, test is the full held-out split over Q <= q_max_test, and extrapolation is the standardized high-Q subset of test where Q > q_max_train.

RF per-seed results also stamp artifact_naming, so checkpoint filenames and the legacy-vs-canonical seed-result JSON names are machine-readable instead of re-derived ad hoc. That metadata also points at the rf_signal_traces.json sidecar and the generated rf_frequency_response.svg plot artifact used for figure regeneration and debugging. The RF figure overlays peak annotations, model-specific strict-trigger regions, and shared-denominator minima directly on the saved response traces, and each seed also writes an rf_qualitative_figure_pack/ directory with targeted washout / invented-peak examples selected from the saved trace set. The RF per-seed schema models the runner's model_meta, per-model train_log, and wall_time_s fields directly, while keeping the shared outer protocol / data_config / train_config / runs layout aligned with the other benchmark domains.

RF summaries use a canonical metric set for the resonator benchmark: peak_retention_yield, false_peak_hallucination_rate, in_band_mse, extrapolation_mse, coverage_rate, strict_trigger_rate, and denominator_min_abs_* sweep statistics. The runner still records the older peak_success_rate / success_rate keys for compatibility, but post-processed benchmark summaries normalize onto the standardized names. The per-seed test_peak_clipping payload also carries shared-denominator axis diagnostics for models that expose Q(jw): the minimum-frequency offset relative to w0, the fraction of the sampled frequency axis that stays near the minimum, and how often the minimum falls on a sweep edge.

Both provenance.json and manifest.json also record OS / CPU / GPU model details, Python and backend package versions (torch, onnx, onnxruntime, plus installed optional backends such as jax / jaxlib), SHA256 hashes for saved checkpoints, discovered bundle directories, the post-processed benchmark summary files, and whether the git worktree was already dirty when the run began. Resumed paper-mode runs keep the original startup metadata in resume_state.json and embed the full attempt history under provenance.json["resume"].
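The RF split contract described above can be illustrated with a small sketch. The numeric limits and the uniform sampling below are assumptions chosen for the example; only the inequalities (train/validation at Q <= q_max_train, test at Q <= q_max_test, extrapolation as the Q > q_max_train subset of test) come from the text.

```python
import random

# Hypothetical limits for illustration; the shipped data_config values may differ.
Q_MAX_TRAIN = 10.0
Q_MAX_TEST = 50.0

rng = random.Random(0)

# Train and validation both sample moderate-Q filters (Q <= q_max_train).
train_q = [rng.uniform(1.0, Q_MAX_TRAIN) for _ in range(100)]
val_q = [rng.uniform(1.0, Q_MAX_TRAIN) for _ in range(20)]

# Test is the full held-out split over Q <= q_max_test.
test_q = [rng.uniform(1.0, Q_MAX_TEST) for _ in range(50)]

# Extrapolation is not a separate sample: it is the high-Q subset of test.
extrapolation_q = [q for q in test_q if q > Q_MAX_TRAIN]

print(len(test_q), len(extrapolation_q))
```

Because extrapolation is derived from test rather than sampled separately, extrapolation metrics are always computed on a subset of the same held-out items that the test metrics cover.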

You can verify a completed run directory from Python:

from zeroproofml.benchmarks import compare_benchmark_runs, load_benchmark_run, validate_run_dir

manifest = validate_run_dir("results/benchmarks/dose/run_20260407_150800_abcd123")
print(manifest["schema_version"])

run = load_benchmark_run("results/benchmarks/dose/run_20260407_150800_abcd123")
comparison = compare_benchmark_runs(
    run,
    ["results/benchmarks/dose/run_20260401_120000_baseline"],
)
print(comparison.baselines[0].diffs)

Older benchmark artifacts without the current schema markers, or with an older schema version, now fail fast with a compatibility error. Migration loaders are intentionally deferred until there are external consumers that need them.
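The fail-fast behaviour can be pictured with the following sketch. It is not the validate_run_dir implementation: the version constant, the error class, and both helper names are hypothetical, and the sketch treats any version mismatch as incompatible. Only the schema_version manifest key is taken from the example above.

```python
CURRENT_SCHEMA_VERSION = 2  # hypothetical value for illustration


class IncompatibleArtifactError(ValueError):
    """Hypothetical error type for artifacts this code cannot read."""


def require_current_schema(manifest):
    """Return the schema version, or raise if the marker is missing or mismatched."""
    version = manifest.get("schema_version")
    if version is None:
        raise IncompatibleArtifactError("manifest has no schema_version marker")
    if version != CURRENT_SCHEMA_VERSION:
        raise IncompatibleArtifactError(
            f"schema_version {version!r} is not supported; "
            f"expected {CURRENT_SCHEMA_VERSION!r}"
        )
    return version


def is_compatible(manifest):
    """Convenience wrapper that reports compatibility as a boolean."""
    try:
        require_current_schema(manifest)
    except IncompatibleArtifactError:
        return False
    return True


for candidate in ({"schema_version": 2}, {"schema_version": 1}, {}):
    print(candidate, is_compatible(candidate))
```

Raising at load time, rather than attempting a best-effort read, keeps a stale artifact from silently producing a misleading comparison.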

Artifact glossary and schema reference: 40_benchmark_artifact_reference.md.

RF-specific legacy analysis wrappers under scripts/rf/ still execute for compatibility, but they now delegate to importable zeroproofml.benchmarks.domains.rf_* modules and emit DeprecationWarning. Prefer the Python modules when composing RF benchmark or plotting flows.
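The delegate-and-warn pattern those wrappers follow looks roughly like the sketch below. Both function names are stand-ins invented for this example; they are not the actual scripts/rf/ entry points or zeroproofml.benchmarks.domains.rf_* functions.

```python
import warnings


def plot_rf_response(trace):
    """Stand-in for an importable module function (hypothetical name)."""
    return f"plotted {len(trace)} points"


def legacy_plot_rf_response(trace):
    """Stand-in for a legacy wrapper: warn, then delegate unchanged."""
    warnings.warn(
        "legacy RF wrapper is deprecated; call the importable module instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the wrapper
    )
    return plot_rf_response(trace)


# Python hides DeprecationWarning in many contexts, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_plot_rf_response([0.1, 0.2, 0.3])

print(result, [w.category.__name__ for w in caught])
```

Running legacy scripts with python -W error::DeprecationWarning is one way to find remaining callers before the wrappers are removed.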

Operating points in benchmarks

The DOSE domain exposes a clear operating-point trade-off:

- safety_first: prioritize low false in-range on censored (false_in_range_rate_on_censored)
- accuracy_first: prioritize low false censored on in-range (false_censored_rate_on_in_range)
- direction_aware: use the strict-gate + direction-head pattern to make bottom outputs actionable

Completed DOSE runs now back those names with a dedicated aggregated/dose_operating_points.{json,md} report. It records the chosen model, the run's tau_infer / tau_train values, and the aggregate FP_cens/FN_in/direction-F1/accept data used to justify each preset. It also records the deterministic calibration/evaluation split provenance per seed (source split, split seed, generator seed, and sample-id reference) so the operating-point evidence is auditable. The report evaluates an optional balanced candidate too, but only promotes it when the balanced winner is genuinely distinct; otherwise the benchmark keeps the three existing preset names as the canonical public set. When experimental fault_rate / semantic_bottom_rate splits are available, the calibration logic also records a provenance-weighted bottom-cost term fault_rate + 0.5 * semantic_bottom_rate, so semantic bottoms count less than hard faults when comparing tau_infer operating points. The same run also writes aggregated/dose_pareto_front.{json,md}, which marks nondominated models for safety-vs-coverage, safety-vs-accuracy, and direction-vs-regression trade-offs. Use that artifact for DOSE trade-off tables instead of hand-curated notebook frontiers.
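Two pieces of the paragraph above lend themselves to a short sketch: the provenance-weighted bottom-cost term and the notion of a nondominated model. The cost formula is taken from the text; the function names, model names, and metric values are invented for illustration, and the actual dose_pareto_front computation may differ in detail.

```python
def bottom_cost(fault_rate, semantic_bottom_rate):
    """Provenance-weighted bottom cost: semantic bottoms count half as much."""
    return fault_rate + 0.5 * semantic_bottom_rate


def dominates(a, b):
    """True if objective tuple a is no worse than b everywhere and better somewhere.

    All objectives are expressed so that lower is better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points):
    """Return the names whose objective tuples no other entry dominates."""
    return {
        name
        for name, objectives in points.items()
        if not any(
            dominates(other_objectives, objectives)
            for other, other_objectives in points.items()
            if other != name
        )
    }


# Safety-vs-coverage example: (false_in_range_rate_on_censored, bottom_rate).
# Coverage is maximized, so it is expressed through bottom_rate to minimize.
models = {
    "strict": (0.01, 0.20),    # hypothetical numbers
    "relaxed": (0.05, 0.08),
    "baseline": (0.06, 0.25),
}

print(sorted(nondominated(models)))
print(bottom_cost(fault_rate=0.02, semantic_bottom_rate=0.10))
```

In this toy example "baseline" is dominated because "strict" is at least as good on both axes, while "strict" and "relaxed" each win on one axis and therefore both stay on the front.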

The frozen direction-head follow-up is available as a first-class DOSE benchmark mode:

python -m zeroproofml.benchmarks dose --mode frozen-dirhead --device cpu

That mode runs the strict-SCM base model with saved checkpoints, trains the frozen censored-direction head per seed, and writes aggregated/dirhead_only_summary.json beside the standard benchmark artifacts.

Completed DOSE runs also emit seed_*/dose_diagnostics.json plus aggregated/dose_diagnostics.json. These artifacts expose the per-seed and cross-seed confusion matrices, tau_infer sweep curves, |Q| histograms, borderline near-threshold examples, and censored-subset direction-quality breakdowns that explain why a chosen operating point behaves the way it does. Regenerating a report for a DOSE run with these diagnostics writes figures/dose_figure_pack/: threshold-sweep curves, macro-F1 vs finite-MAE trade-off plots, censored-direction confusion plots, assay-limit edge-case tables, and provenance-split bottom histograms for backed operating points when fault_rate / semantic_bottom_rate are available. Regenerating a report for an IK run writes figures/robotics_figure_pack/ when the saved seed artifacts contain the RR IK dataset/test diagnostics: workspace heatmaps, |det(J)|-stratified error/fallback plots, route-to-analytic-solver maps, and fallback timelines. The experimental visualization helpers include plot_confusion_matrix(...) and plot_categorical_reliability(...) for categorical or direction-head diagnostics derived from those saved predictions.

Structured Benchmark Logs

The benchmark harness writes aggregated/benchmark_metrics.jsonl using the same versioned metric-log schema as JsonlLogger. It contains per-seed records (phase="benchmark_seed"), aggregate records (phase="benchmark_summary"), and baseline-delta records (phase="benchmark_delta" when paired stats exist).
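For a quick look at the record mix without the report CLI, the log can be read with the standard library alone. This sketch assumes each line is a JSON object with a top-level "phase" key, which matches the phase values listed above but is not a documented guarantee of the schema; the sample records are made up.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_phases(path):
    """Count JSONL records per phase, skipping blank lines."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            counts[record.get("phase", "<missing>")] += 1
    return counts


# Demo with made-up records; real logs carry more fields per record.
sample = [
    {"phase": "benchmark_seed", "step": 1},
    {"phase": "benchmark_seed", "step": 2},
    {"phase": "benchmark_summary", "step": 0},
]
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "benchmark_metrics.jsonl"
    log_path.write_text("\n".join(json.dumps(r) for r in sample) + "\n")
    phase_counts = count_phases(log_path)

print(dict(phase_counts))
```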

You can inspect that log with the standard report command:

python -m zeroproofml.report training-log results/benchmarks/dose/<run_dir>/aggregated/benchmark_metrics.jsonl

When adding or debugging a benchmark domain, use the canonical logging helpers:

from zeroproofml.utils.logging import JsonlLogger, TensorBoardLogger, metric_log_record

jsonl_logger = JsonlLogger("results/benchmarks/dose/run_x/aggregated/custom_metrics.jsonl")
jsonl_logger(
    metric_log_record(
        {"macro_f1": 0.91, "bottom_rate": 0.08},
        phase="benchmark_seed",
        step=1,
        context={"domain": "dose", "model": "strict", "seed": 1},
        record_type="benchmark_seed",
    )
)

# Experimental local-dashboard mirror. Keep JSONL as the release-facing artifact.
tensorboard_logger = TensorBoardLogger("results/benchmarks/dose/run_x/tensorboard")
tensorboard_logger({"step": 1, "macro_f1": 0.91, "bottom_rate": 0.08})

Use a τ_infer sweep on cached |Q| (or a calibration set) to set the actual thresholds; see docs/06_inference_deployment.md.
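A threshold sweep of this kind can be sketched in a few lines. The function name, the cached |Q| values, and the candidate thresholds are illustrative assumptions; the bottoming rule (|Q| < tau_infer) and accept_rate = 1 - bottom_rate follow the metric definitions on this page.

```python
def sweep_tau_infer(abs_q, taus):
    """For each candidate tau_infer, report bottom and accept rates on cached |Q|."""
    if not abs_q:
        raise ValueError("abs_q must contain at least one value")
    n = len(abs_q)
    rows = []
    for tau in taus:
        # An input bottoms when it falls under the strict |Q| < tau_infer gate.
        bottom_rate = sum(1 for q in abs_q if q < tau) / n
        rows.append(
            {
                "tau_infer": tau,
                "bottom_rate": bottom_rate,
                "accept_rate": 1.0 - bottom_rate,
            }
        )
    return rows


# Hypothetical cached |Q| values from a calibration set.
cached_abs_q = [0.001, 0.02, 0.3, 0.5, 0.9, 1.2, 2.0, 0.04]
sweep = sweep_tau_infer(cached_abs_q, taus=[0.01, 0.05, 0.1])

for row in sweep:
    print(row)
```

Because |Q| is cached, the sweep needs no further model evaluations, so many candidate thresholds can be compared cheaply before one is fixed for deployment.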

Canonical DOSE benchmark metrics (benchmark key, then definition):

- macro_f1: 3-way macro-F1 over below / in-range / above classes.
- finite_mae: Conditional finite MAE on truly in-range inputs, averaged only over accepted finite predictions.
- false_in_range_rate_on_censored: Fraction of truly censored inputs predicted as in-range, conditional on the truly censored subset.
- false_censored_rate_on_in_range: Fraction of truly in-range inputs predicted as censored, conditional on the truly in-range subset.
- direction_only_macro_f1_on_censored: Optional direction-only macro-F1 over below / above on truly censored inputs.
- accept_rate: Coverage / accept rate. In benchmark outputs this is the fraction of inputs that are not bottomed, so accept_rate = 1 - bottom_rate.
- bottom_rate: Fraction of inputs that bottom under the strict |Q| < tau_infer gate.
- fault_rate / semantic_bottom_rate: Optional stable fault/semantic provenance splits of bottom_rate, both measured over all inputs.
- gap_rate: Optional monitor-only fraction with tau_infer <= |Q| < tau_train, measured over all inputs.
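The conditional rates and gap_rate defined above can be computed as in the following sketch. The string class labels, function names, and sample data are assumptions made for the example and do not reflect the harness's internal encoding.

```python
CENSORED = {"below", "above"}  # hypothetical label encoding


def _rate(numerator, denominator):
    """Return numerator/denominator, or None when the subset is empty."""
    return numerator / denominator if denominator else None


def conditional_error_rates(y_true, y_pred):
    """Compute the two conditional misclassification rates from class labels."""
    pairs = list(zip(y_true, y_pred))
    censored = [pred for true, pred in pairs if true in CENSORED]
    in_range = [pred for true, pred in pairs if true == "in_range"]
    return {
        "false_in_range_rate_on_censored": _rate(
            sum(pred == "in_range" for pred in censored), len(censored)
        ),
        "false_censored_rate_on_in_range": _rate(
            sum(pred in CENSORED for pred in in_range), len(in_range)
        ),
    }


def gap_rate(abs_q, tau_infer, tau_train):
    """Fraction of all inputs with tau_infer <= |Q| < tau_train."""
    return _rate(sum(tau_infer <= q < tau_train for q in abs_q), len(abs_q))


y_true = ["below", "below", "above", "above",
          "in_range", "in_range", "in_range", "in_range"]
y_pred = ["below", "in_range", "above", "above",
          "in_range", "above", "in_range", "in_range"]
rates = conditional_error_rates(y_true, y_pred)
gap = gap_rate([0.001, 0.02, 0.3, 0.5, 0.9, 1.2, 2.0, 0.04],
               tau_infer=0.01, tau_train=0.05)

print(rates, gap)
```

Note the different denominators: the two conditional rates divide by the size of their own true-class subset, while gap_rate divides by all inputs.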

Parity Runner (Coverage vs Accuracy)

Runs a small synthetic regression task and writes a coverage/accuracy curve:

python perf/parity_runner.py --output benchmark_results

Outputs:

- benchmark_results/parity_report.json
- benchmark_results/coverage_accuracy.png

Performance microbenchmark suite

Runs SCM microbenchmarks (arithmetic, layer forward/backward, inference-gap thresholds, etc.) and saves a timestamped JSON artifact:

python perf/run_benchmarks.py --suite all --output benchmark_results

Output: benchmark_results/benchmarks_<timestamp>.json

Performance-suite CI regression gate

Validates either scientific benchmark run directories or performance-suite artifact JSONs. Pass the family explicitly in CI so schema mistakes fail with a clear error instead of treating a scientific run as a microbenchmark payload:

python scripts/ci/benchmark_gate.py --family scientific results/benchmarks/dose/<run_dir>
python scripts/ci/benchmark_gate.py --family performance benchmark_results

Baseline Artifact (Optional)

Copies the newest performance-suite JSON in a results directory to perf/baseline.json:

python scripts/update_benchmark_baseline.py --src benchmark_results
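As a rough picture of what "copy the newest JSON" involves, here is a sketch that selects by file modification time. This is an assumption: the actual script may instead order by the timestamp embedded in the filename or apply extra validation. The function names and the demo filenames are hypothetical; only the benchmarks_<timestamp>.json pattern comes from this page.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def newest_benchmark_json(src_dir):
    """Return the most recently modified benchmarks_*.json file in src_dir."""
    candidates = sorted(
        Path(src_dir).glob("benchmarks_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no benchmarks_*.json files in {src_dir}")
    return candidates[-1]


def update_baseline(src_dir, baseline_path):
    """Copy the newest performance-suite JSON to baseline_path."""
    newest = newest_benchmark_json(src_dir)
    Path(baseline_path).parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(newest, baseline_path)
    return newest


# Demo in a throwaway directory with explicitly set modification times.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "benchmark_results"
    src.mkdir()
    old = src / "benchmarks_20260101_000000.json"
    new = src / "benchmarks_20260102_000000.json"
    old.write_text(json.dumps({"run": "old"}))
    new.write_text(json.dumps({"run": "new"}))
    os.utime(old, (1_000_000_000, 1_000_000_000))
    os.utime(new, (1_000_000_100, 1_000_000_100))

    baseline = Path(tmp) / "perf" / "baseline.json"
    chosen_name = update_baseline(src, baseline).name
    baseline_payload = json.loads(baseline.read_text())

print(chosen_name, baseline_payload)
```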