Benchmark Artifact Glossary
This page is the schema and artifact reference for current
python -m zeroproofml.benchmarks ... runs. It describes the supported
scientific benchmark run directory, not the performance microbenchmark JSONs
under perf/.
Top-Level Contract
Every completed benchmark run is a self-contained directory under
results/benchmarks/<domain>/....
| Path | Schema marker | Producer | Purpose |
|---|---|---|---|
| manifest.json | zeroproofml_benchmark_manifest, version 4 | benchmark harness | Run index, artifact paths, config hash, dataset fingerprints, checkpoint hashes, bundle hashes, and post-processed summary hashes. |
| provenance.json | zeroproofml_provenance, version 6 | benchmark harness | Git, system, Python/backend package versions, dataset fingerprints, invocation args, config hash, and resume attempt history. |
| resume_state.json | zeroproofml_benchmark_resume_state, version 2 | benchmark harness | Planned seed contract and attempt history used by --resume. |
| CLAIM_AUDIT.md | Markdown | claim audit | Human-readable pass/fail evidence for claim-gate checks. |
| RUN_REPORT.md / RUN_REPORT.html | Markdown / HTML | report layer | Human-readable summary regenerated from stored artifacts. |
manifest.json["artifacts"] stores paths relative to the run root. Treat it as
the authoritative table of contents for automation.
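As a minimal sketch of how automation might consume that table of contents, the snippet below resolves each entry against the run root. It assumes the artifacts entry is a flat JSON object mapping an artifact name to a relative path string; the real layout may be nested or otherwise different. resolve_artifacts is a hypothetical helper, not part of zeroproofml.

```python
import json
from pathlib import Path


def resolve_artifacts(run_dir):
    """Map artifact names to absolute paths under run_dir.

    Assumption: manifest.json["artifacts"] is a flat {name: relative_path}
    object. Non-string values are skipped rather than guessed at.
    """
    root = Path(run_dir).resolve()
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    resolved = {}
    for name, rel_path in manifest.get("artifacts", {}).items():
        if isinstance(rel_path, str):
            resolved[name] = root / rel_path
    return resolved
```

Resolving against the run root rather than the current working directory keeps the lookup valid when a run directory is copied or archived elsewhere.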
Per-Seed Artifacts
| Path | Schema marker | Purpose |
|---|---|---|
| seed_*/per_seed_result.json | domain-specific seed schema | Canonical raw result payload normalized by the harness. |
| seed_*/dose_diagnostics.json | zeroproofml_benchmark_dose_diagnostics | DOSE per-seed confusion matrices, threshold sweeps, denominator histograms, borderline examples, and direction diagnostics. |
| seed_*/rf_signal_traces.json | RF trace sidecar | Saved response traces used to regenerate RF frequency-response figures. |
| seed_*/rf_frequency_response.svg | SVG | Per-seed Bode-style RF response plot from the saved trace sidecar. |
| seed_*/rf_qualitative_figure_pack/ | figure-pack manifest | Targeted RF washout / invented-peak examples selected from saved traces. |
Current canonical per-seed schema markers are:
| Domain | Schema marker |
|---|---|
| DOSE | zeroproofml_benchmark_dose_seed_result |
| RF | zeroproofml_benchmark_rf_seed_result |
| IK | zeroproofml_benchmark_ik_seed_result |
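The markers in this table can be kept as a small lookup when a script needs to tell current seed payloads from older ones. This is only a sketch: the page does not say which JSON field holds the marker, so the key name "schema" below is a placeholder, and has_current_seed_marker is a hypothetical helper.

```python
# Per-seed schema markers, copied from the table above.
SEED_SCHEMA_MARKERS = {
    "DOSE": "zeroproofml_benchmark_dose_seed_result",
    "RF": "zeroproofml_benchmark_rf_seed_result",
    "IK": "zeroproofml_benchmark_ik_seed_result",
}


def has_current_seed_marker(payload, domain, marker_key="schema"):
    """Return True if payload carries the current marker for domain.

    Assumption: the marker lives under a top-level key. "schema" is a
    placeholder name and may not match the real field.
    """
    return payload.get(marker_key) == SEED_SCHEMA_MARKERS[domain.upper()]
```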
Aggregated Artifacts
| Path | Schema marker | Purpose |
|---|---|---|
| aggregated/summary.json | zeroproofml_benchmark_summary, version 1 | Cross-seed metrics, metric definitions, per-seed metric table, and model summaries. |
| aggregated/summary.md | Markdown | Human-readable rendering of summary.json. |
| aggregated/paired_stats.json | zeroproofml_paired_stats, version 1 | Per-model seed distributions and optional current-minus-baseline paired deltas. |
| aggregated/paired_stats.md | Markdown | Human-readable rendering of paired_stats.json. |
| aggregated/benchmark_metrics.jsonl | zeroproofml.metric_log, version 1 | Versioned JSONL metric log emitted by the benchmark harness through JsonlLogger. |
| aggregated/dose_operating_points.json | DOSE operating-point schema | Backing evidence for safety-first, accuracy-first, and direction-aware presets. |
| aggregated/dose_pareto_front.json | DOSE Pareto-front schema | Nondominated safety/coverage, safety/accuracy, and direction/regression frontiers. |
| aggregated/dose_diagnostics.json | zeroproofml_benchmark_dose_diagnostics | Aggregated DOSE diagnostics used for figure regeneration. |
| aggregated/dirhead_only_summary.json | frozen direction-head summary | DOSE --mode frozen-dirhead follow-up summary. |
aggregated/benchmark_metrics.jsonl uses these phases:
| Phase | Record type | Meaning |
|---|---|---|
| benchmark_seed | benchmark_seed | One record per seed/model with finite numeric per-seed metrics. |
| benchmark_summary | benchmark_summary | One record per model with aggregate metric means. |
| benchmark_delta | benchmark_delta | One record per model with paired baseline mean deltas when a baseline is present. |
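A reader of the metric log will usually want the records split by phase. The sketch below does that with the standard library only. It assumes each line is a JSON object carrying a top-level "phase" key; the actual field name is not documented on this page. group_records_by_phase is a hypothetical helper.

```python
import json
from collections import defaultdict


def group_records_by_phase(lines):
    """Group parsed JSONL records by their phase value.

    Assumption: each record is a JSON object with a top-level "phase" key.
    Records without one are collected under None.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        grouped[record.get("phase")].append(record)
    return dict(grouped)
```

Any iterable of lines works, so an open file handle on aggregated/benchmark_metrics.jsonl can be passed directly.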
Figure Artifacts
| Path | Source artifact | Purpose |
|---|---|---|
| figures/metric_summary.svg | aggregated/summary.json | Aggregate metric means by model. |
| figures/per_seed_metric_distributions.svg | aggregated/summary.json | Box/scatter-style per-seed metric distributions. |
| figures/baseline_delta.svg | aggregated/paired_stats.json | Current-minus-baseline mean deltas by model/metric. |
| figures/dose_figure_pack/ | DOSE diagnostics and operating points | Threshold sweeps, trade-offs, confusion matrices, edge cases, and provenance-split bottom histograms. |
| figures/robotics_figure_pack/ | IK seed artifacts and trajectory summaries | Workspace heatmaps, det(J)-stratified plots, solver-route maps, and fallback timelines. |
| composability_figure_pack/ | downstream pipeline simulator output | Stage transition diagrams, failure propagation diagrams, corruption sensitivity curves, and flat per-stage summaries. |
Validation Rules
Use zeroproofml.benchmarks.validate_run_dir(...) or
python scripts/ci/benchmark_gate.py --family scientific <run_dir> before
publishing or comparing a run. Validation checks the manifest/provenance/summary
schema markers, required files, and recorded artifact hashes. Older benchmark
artifacts without the current schema markers fail fast instead of being silently
mixed with current scientific claim runs.
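Before invoking the real validator, a script may want a cheap presence check so that an obviously incomplete directory is reported early. The sketch below is not a substitute for validate_run_dir or the CI gate: it inspects neither schema markers nor hashes, and the set of files it looks for is an assumption drawn from the manifest/provenance/summary checks named above. missing_expected_files is a hypothetical helper.

```python
from pathlib import Path

# Files whose schema markers the validation step is described as checking.
# Whether the real validator requires exactly this set is an assumption.
EXPECTED_FILES = (
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
)


def missing_expected_files(run_dir):
    """Return the expected files that are absent from run_dir."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list only means the files exist; the run still needs the full validation before it is published or compared.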