Benchmark Artifact Glossary

This page is the schema and artifact reference for current `python -m zeroproofml.benchmarks ...` runs. It describes the supported scientific benchmark run directory, not the performance microbenchmark JSON files under `perf/`.

Top-Level Contract

Every completed benchmark run is a self-contained directory under `results/benchmarks/<domain>/...`.

Path	Schema marker	Producer	Purpose
`manifest.json`	`zeroproofml_benchmark_manifest`, version 4	benchmark harness	Run index, artifact paths, config hash, dataset fingerprints, checkpoint hashes, bundle hashes, and post-processed summary hashes.
`provenance.json`	`zeroproofml_provenance`, version 6	benchmark harness	Git, system, Python/backend package versions, dataset fingerprints, invocation args, config hash, and resume attempt history.
`resume_state.json`	`zeroproofml_benchmark_resume_state`, version 2	benchmark harness	Planned seed contract and attempt history used by `--resume`.
`CLAIM_AUDIT.md`	Markdown	claim audit	Human-readable pass/fail evidence for claim-gate checks.
`RUN_REPORT.md` / `RUN_REPORT.html`	Markdown / HTML	report layer	Human-readable summary regenerated from stored artifacts.

`manifest.json["artifacts"]` stores paths relative to the run root. Treat it as the authoritative table of contents for automation.
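As an illustration of using the manifest as a table of contents, the sketch below resolves every listed artifact against the run root. The internal shape of the `artifacts` value is not specified on this page, so the helper simply walks whatever nesting it finds and assumes that every string under `artifacts` is a relative path. The function names `iter_artifact_paths` and `resolve_artifacts` are hypothetical and not part of `zeroproofml`.

```python
import json
from pathlib import Path


def iter_artifact_paths(node):
    """Yield every string found anywhere inside a nested JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_artifact_paths(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_artifact_paths(value)


def resolve_artifacts(run_root):
    """Return sorted paths for every artifact listed in manifest.json.

    Assumption: every string under manifest["artifacts"] is a path
    relative to the run root. Adjust if the real layout differs.
    """
    run_root = Path(run_root)
    manifest = json.loads((run_root / "manifest.json").read_text(encoding="utf-8"))
    # Paths are stored relative to the run root, so join them back on.
    return sorted(run_root / rel for rel in iter_artifact_paths(manifest["artifacts"]))
```

A caller could then report any entry that is missing on disk, for example with `[p for p in resolve_artifacts(run_root) if not p.exists()]`.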

Per-Seed Artifacts

Path	Schema marker	Purpose
`seed_*/per_seed_result.json`	domain-specific seed schema	Canonical raw result payload normalized by the harness.
`seed_*/dose_diagnostics.json`	`zeroproofml_benchmark_dose_diagnostics`	DOSE per-seed confusion matrices, threshold sweeps, denominator histograms, borderline examples, and direction diagnostics.
`seed_*/rf_signal_traces.json`	RF trace sidecar	Saved response traces used to regenerate RF frequency-response figures.
`seed_*/rf_frequency_response.svg`	SVG	Per-seed Bode-style RF response plot from the saved trace sidecar.
`seed_*/rf_qualitative_figure_pack/`	figure-pack manifest	Targeted RF washout / invented-peak examples selected from saved traces.
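The per-seed layout above can be inspected with a short directory walk. This is a minimal sketch that uses only the file names from the table. It treats `per_seed_result.json` as the marker of a seed directory and reports which of the domain-specific sidecars are present. `inventory_seeds` is a hypothetical helper, not a `zeroproofml` API.

```python
from pathlib import Path

# Sidecar names taken from the per-seed table; which ones exist depends on the domain.
SEED_SIDECARS = (
    "dose_diagnostics.json",
    "rf_signal_traces.json",
    "rf_frequency_response.svg",
    "rf_qualitative_figure_pack",
)


def inventory_seeds(run_root):
    """Map each seed directory name to the sidecars found inside it.

    Only directories that contain per_seed_result.json are listed.
    """
    inventory = {}
    for result in sorted(Path(run_root).glob("seed_*/per_seed_result.json")):
        seed_dir = result.parent
        inventory[seed_dir.name] = [
            name for name in SEED_SIDECARS if (seed_dir / name).exists()
        ]
    return inventory
```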

Current canonical per-seed schema markers are:

Domain	Schema marker
DOSE	`zeroproofml_benchmark_dose_seed_result`
RF	`zeroproofml_benchmark_rf_seed_result`
IK	`zeroproofml_benchmark_ik_seed_result`
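For automation that needs to route a seed payload by domain, the marker table can be expressed as a lookup. The sketch below is only an illustration: the marker strings come from the table above, while the helper name `domain_for_marker` is hypothetical. This page does not say which JSON key carries the marker, so the helper takes the marker string directly and leaves reading it from the payload to the caller.

```python
# Canonical per-seed schema markers, copied from the table above.
SEED_SCHEMA_MARKERS = {
    "DOSE": "zeroproofml_benchmark_dose_seed_result",
    "RF": "zeroproofml_benchmark_rf_seed_result",
    "IK": "zeroproofml_benchmark_ik_seed_result",
}


def domain_for_marker(marker):
    """Return the domain whose canonical marker equals `marker`, else None."""
    for domain, expected in SEED_SCHEMA_MARKERS.items():
        if marker == expected:
            return domain
    return None
```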

Aggregated Artifacts

Path	Schema marker	Purpose
`aggregated/summary.json`	`zeroproofml_benchmark_summary`, version 1	Cross-seed metrics, metric definitions, per-seed metric table, and model summaries.
`aggregated/summary.md`	Markdown	Human-readable rendering of `summary.json`.
`aggregated/paired_stats.json`	`zeroproofml_paired_stats`, version 1	Per-model seed distributions and optional current-minus-baseline paired deltas.
`aggregated/paired_stats.md`	Markdown	Human-readable rendering of `paired_stats.json`.
`aggregated/benchmark_metrics.jsonl`	`zeroproofml.metric_log`, version 1	Versioned JSONL metric log emitted by the benchmark harness through `JsonlLogger`.
`aggregated/dose_operating_points.json`	DOSE operating-point schema	Backing evidence for safety-first, accuracy-first, and direction-aware presets.
`aggregated/dose_pareto_front.json`	DOSE Pareto-front schema	Nondominated safety/coverage, safety/accuracy, and direction/regression frontiers.
`aggregated/dose_diagnostics.json`	`zeroproofml_benchmark_dose_diagnostics`	Aggregated DOSE diagnostics used for figure regeneration.
`aggregated/dirhead_only_summary.json`	frozen direction-head summary	DOSE `--mode frozen-dirhead` follow-up summary.

`aggregated/benchmark_metrics.jsonl` uses these phases:

Phase	Record type	Meaning
`benchmark_seed`	`benchmark_seed`	One record per seed/model with finite numeric per-seed metrics.
`benchmark_summary`	`benchmark_summary`	One record per model with aggregate metric means.
`benchmark_delta`	`benchmark_delta`	One record per model with paired baseline mean deltas when a baseline is present.
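A consumer of the metric log might group records by phase before reading them. The following is a rough sketch under one loud assumption: the key that holds the phase is taken to be `"phase"`, which this page does not confirm, so it is exposed as the `phase_key` parameter. The helper names are hypothetical, and the code reads the file with the standard library rather than through `JsonlLogger`.

```python
import json
from collections import defaultdict
from pathlib import Path

# Phase names from the table above.
KNOWN_PHASES = ("benchmark_seed", "benchmark_summary", "benchmark_delta")


def group_records_by_phase(jsonl_path, phase_key="phase"):
    """Group JSONL records by phase.

    Assumption: each line is a JSON object and `phase_key` names the field
    holding the phase. Records without that field are grouped under None.
    """
    grouped = defaultdict(list)
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            grouped[record.get(phase_key)].append(record)
    return dict(grouped)


def unknown_phases(grouped):
    """Return phase values that are not in the documented phase list."""
    return sorted(str(phase) for phase in grouped if phase not in KNOWN_PHASES)
```

A non-empty result from `unknown_phases` would suggest either a different key name or records outside the three documented phases.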

Figure Artifacts

Path	Source artifact	Purpose
`figures/metric_summary.svg`	`aggregated/summary.json`	Aggregate metric means by model.
`figures/per_seed_metric_distributions.svg`	`aggregated/summary.json`	Box/scatter-style per-seed metric distributions.
`figures/baseline_delta.svg`	`aggregated/paired_stats.json`	Current-minus-baseline mean deltas by model/metric.
`figures/dose_figure_pack/`	DOSE diagnostics and operating points	Threshold sweeps, trade-offs, confusion matrices, edge cases, and provenance-split bottom histograms.
`figures/robotics_figure_pack/`	IK seed artifacts and trajectory summaries	Workspace heatmaps, det(J)-stratified plots, solver-route maps, and fallback timelines.
`composability_figure_pack/`	downstream pipeline simulator output	Stage transition diagrams, failure propagation diagrams, corruption sensitivity curves, and flat per-stage summaries.

Validation Rules

Use `zeroproofml.benchmarks.validate_run_dir(...)` or `python scripts/ci/benchmark_gate.py --family scientific <run_dir>` before publishing or comparing a run. Validation checks the manifest/provenance/summary schema markers, required files, and recorded artifact hashes. Older benchmark artifacts without the current schema markers fail fast instead of being silently mixed with current scientific claim runs.
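The command-line gate can be wrapped for use from a script. The sketch below builds exactly the command shown above and runs it with `subprocess`. Two things are assumptions rather than documented behaviour: that the script is run from the repository root, and that it signals failure with a nonzero exit code. `gate_command` and `run_gate` are hypothetical helper names.

```python
import subprocess
import sys


def gate_command(run_dir):
    """Build the documented gate invocation for one run directory."""
    return [
        sys.executable,
        "scripts/ci/benchmark_gate.py",
        "--family",
        "scientific",
        str(run_dir),
    ]


def run_gate(run_dir, repo_root="."):
    """Run the gate and return True on exit code 0.

    Assumptions: `repo_root` is the repository checkout that contains
    scripts/ci/, and a nonzero exit code means validation failed.
    """
    completed = subprocess.run(gate_command(run_dir), cwd=repo_root)
    return completed.returncode == 0
```

Using `sys.executable` keeps the gate on the same interpreter and environment as the calling script.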