Benchmark Artifact Glossary

This page is the schema and artifact reference for runs produced by the current python -m zeroproofml.benchmarks ... command. It describes the run directory of the supported scientific benchmarks, not the performance microbenchmark JSONs under perf/.

Top-Level Contract

Every completed benchmark run is a self-contained directory under results/benchmarks/<domain>/....

| Path | Schema marker | Producer | Purpose |
| --- | --- | --- | --- |
| manifest.json | zeroproofml_benchmark_manifest, version 4 | benchmark harness | Run index, artifact paths, config hash, dataset fingerprints, checkpoint hashes, bundle hashes, and post-processed summary hashes. |
| provenance.json | zeroproofml_provenance, version 6 | benchmark harness | Git, system, Python/backend package versions, dataset fingerprints, invocation args, config hash, and resume attempt history. |
| resume_state.json | zeroproofml_benchmark_resume_state, version 2 | benchmark harness | Planned seed contract and attempt history used by --resume. |
| CLAIM_AUDIT.md | Markdown | claim audit | Human-readable pass/fail evidence for claim-gate checks. |
| RUN_REPORT.md / RUN_REPORT.html | Markdown / HTML | report layer | Human-readable summary regenerated from stored artifacts. |

manifest.json["artifacts"] stores paths relative to the run root. Treat it as the authoritative table of contents for automation.
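
Because the manifest paths are relative to the run root, automation has to join them with the run directory before opening anything. The following is a minimal sketch of that step, not the harness's own loader. It assumes that manifest.json["artifacts"] is a flat mapping from artifact name to relative path string; the real layout may be nested, and the helper names resolve_artifacts and missing_artifacts are hypothetical.

```python
import json
from pathlib import Path


def resolve_artifacts(run_root):
    """Return {artifact_name: absolute Path} for one run directory.

    ASSUMPTION: manifest["artifacts"] is a flat mapping of
    name -> relative path string. Adjust if the real layout is nested.
    """
    root = Path(run_root)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    resolved = {}
    for name, rel_path in manifest["artifacts"].items():
        # Paths are stored relative to the run root, so join before resolving.
        resolved[name] = (root / rel_path).resolve()
    return resolved


def missing_artifacts(run_root):
    """List the names of artifacts whose resolved path is absent on disk."""
    return sorted(
        name
        for name, path in resolve_artifacts(run_root).items()
        if not path.exists()
    )
```

A quick existence check of this kind does not replace the validation entry points named under Validation Rules; it only confirms that the table of contents matches the files on disk.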

Per-Seed Artifacts

| Path | Schema marker | Purpose |
| --- | --- | --- |
| seed_*/per_seed_result.json | domain-specific seed schema | Canonical raw result payload normalized by the harness. |
| seed_*/dose_diagnostics.json | zeroproofml_benchmark_dose_diagnostics | DOSE per-seed confusion matrices, threshold sweeps, denominator histograms, borderline examples, and direction diagnostics. |
| seed_*/rf_signal_traces.json | RF trace sidecar | Saved response traces used to regenerate RF frequency-response figures. |
| seed_*/rf_frequency_response.svg | SVG | Per-seed Bode-style RF response plot from the saved trace sidecar. |
| seed_*/rf_qualitative_figure_pack/ | figure-pack manifest | Targeted RF washout / invented-peak examples selected from saved traces. |

Current canonical per-seed schema markers are:

| Domain | Schema marker |
| --- | --- |
| DOSE | zeroproofml_benchmark_dose_seed_result |
| RF | zeroproofml_benchmark_rf_seed_result |
| IK | zeroproofml_benchmark_ik_seed_result |
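
A script that consumes per-seed results can compare each payload against the marker listed for its domain before reading any metrics. The sketch below illustrates that check under one loud assumption: this page does not say which JSON key holds the marker, so the key name "schema" is hypothetical and exposed as the marker_key parameter. The function name check_seed_markers is likewise illustrative.

```python
import json
from pathlib import Path

# Marker strings copied from the table above.
SEED_MARKERS = {
    "DOSE": "zeroproofml_benchmark_dose_seed_result",
    "RF": "zeroproofml_benchmark_rf_seed_result",
    "IK": "zeroproofml_benchmark_ik_seed_result",
}


def check_seed_markers(run_root, domain, marker_key="schema"):
    """Return {seed_dir_name: True/False} for every seed_* directory.

    ASSUMPTION: the marker lives under payload[marker_key]; the default
    key name "schema" is a guess, not taken from the harness.
    """
    expected = SEED_MARKERS[domain]
    results = {}
    for path in sorted(Path(run_root).glob("seed_*/per_seed_result.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        results[path.parent.name] = payload.get(marker_key) == expected
    return results
```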

Aggregated Artifacts

| Path | Schema marker | Purpose |
| --- | --- | --- |
| aggregated/summary.json | zeroproofml_benchmark_summary, version 1 | Cross-seed metrics, metric definitions, per-seed metric table, and model summaries. |
| aggregated/summary.md | Markdown | Human-readable rendering of summary.json. |
| aggregated/paired_stats.json | zeroproofml_paired_stats, version 1 | Per-model seed distributions and optional current-minus-baseline paired deltas. |
| aggregated/paired_stats.md | Markdown | Human-readable rendering of paired_stats.json. |
| aggregated/benchmark_metrics.jsonl | zeroproofml.metric_log, version 1 | Versioned JSONL metric log emitted by the benchmark harness through JsonlLogger. |
| aggregated/dose_operating_points.json | DOSE operating-point schema | Backing evidence for safety-first, accuracy-first, and direction-aware presets. |
| aggregated/dose_pareto_front.json | DOSE Pareto-front schema | Nondominated safety/coverage, safety/accuracy, and direction/regression frontiers. |
| aggregated/dose_diagnostics.json | zeroproofml_benchmark_dose_diagnostics | Aggregated DOSE diagnostics used for figure regeneration. |
| aggregated/dirhead_only_summary.json | frozen direction-head summary | DOSE --mode frozen-dirhead follow-up summary. |

aggregated/benchmark_metrics.jsonl uses these phases:

| Phase | Record type | Meaning |
| --- | --- | --- |
| benchmark_seed | benchmark_seed | One record per seed/model with finite numeric per-seed metrics. |
| benchmark_summary | benchmark_summary | One record per model with aggregate metric means. |
| benchmark_delta | benchmark_delta | One record per model with paired baseline mean deltas when a baseline is present. |
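
Since the metric log is JSONL, it can be read one line at a time and bucketed by phase. The following sketch shows one way to do that with the standard library only. It assumes each record carries its phase under a "phase" key; that key name is not confirmed by this page and is exposed as the hypothetical phase_key parameter, and group_by_phase is an illustrative helper rather than part of JsonlLogger.

```python
import json
from collections import defaultdict
from pathlib import Path

# Phase names copied from the table above.
KNOWN_PHASES = ("benchmark_seed", "benchmark_summary", "benchmark_delta")


def group_by_phase(jsonl_path, phase_key="phase"):
    """Return {phase: [records]} for a JSONL metric log.

    ASSUMPTION: each record stores its phase under record[phase_key].
    Records without that key are collected under "unknown".
    """
    groups = defaultdict(list)
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            groups[record.get(phase_key, "unknown")].append(record)
    return dict(groups)
```

A run without a baseline would simply produce no benchmark_delta bucket, so callers should use groups.get("benchmark_delta", []) rather than indexing directly.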

Figure Artifacts

| Path | Source artifact | Purpose |
| --- | --- | --- |
| figures/metric_summary.svg | aggregated/summary.json | Aggregate metric means by model. |
| figures/per_seed_metric_distributions.svg | aggregated/summary.json | Box/scatter-style per-seed metric distributions. |
| figures/baseline_delta.svg | aggregated/paired_stats.json | Current-minus-baseline mean deltas by model/metric. |
| figures/dose_figure_pack/ | DOSE diagnostics and operating points | Threshold sweeps, trade-offs, confusion matrices, edge cases, and provenance-split bottom histograms. |
| figures/robotics_figure_pack/ | IK seed artifacts and trajectory summaries | Workspace heatmaps, det(J)-stratified plots, solver-route maps, and fallback timelines. |
| composability_figure_pack/ | downstream pipeline simulator output | Stage transition diagrams, failure propagation diagrams, corruption sensitivity curves, and flat per-stage summaries. |

Validation Rules

Use zeroproofml.benchmarks.validate_run_dir(...) or python scripts/ci/benchmark_gate.py --family scientific <run_dir> before publishing or comparing a run. Validation checks the manifest/provenance/summary schema markers, required files, and recorded artifact hashes. Older benchmark artifacts without the current schema markers fail fast instead of being silently mixed with current scientific claim runs.
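
For pipelines that gate on validation, the command-line form above can be wrapped in a small helper. This is a sketch only: the script path and flags are taken from the command above, while the helper names, the repo_root parameter, and the assumption that the gate script reports failure through a nonzero exit code are not confirmed by this page.

```python
import subprocess
import sys


def build_gate_command(run_dir):
    """Build the argv for the scientific benchmark gate on one run directory."""
    return [
        sys.executable,
        "scripts/ci/benchmark_gate.py",
        "--family",
        "scientific",
        str(run_dir),
    ]


def run_gate(run_dir, repo_root="."):
    """Run the gate from repo_root and return True on exit code 0.

    ASSUMPTION: the gate script signals a failed validation with a
    nonzero exit code; repo_root is a hypothetical convenience parameter.
    """
    completed = subprocess.run(build_gate_command(run_dir), cwd=repo_root)
    return completed.returncode == 0
```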