Choosing tau_infer
tau_infer is the strict inference denominator threshold.
If |Q| < tau_infer, the sample is treated as bottom and lands in
bottom_mask.
Use this guide when you need to pick one deployment threshold and defend it in terms of saved artifacts rather than ad hoc tuning.
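The thresholding rule itself is small enough to sketch in a few lines. The helper below is illustrative only (the name `bottom_mask` is borrowed from the mask described above, and plain Python lists stand in for tensors); it is not the library's implementation.

```python
def bottom_mask(q_abs, tau_infer):
    """Flag every sample whose |Q| is strictly below tau_infer."""
    return [q < tau_infer for q in q_abs]


# Three cached denominator magnitudes; only the first falls below the threshold.
q_abs = [2e-7, 1e-5, 4e-3]
mask = bottom_mask(q_abs, tau_infer=1e-5)
print(mask)  # [True, False, False]
```

Because the comparison is strict, a sample with |Q| exactly equal to tau_infer is not treated as bottom.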
Default workflow
- Start from a held-out split or replay batch that matches deployment.
- Cache |Q| values and the task labels you care about.
- Sweep candidate thresholds before freezing a bundle.
- Record the chosen tau_infer in the exported bundle metadata.
If you also want to monitor the train/infer gray zone, set tau_train above
tau_infer so strict inference can emit gap_mask.
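As a rough illustration of the gray zone, the sketch below assumes the gap region is the half-open band tau_infer <= |Q| < tau_train. That boundary convention and the helper name `strict_masks` are assumptions made for this example, not taken from the library.

```python
def strict_masks(q_abs, tau_infer, tau_train):
    """Return (bottom, gap) masks for a list of |Q| values.

    Assumed convention: bottom is |Q| < tau_infer and the gap zone is
    tau_infer <= |Q| < tau_train.
    """
    if tau_train <= tau_infer:
        raise ValueError("tau_train must be above tau_infer for a non-empty gap zone")
    bottom = [q < tau_infer for q in q_abs]
    gap = [tau_infer <= q < tau_train for q in q_abs]
    return bottom, gap


bottom, gap = strict_masks([2e-7, 5e-5, 4e-3], tau_infer=1e-5, tau_train=1e-4)
print(bottom)  # [True, False, False]
print(gap)     # [False, True, False]
```

Under this convention the two masks never overlap, so the gap rate can be monitored independently of the bottom rate.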
Sweep first, then freeze
The supported post-hoc helper is tau_infer_sweep_from_q_abs(...).
It turns cached denominator magnitudes into false-positive, false-negative, and
bottom-rate curves without rerunning the full model:
from zeroproofml.metrics import tau_infer_sweep_from_q_abs, write_tau_infer_sweep

curves = tau_infer_sweep_from_q_abs(
    q_abs=q_abs_held_out,
    is_in_range=is_in_range,
    taus=[1e-6, 3e-6, 1e-5, 3e-5, 1e-4],
)
write_tau_infer_sweep("results/tau_calibration", curves, provenance={"split": "held_out"})
That writes both a machine-readable JSON artifact and a compact Markdown report. Use the sweep to pick the smallest threshold that still meets the deployment's numerical-safety requirement.
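To make the selection step concrete, here is a minimal stdlib-only sketch of what a sweep over cached |Q| values could compute and how the smallest acceptable threshold could be picked from it. It is not the implementation of tau_infer_sweep_from_q_abs. The function names, the row layout, and the reading of `is_in_range` (True means the sample should be accepted) are assumptions, and the two error rates are named explicitly rather than mapped onto the helper's false-positive and false-negative curves.

```python
def sweep(q_abs, is_in_range, taus):
    """Compute per-threshold rates from cached |Q| values and labels."""
    n = len(q_abs)
    n_in = sum(1 for ok in is_in_range if ok)
    n_out = n - n_in
    rows = []
    for tau in sorted(taus):
        bottom = [q < tau for q in q_abs]
        # In-range samples that the gate rejects (unnecessary abstention).
        false_bottom = sum(1 for b, ok in zip(bottom, is_in_range) if b and ok)
        # Out-of-range samples that the gate lets through.
        false_accept = sum(1 for b, ok in zip(bottom, is_in_range) if not b and not ok)
        rows.append({
            "tau": tau,
            "bottom_rate": sum(bottom) / n,
            "false_bottom_rate": false_bottom / n_in if n_in else 0.0,
            "false_accept_rate": false_accept / n_out if n_out else 0.0,
        })
    return rows


def smallest_safe_tau(rows, max_false_accept_rate):
    """Return the smallest tau meeting the false-accept budget, or None."""
    for row in rows:  # rows are already sorted by ascending tau
        if row["false_accept_rate"] <= max_false_accept_rate:
            return row["tau"]
    return None


q_abs = [1e-7, 2e-6, 8e-6, 5e-5, 1e-3, 2e-2]
is_in_range = [False, False, True, True, True, True]
rows = sweep(q_abs, is_in_range, taus=[1e-6, 3e-6, 1e-5])
chosen = smallest_safe_tau(rows, max_false_accept_rate=0.0)
print(chosen)  # 3e-06
```

In this toy data, 1e-6 still accepts one out-of-range sample, 3e-6 rejects both out-of-range samples without touching in-range ones, and 1e-5 starts rejecting an in-range sample. Choosing the smallest threshold that meets the safety budget keeps abstention as low as the requirement allows.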
Gauge scope: current post-hoc tau_infer sweeps over cached |Q|
distributions are gauge-correct only for the specific head and magnitude
convention that produced those distributions. Re-run or relabel the sweep before
comparing thresholds across heads, bundles, or normalization conventions.
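A toy example shows why a threshold does not transfer across magnitude conventions. If a change of normalization rescales every |Q| by a constant factor, the same numeric tau_infer selects a different set of samples, and only a threshold rescaled by the same factor reproduces the original gate. The factor of 100 and the helper name below are hypothetical.

```python
def bottom_rate(q_abs, tau):
    """Fraction of samples with |Q| strictly below tau."""
    return sum(q < tau for q in q_abs) / len(q_abs)


q_abs = [2e-7, 4e-6, 3e-5, 1e-3]
scale = 100.0  # hypothetical change of normalization convention
q_abs_rescaled = [scale * q for q in q_abs]

tau = 1e-5
rate_original = bottom_rate(q_abs, tau)                        # 0.5
rate_same_tau = bottom_rate(q_abs_rescaled, tau)               # 0.0
rate_rescaled_tau = bottom_rate(q_abs_rescaled, scale * tau)   # 0.5
print(rate_original, rate_same_tau, rate_rescaled_tau)
```

Real convention changes are rarely a clean constant factor, which is why re-running the sweep is the safer option than rescaling a stored threshold.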
DOSE calibration-set workflow
The DOSE benchmark writes the calibration workflow into the run artifact so the chosen operating point can be audited without rerunning notebooks:
python -m zeroproofml.benchmarks dose --mode smoke --seeds 1 --device cpu
For DOSE runs, the benchmark runner emits:
- aggregated/dose_operating_points.json
- aggregated/dose_operating_points.md
- aggregated/dose_diagnostics.json
dose_operating_points.json selects safety_first, accuracy_first, and
direction_aware candidates from the aggregate metrics. It also records the
tau_infer / tau_train values observed in each per-seed result. When the
seed result includes split provenance, the artifact includes the deterministic
calibration/evaluation split recipe and uses the provenance-weighted bottom
cost fault_rate + 0.5 * semantic_bottom_rate; otherwise it falls back to the
merged bottom_rate.
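The bottom-cost rule described above can be written out as a small function. This is a sketch of the stated formula only: the dictionary layout and the `has_split_provenance` flag are assumptions made for illustration, not the artifact's actual schema.

```python
def bottom_cost(metrics, has_split_provenance):
    """Provenance-weighted bottom cost, falling back to the merged bottom rate."""
    if has_split_provenance:
        # Faults count fully; semantic bottoms count at half weight.
        return metrics["fault_rate"] + 0.5 * metrics["semantic_bottom_rate"]
    return metrics["bottom_rate"]


seed_metrics = {"fault_rate": 0.02, "semantic_bottom_rate": 0.10, "bottom_rate": 0.12}
with_provenance = bottom_cost(seed_metrics, has_split_provenance=True)      # about 0.07
without_provenance = bottom_cost(seed_metrics, has_split_provenance=False)  # 0.12
print(with_provenance, without_provenance)
```

The half weight means a run dominated by semantic bottoms scores lower than the merged bottom_rate would suggest, so costs computed with and without provenance are not directly comparable.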
Treat that JSON file as the release-facing operating-point record. The Markdown sidecar is the human-readable summary for paper reviews and deployment notes.
Practical selection rules
- Prefer a threshold derived from held-out |Q| data, not a training default.
- Keep the choice task-specific: censoring, robotics, and RF runs usually want different operating points.
- If false accepts are the safety risk, bias tau_infer upward.
- If unnecessary abstention is the main cost, bias tau_infer downward and use gap_mask for extra monitoring.
- Re-run the sweep whenever the model family, normalization, or input preprocessing changes.
The reference robotics deployment uses exactly this pattern: it estimates
tau_infer from a held-out |Q| quantile, then sets tau_train as a fixed
multiple so deployment can track the gap region separately.
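The quantile-plus-multiple pattern can be sketched with the standard library alone. The quantile level (0.2), the multiple (10), the nearest-rank quantile definition, and both function names are placeholders chosen for this example; the reference deployment's actual values are not stated here.

```python
import math


def quantile_nearest_rank(values, q):
    """Smallest value with at least a fraction q of the samples at or below it."""
    if not values or not 0.0 < q <= 1.0:
        raise ValueError("need a non-empty sample and 0 < q <= 1")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def thresholds_from_held_out(q_abs, quantile, train_multiple):
    """Estimate tau_infer from a held-out quantile; set tau_train as a multiple."""
    if train_multiple <= 1.0:
        raise ValueError("train_multiple must exceed 1 so tau_train sits above tau_infer")
    tau_infer = quantile_nearest_rank(q_abs, quantile)
    return tau_infer, train_multiple * tau_infer


held_out_q_abs = [3e-6, 1e-3, 2e-5, 5e-2, 8e-4, 4e-4, 1e-1, 7e-3, 6e-5, 9e-2]
tau_infer, tau_train = thresholds_from_held_out(held_out_q_abs, quantile=0.2, train_multiple=10.0)
print(tau_infer)  # 2e-05
```

A quantile-based estimate should still be checked against a sweep, since the quantile fixes the bottom rate on the held-out data rather than any error rate.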
Ship the threshold with the bundle
Once chosen, freeze the threshold in InferenceConfig and export the bundle:
from zeroproofml.inference import InferenceConfig, SCMInferenceWrapper, export_bundle
cfg = InferenceConfig(tau_infer=1e-5, tau_train=1e-4)
wrapped = SCMInferenceWrapper(model, config=cfg)
export_bundle(wrapped, (x_example,), "bundle_dir")
metadata.json records tau_infer and optional tau_train, so downstream
consumers and regenerated reports can explain the exact gate that was shipped.
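A downstream consumer might read the shipped gate back with something like the sketch below. It assumes metadata.json sits at the root of the bundle directory and stores tau_infer and tau_train as top-level keys; the real schema may nest them differently, and `load_thresholds` is a hypothetical helper, not part of zeroproofml.

```python
import json
import tempfile
from pathlib import Path


def load_thresholds(bundle_dir):
    """Read tau_infer and optional tau_train from a bundle's metadata.json.

    Assumes both values are top-level keys; adjust the lookups if the
    real schema nests them.
    """
    meta_path = Path(bundle_dir) / "metadata.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    tau_infer = float(meta["tau_infer"])
    tau_train = meta.get("tau_train")
    if tau_train is not None:
        tau_train = float(tau_train)
        if tau_train <= tau_infer:
            raise ValueError("tau_train should sit above tau_infer")
    return tau_infer, tau_train


# Demo against a stand-in metadata.json written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {"tau_infer": 1e-5, "tau_train": 1e-4}
    (Path(tmp) / "metadata.json").write_text(json.dumps(stand_in), encoding="utf-8")
    tau_infer, tau_train = load_thresholds(tmp)

print(tau_infer, tau_train)
```

Reading the thresholds from the bundle rather than from code keeps reports tied to the gate that was actually exported.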