Choosing `tau_infer`¶

tau_infer is the denominator threshold applied during strict inference. Any sample with |Q| < tau_infer is treated as bottom and flagged in bottom_mask.
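The rule above can be sketched in a few lines of NumPy. This only illustrates the thresholding step and is not the library's implementation; `strict_bottom_mask` is a hypothetical helper name and the sample values are made up.

```python
import numpy as np

def strict_bottom_mask(q_abs, tau_infer):
    """Return True where |Q| < tau_infer, i.e. where a sample is treated as bottom.

    Hypothetical helper for illustration; not part of zeroproofml.
    """
    q_abs = np.asarray(q_abs, dtype=float)
    return q_abs < tau_infer

# Made-up denominator magnitudes; the third one sits exactly on the threshold.
q_abs = np.array([2e-7, 5e-6, 1e-5, 3e-2])
bottom_mask = strict_bottom_mask(q_abs, tau_infer=1e-5)
print(bottom_mask)
```

Because the comparison is strict, a sample with |Q| exactly equal to tau_infer is not bottomed.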

Use this guide when you need to pick one deployment threshold and defend it in terms of saved artifacts rather than ad hoc tuning.

Default workflow¶

Start from a held-out split or replay batch that matches deployment.
Cache |Q| values and the task labels you care about.
Sweep candidate thresholds before freezing a bundle.
Record the chosen tau_infer in the exported bundle metadata.

To monitor the gray zone between the training and inference thresholds as well, set tau_train above tau_infer so that strict inference can emit gap_mask.
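A minimal sketch of the gray zone, assuming gap_mask marks samples with tau_infer <= |Q| < tau_train. The boundary conventions and the helper name `gap_zone_mask` are assumptions made for illustration; the library's own definition is authoritative.

```python
import numpy as np

def gap_zone_mask(q_abs, tau_infer, tau_train):
    """Flag samples that pass the strict gate but fall below the training threshold.

    Assumed convention: tau_infer <= |Q| < tau_train. Hypothetical helper.
    """
    if tau_train <= tau_infer:
        raise ValueError("tau_train must be above tau_infer for a non-empty gap zone")
    q_abs = np.asarray(q_abs, dtype=float)
    return (q_abs >= tau_infer) & (q_abs < tau_train)

# Made-up magnitudes: one bottomed, one in the gap, two safely above tau_train.
q_abs = np.array([2e-7, 5e-5, 1e-4, 3e-2])
gap_mask = gap_zone_mask(q_abs, tau_infer=1e-5, tau_train=1e-4)
print(gap_mask)
```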

Sweep first, then freeze¶

The supported post-hoc helper is tau_infer_sweep_from_q_abs(...). It turns cached denominator magnitudes into false-positive, false-negative, and bottom-rate curves without rerunning the full model:

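from zeroproofml.metrics import tau_infer_sweep_from_q_abs, write_tau_infer_sweep

# q_abs_held_out: cached |Q| magnitudes from the held-out split
# is_in_range: the matching per-sample task labels
curves = tau_infer_sweep_from_q_abs(
    q_abs=q_abs_held_out,
    is_in_range=is_in_range,
    taus=[1e-6, 3e-6, 1e-5, 3e-5, 1e-4],  # candidate thresholds to evaluate
)
# Write the sweep artifacts, tagged with the split they came from
write_tau_infer_sweep("results/tau_calibration", curves, provenance={"split": "held_out"})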

That writes both a machine-readable JSON artifact and a compact Markdown report. Use the sweep to pick the smallest threshold that still meets the deployment's numerical-safety requirement.
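To make the selection rule concrete, here is a standalone sketch that computes per-threshold rates by hand and picks the smallest threshold within a false-accept budget. It does not call the library, and the rate definitions below are assumptions chosen for illustration; the helper's own false-positive and false-negative definitions may differ. The function names, the budget, and the toy data are all hypothetical.

```python
import numpy as np

def sweep_rates(q_abs, is_in_range, taus):
    """Per-threshold rates from cached |Q| values (illustrative definitions only)."""
    q_abs = np.asarray(q_abs, dtype=float)
    in_range = np.asarray(is_in_range, dtype=bool)
    rows = []
    for tau in sorted(taus):
        bottom = q_abs < tau
        rows.append({
            "tau": tau,
            # fraction of all samples that are bottomed at this threshold
            "bottom_rate": float(bottom.mean()),
            # out-of-range samples that pass the gate (assumed definition)
            "false_accept_rate": float((~bottom & ~in_range).mean()),
            # in-range samples that are bottomed unnecessarily (assumed definition)
            "false_reject_rate": float((bottom & in_range).mean()),
        })
    return rows

def smallest_safe_tau(rows, max_false_accept_rate):
    """Return the smallest tau whose false-accept rate is within budget, else None."""
    for row in rows:  # rows are ordered by ascending tau
        if row["false_accept_rate"] <= max_false_accept_rate:
            return row["tau"]
    return None

# Toy data: three out-of-range samples with tiny |Q|, three in-range samples.
q_abs = [2e-7, 4e-6, 2e-5, 5e-3, 1e-1, 3e-1]
is_in_range = [False, False, False, True, True, True]
rows = sweep_rates(q_abs, is_in_range, taus=[1e-6, 1e-5, 1e-4])
chosen = smallest_safe_tau(rows, max_false_accept_rate=0.0)
print(chosen)
```

Returning None when no candidate meets the budget keeps the failure explicit, which is a cue to widen the candidate grid rather than ship an unsafe threshold.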

Gauge scope: a post-hoc tau_infer sweep over cached |Q| distributions is gauge-correct only for the specific head and magnitude convention that produced those distributions. Re-run or relabel the sweep before comparing thresholds across heads, bundles, or normalization conventions.

DOSE calibration-set workflow¶

The DOSE benchmark writes the calibration workflow into the run artifact so the chosen operating point can be audited without rerunning notebooks:

python -m zeroproofml.benchmarks dose --mode smoke --seeds 1 --device cpu

For DOSE runs, the benchmark runner emits:

aggregated/dose_operating_points.json
aggregated/dose_operating_points.md
aggregated/dose_diagnostics.json

dose_operating_points.json records the safety_first, accuracy_first, and direction_aware candidates selected from the aggregate metrics, along with the tau_infer / tau_train values observed in each per-seed result. When a seed result includes split provenance, the artifact also includes the deterministic calibration/evaluation split recipe and uses the provenance-weighted bottom cost fault_rate + 0.5 * semantic_bottom_rate; otherwise it falls back to the merged bottom_rate.
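The bottom-cost rule described above amounts to a small piece of arithmetic. The sketch below assumes the split rates are available exactly when split provenance is present; the function name and argument layout are hypothetical and do not mirror the benchmark runner's code.

```python
def bottom_cost(bottom_rate, fault_rate=None, semantic_bottom_rate=None):
    """Bottom cost for one result (hypothetical helper, not the runner's code).

    With split provenance, faults count fully and semantic bottoms at half
    weight; without it, fall back to the merged bottom rate.
    """
    if fault_rate is not None and semantic_bottom_rate is not None:
        return fault_rate + 0.5 * semantic_bottom_rate
    return bottom_rate

# Made-up rates for illustration.
with_split = bottom_cost(bottom_rate=0.12, fault_rate=0.02, semantic_bottom_rate=0.10)
merged_only = bottom_cost(bottom_rate=0.12)
print(with_split, merged_only)
```

With these made-up rates, the weighted cost comes out below the merged bottom rate because semantic bottoms are discounted relative to faults.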

Treat that JSON file as the release-facing operating-point record. The Markdown sidecar is the human-readable summary for paper reviews and deployment notes.

Practical selection rules¶

Prefer a threshold derived from held-out |Q| data, not a training default.
Keep the choice task-specific: censoring, robotics, and RF runs usually want different operating points.
If false accepts are the safety risk, bias tau_infer upward.
If unnecessary abstention is the main cost, bias tau_infer downward and use gap_mask for extra monitoring.
Re-run the sweep whenever the model family, normalization, or input preprocessing changes.

The reference robotics deployment uses exactly this pattern: it estimates tau_infer from a held-out |Q| quantile, then sets tau_train as a fixed multiple so deployment can track the gap region separately.
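The quantile pattern can be sketched as follows. The quantile level, the multiple, and the synthetic |Q| values are placeholders, not the reference deployment's settings, and `thresholds_from_quantile` is a hypothetical helper name.

```python
import numpy as np

def thresholds_from_quantile(q_abs_held_out, quantile=0.01, train_multiple=10.0):
    """Estimate tau_infer as a low quantile of held-out |Q|, then scale it for tau_train.

    Both default values are placeholders chosen for illustration.
    """
    q_abs = np.asarray(q_abs_held_out, dtype=float)
    tau_infer = float(np.quantile(q_abs, quantile))
    tau_train = train_multiple * tau_infer
    return tau_infer, tau_train

# Synthetic stand-in for cached held-out |Q| values.
rng = np.random.default_rng(0)
q_abs_held_out = np.abs(rng.normal(loc=0.0, scale=1.0, size=10_000))
tau_infer, tau_train = thresholds_from_quantile(q_abs_held_out)
print(tau_infer, tau_train)
```

A quantile ties the expected bottom rate on the held-out split directly to the chosen level, which makes the threshold easier to defend than a hand-picked constant.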

Ship the threshold with the bundle¶

Once chosen, freeze the threshold in InferenceConfig and export the bundle:

from zeroproofml.inference import InferenceConfig, SCMInferenceWrapper, export_bundle

cfg = InferenceConfig(tau_infer=1e-5, tau_train=1e-4)
wrapped = SCMInferenceWrapper(model, config=cfg)
export_bundle(wrapped, (x_example,), "bundle_dir")

metadata.json records tau_infer and optional tau_train, so downstream consumers and regenerated reports can explain the exact gate that was shipped.
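A downstream consumer could read the shipped gate back with plain JSON handling. This sketch assumes tau_infer and tau_train are top-level keys in metadata.json, which the text above does not confirm; it writes a stand-in file so it runs without zeroproofml installed, and `read_shipped_thresholds` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

def read_shipped_thresholds(bundle_dir):
    """Return (tau_infer, tau_train) from a bundle's metadata.json.

    Assumes both values are top-level keys; tau_train may be absent.
    """
    metadata = json.loads((Path(bundle_dir) / "metadata.json").read_text())
    return metadata["tau_infer"], metadata.get("tau_train")

# Stand-in bundle directory; a real bundle would come from export_bundle.
with tempfile.TemporaryDirectory() as bundle_dir:
    stand_in = {"tau_infer": 1e-5, "tau_train": 1e-4}
    (Path(bundle_dir) / "metadata.json").write_text(json.dumps(stand_in))
    tau_infer, tau_train = read_shipped_thresholds(bundle_dir)

print(tau_infer, tau_train)
```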
