Bottom-Mask Provenance Design

Context

The current strict inference contract is intentionally small and fail-closed. It exposes three outputs: decoded, bottom_mask, and gap_mask. That contract is already frozen into bundle metadata, compatibility fixtures, and downstream helpers such as StrictInferenceMonitor and reject_on_gap(...).

The question for Q1 is not whether provenance might be useful. It is whether ZeroProofML should change that public contract now, or first gather evidence with a narrower experimental path.

Options Compared

Option A. Split masks (fault_mask + semantic_bottom_mask)
  • Strengths: Lossless and easy to aggregate with existing boolean-mask tooling.
  • Main costs: Expands the public output contract immediately and invites consumers to treat the split as settled before the evidence exists.

Option B. Keep bottom_mask, add bottom_provenance
  • Strengths: Preserves the merged fail-closed mask for legacy users.
  • Main costs: Still changes the exported/runtime contract, and an enum/int-coded tensor is less natural than booleans across Torch/NumPy/ONNX helpers.

Option C. Heuristic-first diagnostic path
  • Strengths: Preserves the current contract, can ship behind an opt-in flag, and matches the roadmap's "design first, decide after Q2 evidence" posture.
  • Main costs: Provenance is only approximate at first and may need training-time support later.
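
To make the difference between Options A and B concrete, the sketch below shows how each representation could collapse back to the merged bottom_mask. It is illustrative only: the BottomProvenance codes, the helper names, and the "fault wins" precedence rule are assumptions made for this example, not part of any shipped ZeroProofML contract. Plain Python lists stand in for tensors.

```python
from enum import IntEnum


class BottomProvenance(IntEnum):
    """Hypothetical int codes for Option B's bottom_provenance tensor."""
    NOT_BOTTOM = 0
    FAULT = 1
    SEMANTIC = 2


def merged_from_split(fault_mask, semantic_bottom_mask):
    """Option A: an element is bottom if either split mask is set."""
    return [f or s for f, s in zip(fault_mask, semantic_bottom_mask)]


def provenance_from_split(fault_mask, semantic_bottom_mask):
    """Encode the two boolean masks as one int-coded sequence.

    Assumption: when both masks are set, FAULT takes precedence. This
    precedence discards the "both" case, which the split masks retain.
    """
    codes = []
    for f, s in zip(fault_mask, semantic_bottom_mask):
        if f:
            codes.append(BottomProvenance.FAULT)
        elif s:
            codes.append(BottomProvenance.SEMANTIC)
        else:
            codes.append(BottomProvenance.NOT_BOTTOM)
    return codes


def merged_from_provenance(bottom_provenance):
    """Option B: any code other than NOT_BOTTOM implies bottom."""
    return [code != BottomProvenance.NOT_BOTTOM for code in bottom_provenance]


fault = [True, False, False, True]
semantic = [False, True, False, True]
merged = merged_from_split(fault, semantic)
codes = provenance_from_split(fault, semantic)
# Both representations should agree on the merged fail-closed mask.
consistent = merged_from_provenance(codes) == merged
```

Under these assumptions both options reduce to the same merged mask, so the choice between them is about the shape of the public contract rather than about what legacy consumers see.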

Decision

Choose Option C for Q1-Q2. Treat bottom provenance as an experimental diagnostic layer first, not as a stable inference/bundle contract change.

Rationale

  • The current repository already treats the merged bottom_mask as the stable public contract. Changing that shape now would force schema, bundle-fixture, monitoring, and compatibility churn before the value of the split is proven.
  • The codebase can reliably identify some numerical-fault cases at inference time, but it does not yet have a mature training-time signal for "semantic bottom" that would justify freezing a public tensor representation.
  • A heuristic-first path is enough to answer the Q2 roadmap question: does provenance materially improve monitoring, calibration, routing, or operator reporting compared with the existing merged-mask behavior?

Where Provenance Is Determined

Provenance will be determined in two stages, starting at the inference level and moving to training time only if the evidence supports it:

  1. In Q1, determine provenance at the inference/diagnostic level only for cases that are already observable from the strict decode path: non-finite inputs, clear denominator faults, and the existing tau_infer versus tau_train gap region.
  2. Only if Q2 evidence is positive should the project add a deeper model/training-time provenance path and revisit whether a public tensor representation is worth standardizing.

This keeps the initial rollout grounded in signals the runtime already knows how to observe, while leaving room to add a training-time source later if the diagnostic experiment proves materially useful.
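
A minimal sketch of the Q1 heuristic stage follows. It assumes a scalar denominator per sample, that tau_infer <= tau_train, and that a magnitude below tau_infer counts as a clear denominator fault while a magnitude between the two thresholds falls in the gap region. The function name and the label strings are hypothetical and do not reflect an existing API.

```python
import math


def classify_provenance(inputs, denominator, tau_infer, tau_train):
    """Return a hypothetical diagnostic label for one sample.

    Only the three signals named in stage 1 are checked. Anything else
    is reported as "unattributed": no heuristic fired, so the sample is
    either fine or a semantic bottom these heuristics cannot observe.
    """
    if tau_infer > tau_train:
        raise ValueError("expected tau_infer <= tau_train")
    # Signal 1: non-finite inputs (NaN or +/-inf).
    if not all(math.isfinite(x) for x in inputs):
        return "fault:non_finite_input"
    # Signal 2: clear denominator fault (assumed: |q| < tau_infer).
    if not math.isfinite(denominator) or abs(denominator) < tau_infer:
        return "fault:denominator"
    # Signal 3: gap region (assumed: tau_infer <= |q| < tau_train).
    if abs(denominator) < tau_train:
        return "gap"
    return "unattributed"


labels = [
    classify_provenance([1.0, float("nan")], 1.0, 1e-6, 1e-4),
    classify_provenance([1.0], 0.0, 1e-6, 1e-4),
    classify_provenance([1.0], 5e-5, 1e-6, 1e-4),
    classify_provenance([1.0], 1.0, 1e-6, 1e-4),
]
```

The "unattributed" label is the explicit cost of the heuristic-first path: attribution is approximate until a training-time source exists.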

Rollout Stages

  1. Q1: experimental diagnostics only. Any richer provenance output must live behind an explicit opt-in flag or a separate diagnostic API. Stable APIs, exported bundles, and existing consumers continue to receive only the merged fail-closed bottom_mask.
  2. Q2: promotion review. Review whether the experimental signal delivered measurable value for at least one target use case: DOSE calibration analysis, robotics fallback handling, or the composability benchmark. The review should also confirm that merged-mask consumers did not regress.
  3. Post-Q2: choose one disposition explicitly. Promote provenance into a stable public contract only if the evidence is strong. Otherwise keep it experimental with a narrower scope, or remove the path entirely rather than carrying a weak contract forward.
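
One way the Q1 stage could keep the stable outputs untouched is to return diagnostics as a separate object that exists only when a caller opts in. The sketch below is an assumption-laden illustration: package_outputs, enable_experimental_provenance, and the diagnostics dictionary layout are hypothetical names chosen for this example.

```python
STABLE_KEYS = ("decoded", "bottom_mask", "gap_mask")


def package_outputs(stable, provenance=None, *, enable_experimental_provenance=False):
    """Return (stable_outputs, diagnostics).

    stable_outputs always contains exactly the three stable keys, so
    existing consumers see the same shape whether or not the flag is set.
    diagnostics is None unless the caller explicitly opts in.
    """
    stable_outputs = {key: stable[key] for key in STABLE_KEYS}
    if not enable_experimental_provenance:
        return stable_outputs, None
    diagnostics = {
        "experimental": True,  # marked experimental, per the rollout plan
        "bottom_provenance": provenance,
    }
    return stable_outputs, diagnostics


raw = {"decoded": [0.5, 0.0], "bottom_mask": [False, True], "gap_mask": [False, False]}
default_out, default_diag = package_outputs(raw)
opt_in_out, opt_in_diag = package_outputs(
    raw, provenance=["unattributed", "fault:denominator"],
    enable_experimental_provenance=True,
)
```

Keeping the diagnostics out of the stable dictionary means removing the experiment after Q2 would not change what stable callers receive.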

Rollout Guardrails

  1. Keep bottom_mask as the fail-closed default for all stable APIs, bundles, and downstream consumers.
  2. Any provenance artifact added in Q1/Q2 must be opt-in and clearly marked experimental in config, metadata, and docs.
  3. Promotion requires evidence that provenance improves at least one downstream use case without regressing merged-mask consumers.
  4. If the experiment does not show clear value, keep provenance experimental or remove it rather than silently promoting a weak contract.

Measurable Promotion Criteria

Promotion out of the experimental lane requires a like-for-like comparison against the merged-mask baseline on the same commit, bundle, and evaluation inputs. The provenance path should be promoted only if all of the following hold:

  1. Stable-contract non-regression: with provenance disabled, strict inference still emits the same merged bottom_mask / gap_mask behavior, and stable bundle output names remain unchanged until an explicit schema review approves a contract bump.
  2. At least one target use case clears a concrete value threshold:
       • DOSE calibration analysis: provenance-conditioned monitoring, routing, or threshold selection improves either false_in_range_rate_on_censored or false_censored_rate_on_in_range by >=10% relative at matched coverage (within 1 percentage point of the merged-mask baseline).
       • Robotics fallback handling: provenance-aware fallback or operator reporting reduces unsafe accepts or fallback misroutes by >=25% relative, while accepted-sample coverage stays within 1 percentage point of the merged-mask baseline.
       • Composability benchmark: provenance-aware composition/reporting reduces unresolved bottom/fallback ambiguity by >=15% relative without reducing valid-output coverage by more than 1 percentage point.
  3. Operational cost stays bounded: the opt-in diagnostic path adds no more than 5% wall-clock overhead on the review workload and does not force a stable metadata or ONNX output contract change before promotion is approved.
  4. Evidence is repeatable: the winning use case must clear its threshold in the recorded Q2 review artifact plus at least one rerun with the same measurement recipe, so the promotion decision is not based on a single noisy result.
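
The criteria above reduce to a small amount of arithmetic, sketched below under stated assumptions: error rates are fractions, coverage is expressed in percent so that differences are percentage points, and the function and parameter names are hypothetical. The coverage check here is two-sided ("within 1 percentage point"); the composability criterion is one-sided and would need a variant that only penalizes coverage loss.

```python
def relative_reduction(baseline_rate, candidate_rate):
    """Relative improvement of candidate over baseline (0.15 means 15%)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - candidate_rate) / baseline_rate


def clears_threshold(baseline_rate, candidate_rate,
                     baseline_coverage_pct, candidate_coverage_pct,
                     min_relative, max_coverage_delta_pp=1.0):
    """True if the error metric improves enough at matched coverage."""
    matched = abs(candidate_coverage_pct - baseline_coverage_pct) <= max_coverage_delta_pp
    improved = relative_reduction(baseline_rate, candidate_rate) >= min_relative
    return matched and improved


def should_promote(non_regression_ok, run_results, overhead_fraction,
                   max_overhead=0.05):
    """Combine the criteria for the winning use case.

    run_results holds one boolean per measurement (the Q2 review artifact
    plus reruns); at least two runs are required and all must pass.
    """
    repeatable = len(run_results) >= 2 and all(run_results)
    return non_regression_ok and repeatable and overhead_fraction <= max_overhead


# Example with made-up numbers for a DOSE-style metric (10% relative threshold).
review = clears_threshold(0.20, 0.17, 90.0, 90.5, min_relative=0.10)
rerun = clears_threshold(0.20, 0.19, 90.0, 90.5, min_relative=0.10)
decision = should_promote(True, [review, rerun], overhead_fraction=0.03)
```

In this made-up example the review run clears the threshold but the rerun does not, so the repeatability criterion blocks promotion.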