Triton Inference Server Decision

Review date: 2026-04-19.

Decision

A Triton-style inference-server path is useful as an optional downstream deployment recipe after the ONNX Runtime bundle path is stable, but it should not become a first-party ZeroProofML runtime or base dependency yet.

The current project surface should stay centered on validated ONNX bundles, load_onnx_runtime_bundle(...), the minimal REST adapter decision, and the ROS 2 beta. Triton, KServe-compatible servers, and similar production-serving systems should be treated as deployment infrastructure that consumes the existing bundle contract, not as the source of a new strict-inference contract.

Useful Scope

The first useful artifact would be a recipe, not package code:

  • convert a validated bundle into a Triton model-repository layout,
  • place model.onnx under a versioned model directory,
  • generate or document a minimal config.pbtxt with ONNX Runtime backend inputs, outputs, batch axis, and dynamic batching settings derived from metadata.json,
  • keep metadata.json next to the model as the ZeroProofML semantic sidecar,
  • verify that Triton returns the stable decoded, bottom_mask, and gap_mask outputs in the bundle metadata order, and
  • compare one smoke batch against load_onnx_runtime_bundle(...) before promoting the repository to production.
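
The repository layout and config steps above can be sketched as a small standalone script. This is an illustration under stated assumptions, not package code: the metadata.json keys ("inputs", "outputs", each with "name", "dtype", "shape") are hypothetical, shapes are assumed to exclude the batch axis, and build_repository and render_config are names invented for this sketch. The config.pbtxt fields (backend, max_batch_size, input, output, dynamic_batching) follow Triton's documented model configuration format.

```python
import json
import shutil
from pathlib import Path

# Mapping from assumed metadata dtype strings to Triton config data types.
TRITON_DTYPES = {
    "float32": "TYPE_FP32",
    "float16": "TYPE_FP16",
    "int32": "TYPE_INT32",
    "int64": "TYPE_INT64",
    "bool": "TYPE_BOOL",
}


def _tensor_entries(specs):
    """Render input/output tensor entries in the order given by the metadata."""
    entries = []
    for spec in specs:
        name = spec["name"]
        data_type = TRITON_DTYPES[spec["dtype"]]
        dims = ", ".join(str(d) for d in spec["shape"])  # batch axis excluded
        entries.append(
            "  {\n"
            f'    name: "{name}"\n'
            f"    data_type: {data_type}\n"
            f"    dims: [ {dims} ]\n"
            "  }"
        )
    return ",\n".join(entries)


def render_config(model_name, metadata, max_batch_size=8):
    """Build a minimal config.pbtxt for the ONNX Runtime backend."""
    inputs = _tensor_entries(metadata["inputs"])    # hypothetical key
    outputs = _tensor_entries(metadata["outputs"])  # hypothetical key
    return (
        f'name: "{model_name}"\n'
        'backend: "onnxruntime"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{inputs}\n]\n"
        f"output [\n{outputs}\n]\n"
        "dynamic_batching { }\n"  # default dynamic batching settings
    )


def build_repository(bundle_dir, repo_dir, model_name, version=1, max_batch_size=8):
    """Copy a validated bundle into <repo>/<model>/<version>/ and write config.pbtxt."""
    bundle_dir, repo_dir = Path(bundle_dir), Path(repo_dir)
    metadata = json.loads((bundle_dir / "metadata.json").read_text())
    model_dir = repo_dir / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(bundle_dir / "model.onnx", version_dir / "model.onnx")
    # Keep the semantic sidecar next to the model; this sketch assumes the
    # server ignores the extra file.
    shutil.copy2(bundle_dir / "metadata.json", version_dir / "metadata.json")
    config = render_config(model_name, metadata, max_batch_size)
    (model_dir / "config.pbtxt").write_text(config)
    return model_dir
```

A call such as build_repository("bundle/", "model_repository/", "zeroproof_demo") would produce model_repository/zeroproof_demo/config.pbtxt and model_repository/zeroproof_demo/1/model.onnx, which a Triton server started with --model-repository pointing at model_repository/ could then load. Writing the outputs in metadata order keeps the config aligned with the bundle contract instead of introducing a second ordering.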

This recipe can reference Triton's documentation for the model repository, ONNX Runtime backend, dynamic batching, concurrent execution, health endpoints, and metrics.

Promotion Gate

Do not add Triton containers, CI, generated model repositories, or client code until a deployment has at least one of these needs:

  • request concurrency high enough that dynamic batching materially changes throughput or latency,
  • multi-model or multi-version hosting that would otherwise duplicate local service code,
  • GPU scheduling or instance-group tuning beyond plain ONNX Runtime providers,
  • operations requirements for standardized health, metrics, and model-control APIs, or
  • an existing production platform that already standardizes on Triton, KServe-compatible inference protocols, or a similar model server.

Until one of those needs exists, Triton support would mostly duplicate the lighter ONNX Runtime and REST surfaces while adding container, configuration, and compatibility maintenance.

Contract Boundaries

Triton can execute the exported ONNX graph, but it does not own ZeroProofML's strict-inference semantics. metadata.json remains the authority for strict_inference_fields, mask_semantics, thresholds, provenance mode, and fallback policy identifiers. Any Triton recipe must preserve that sidecar and run a parity check against the in-process ONNX Runtime bundle loader.
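
The parity check mentioned above could look like the following sketch. It assumes both the in-process loader and the Triton client results have already been converted to plain nested lists keyed by output name (for example via an array's tolist()); parity_report and its tolerances are hypothetical choices for illustration. Masks are compared exactly, while decoded values are compared within a tolerance, with paired NaNs treated as equal on the assumption that a server and a local run may both emit NaN at the same positions.

```python
import math

# Stable strict output names, in the order the bundle metadata declares them.
STRICT_OUTPUTS = ("decoded", "bottom_mask", "gap_mask")


def _flatten(value):
    """Yield scalars from arbitrarily nested lists or tuples."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        yield value


def _close(a, b, rel_tol, abs_tol):
    """Tolerant float comparison that treats a NaN pair as equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


def parity_report(reference, candidate, names=STRICT_OUTPUTS,
                  rel_tol=1e-5, abs_tol=1e-6):
    """Return a list of mismatch descriptions; an empty list means parity."""
    problems = []
    for name in names:
        if name not in candidate:
            problems.append(f"{name}: missing from candidate outputs")
            continue
        ref = list(_flatten(reference[name]))
        got = list(_flatten(candidate[name]))
        # Flattening drops shape information, so only element counts are checked.
        if len(ref) != len(got):
            problems.append(f"{name}: {len(ref)} vs {len(got)} elements")
            continue
        if name.endswith("_mask"):
            ok = ref == got  # masks must match exactly
        else:
            ok = all(_close(a, b, rel_tol, abs_tol) for a, b in zip(ref, got))
        if not ok:
            problems.append(f"{name}: values differ")
    return problems
```

A promotion script could run one smoke batch through both paths and refuse to promote the repository unless parity_report(local_outputs, triton_outputs) returns an empty list.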

Avoid using Triton's Python backend or ensemble features for ZeroProofML post-processing until a concrete deployment requires server-side preprocessing, postprocessing, or routing. The current ONNX graph already carries the stable strict output tuple, and adding server-side Python would create another place where mask semantics could drift.

Relationship To REST And ROS 2

The minimal REST adapter remains the simpler process-boundary path for one bundle per service. ROS 2 remains the first-party robotics integration path. Triton should be revisited only for production serving concerns such as batching, model hosting, GPU utilization, Kubernetes-style operations, and standardized inference-server telemetry.