Triton Inference Server Decision
Review date: 2026-04-19.
Decision
A Triton-style inference-server path is useful as an optional downstream deployment recipe after the ONNX Runtime bundle path is stable, but it should not become a first-party ZeroProofML runtime or base dependency yet.
The current project surface should stay centered on validated ONNX bundles,
load_onnx_runtime_bundle(...), the minimal REST adapter decision, and the
ROS 2 beta. Triton, KServe-compatible servers, or similar production-serving
systems should be treated as deployment infrastructure that consumes the
existing bundle contract rather than a new strict-inference contract.
Useful Scope
The first useful artifact would be a recipe, not package code:
- convert a validated bundle into a Triton model-repository layout,
- place model.onnx under a versioned model directory,
- generate or document a minimal config.pbtxt with ONNX Runtime backend inputs, outputs, batch axis, and dynamic batching settings derived from metadata.json,
- keep metadata.json next to the model as the ZeroProofML semantic sidecar,
- verify that Triton returns the stable decoded, bottom_mask, and gap_mask outputs in the bundle metadata order, and
- compare one smoke batch against load_onnx_runtime_bundle(...) before promoting the repository to production.
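The layout steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not project code: the metadata.json keys used here (inputs and outputs, each with name, dtype, and shape, batch axis first) are hypothetical, as are the helper names, the model name, and the max_batch_size default. The real bundle schema may differ. The config.pbtxt fields themselves (name, backend, max_batch_size, input, output, dynamic_batching) follow Triton's documented model configuration.

```python
import json
import shutil
from pathlib import Path

# Hypothetical mapping from metadata dtype strings to Triton config data types.
_TRITON_DTYPES = {"float32": "TYPE_FP32", "bool": "TYPE_BOOL", "int64": "TYPE_INT64"}


def _tensor_block(spec):
    # Triton omits the batch axis from dims when max_batch_size > 0, so drop
    # the leading axis. Assumes every tensor has at least one non-batch axis.
    dims = ", ".join(str(d) for d in spec["shape"][1:])
    return (
        "  {\n"
        f'    name: "{spec["name"]}"\n'
        f"    data_type: {_TRITON_DTYPES[spec['dtype']]}\n"
        f"    dims: [ {dims} ]\n"
        "  }"
    )


def render_config(model_name, metadata, max_batch_size=8):
    """Render a minimal config.pbtxt, keeping outputs in metadata order."""
    inputs = ",\n".join(_tensor_block(s) for s in metadata["inputs"])
    outputs = ",\n".join(_tensor_block(s) for s in metadata["outputs"])
    return (
        f'name: "{model_name}"\n'
        'backend: "onnxruntime"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{inputs}\n]\n"
        f"output [\n{outputs}\n]\n"
        "dynamic_batching { }\n"  # empty block enables default dynamic batching
    )


def build_model_repository(bundle_dir, repo_dir, model_name, version=1):
    """Copy a bundle into <repo>/<model>/<version>/ and write config.pbtxt."""
    bundle_dir, repo_dir = Path(bundle_dir), Path(repo_dir)
    metadata = json.loads((bundle_dir / "metadata.json").read_text())
    model_dir = repo_dir / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(bundle_dir / "model.onnx", version_dir / "model.onnx")
    # Keep the semantic sidecar next to the model file.
    shutil.copy2(bundle_dir / "metadata.json", version_dir / "metadata.json")
    (model_dir / "config.pbtxt").write_text(render_config(model_name, metadata))
    return model_dir
```

The resulting directory could then be served with tritonserver --model-repository=<repo_dir>. Generating the config from metadata.json, rather than writing it by hand, keeps the output order tied to the bundle contract.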
This recipe can reference Triton's documented model repository, ONNX Runtime backend, dynamic batching, concurrent execution, health endpoints, and metrics:
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/backend/README.html
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/optimization.html
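As a rough illustration of the health and metrics surface mentioned above, the sketch below builds the probe URLs a recipe might check before sending a smoke batch. The paths follow the KServe-compatible HTTP protocol that Triton implements, and ports 8000 (HTTP) and 8002 (metrics) are Triton's defaults. The helper names, host, and model name are assumptions for this example.

```python
import urllib.error
import urllib.request


def probe_urls(model_name, host="localhost", http_port=8000, metrics_port=8002):
    """Return the liveness, readiness, per-model readiness, and metrics URLs."""
    base = f"http://{host}:{http_port}"
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model_name}/ready",
        "metrics": f"http://{host}:{metrics_port}/metrics",
    }


def is_ok(url, timeout=2.0):
    """Return True only if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False
```

A recipe could call is_ok on each URL from probe_urls("zeroproof_demo") and refuse to run the smoke batch until all of them succeed.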
Promotion Gate
Do not add Triton containers, CI, generated model repositories, or client code until a deployment has at least one of these needs:
- high enough request concurrency that dynamic batching changes throughput or latency materially,
- multi-model or multi-version hosting that would otherwise duplicate local service code,
- GPU scheduling or instance-group tuning beyond plain ONNX Runtime providers,
- operations requirements for standardized health, metrics, and model-control APIs, or
- an existing production platform that already standardizes on Triton, KServe-compatible inference protocols, or a similar model server.
Until one of those needs exists, Triton support would mostly duplicate the lighter ONNX Runtime and REST surfaces while adding container, configuration, and compatibility maintenance.
Contract Boundaries
Triton can execute the exported ONNX graph, but it does not own ZeroProofML's
strict-inference semantics. metadata.json remains the authority for
strict_inference_fields, mask_semantics, thresholds, provenance mode, and
fallback policy identifiers. Any Triton recipe must preserve that sidecar and
run a parity check against the in-process ONNX Runtime bundle loader.
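A parity check of this kind can be small. The sketch below compares one smoke batch of outputs from the in-process loader (the reference) against the same batch returned by Triton (the candidate). It is an assumption-laden illustration: the helper names and tolerances are hypothetical, outputs are assumed to arrive as name-to-nested-list mappings (for example via .tolist() on arrays), and fetching the two result sets is left out because the relevant call signatures are not specified here. Masks are compared exactly, on the assumption that any tolerance on a mask would hide semantic drift.

```python
import math

# Stable strict output tuple, in bundle metadata order.
STRICT_OUTPUTS = ("decoded", "bottom_mask", "gap_mask")


def _flatten(values):
    # Walk nested lists or tuples and yield scalar leaves in order.
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from _flatten(item)
    else:
        yield values


def _same_float(a, b, rel_tol, abs_tol):
    # NaN matches only NaN; everything else uses a tolerance comparison.
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


def parity_problems(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return a list of human-readable mismatches; empty means parity holds."""
    if list(candidate) != list(STRICT_OUTPUTS):
        return [f"output order {list(candidate)} does not match {list(STRICT_OUTPUTS)}"]
    problems = []
    for name in STRICT_OUTPUTS:
        ref = list(_flatten(reference[name]))
        got = list(_flatten(candidate[name]))
        if len(ref) != len(got):
            problems.append(f"{name}: element count {len(got)} != {len(ref)}")
        elif name == "decoded":
            if not all(_same_float(a, b, rel_tol, abs_tol) for a, b in zip(ref, got)):
                problems.append(f"{name}: values differ beyond tolerance")
        elif ref != got:
            problems.append(f"{name}: mask values differ")
    return problems
```

An empty result from parity_problems(reference, candidate) would be the condition for promoting the repository. Any entry in the list should block promotion until the difference is explained.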
Avoid using Triton's Python backend or ensemble features for ZeroProofML post-processing until a concrete deployment requires server-side preprocessing, postprocessing, or routing. The current ONNX graph already carries the stable strict output tuple, and adding server-side Python would create another place where mask semantics could drift.
Relationship To REST And ROS 2
The minimal REST adapter remains the simpler process-boundary path for one bundle per service. ROS 2 remains the first-party robotics integration path. Triton should be revisited only for production serving concerns such as batching, model hosting, GPU utilization, Kubernetes-style operations, and standardized inference-server telemetry.