REST/gRPC Bundle Service Decision

Review date: 2026-04-19.

Decision

A minimal REST inference service around validated ONNX bundles is worthwhile as an optional integration surface, but it should stay a thin adapter over load_onnx_runtime_bundle(...) rather than becoming a new runtime stack. Do not add a gRPC service or protobuf contract yet. Keep any future HTTP server outside the base install path so the core package does not gain web-framework or serving dependencies.

REST Scope

The useful first service is deliberately small:

  • load exactly one bundle at startup from a configured bundle_dir,
  • fail fast through the existing bundle metadata validation path,
  • expose health and metadata readback endpoints for operators,
  • expose one inference endpoint that accepts named tensor inputs in bundle metadata order,
  • return the stable decoded, bottom_mask, and gap_mask fields plus any metadata needed to interpret the response, and
  • reuse StrictInferenceMonitor for aggregate mask/gap/provenance summaries.

The service should not include model registry behavior, live bundle reloads, auth, autoscaling, streaming, request batching, or training-stack imports. Those belong in deployment infrastructure or a later serving layer once there is a real consumer requirement.
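
The scope above can be sketched with the standard library alone. This is a rough illustration, not a proposed implementation. The endpoint paths (/healthz, /metadata, /infer), the JSON request shape, and the StubBundle class with its input_names metadata key and run method are all assumptions made for the example. The real return type of load_onnx_runtime_bundle(...) and the StrictInferenceMonitor interface are not described in this document, so the sketch stubs the former and omits the latter.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubBundle:
    """Stand-in for a loaded bundle; the real object's shape is an assumption."""

    def __init__(self):
        # Hypothetical metadata layout, used only by this sketch.
        self.metadata = {"input_names": ["signal"], "contract": "experimental"}

    def run(self, inputs):
        # Placeholder output that reuses the three stable field names.
        values = inputs["signal"]
        return {
            "decoded": values,
            "bottom_mask": [False] * len(values),
            "gap_mask": [v is None for v in values],
        }


def handle_request(bundle, method, path, body=b""):
    """Map one request to (status, payload) without touching sockets."""
    if method == "GET" and path == "/healthz":
        return 200, {"status": "ok"}
    if method == "GET" and path == "/metadata":
        return 200, bundle.metadata
    if method == "POST" and path == "/infer":
        try:
            inputs = json.loads(body)["inputs"]
        except (ValueError, KeyError, TypeError):
            return 400, {"error": "body must be JSON with an 'inputs' object"}
        expected = bundle.metadata["input_names"]
        # Require exactly the named tensors, in bundle metadata order.
        if not isinstance(inputs, dict) or list(inputs) != expected:
            return 400, {"error": f"expected inputs in order {expected}"}
        return 200, bundle.run(inputs)
    return 404, {"error": "not found"}


def make_handler(bundle):
    """Bind one bundle to a request handler class (one bundle per process)."""

    class Handler(BaseHTTPRequestHandler):
        def _reply(self, method):
            length = int(self.headers.get("Content-Length") or 0)
            body = self.rfile.read(length) if length else b""
            status, payload = handle_request(bundle, method, self.path, body)
            data = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self):
            self._reply("GET")

        def do_POST(self):
            self._reply("POST")

    return Handler


def serve(bundle, host="127.0.0.1", port=8080):
    """Serve one already-validated bundle until interrupted."""
    HTTPServer((host, port), make_handler(bundle)).serve_forever()
```

Keeping the routing in a plain function separate from the socket handler means the adapter logic can be smoke-tested without starting a server. A real adapter would replace StubBundle() with the result of load_onnx_runtime_bundle(...) at startup and let any validation error abort the process before serve is called.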

Why REST First

  • REST covers non-ROS consumers that need a simple process boundary without linking against Python or ONNX Runtime directly.
  • The current bundle contract already provides the necessary runtime metadata, tensor order, mask semantics, and strict output fields.
  • A tiny HTTP adapter is easier to document and smoke-test than a stable protobuf API while the provenance-bearing bundle contract remains experimental.
  • ROS 2 remains the first-party robotics path; REST is for local services, batch jobs, notebooks, and operator handoff where ROS is unnecessary.
  • Keeping the server optional preserves the dependency-light core package.
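
One way to keep the server optional is to import serving-only dependencies lazily, inside the adapter entry point, so that importing the core package never pulls them in. The helper below is a minimal sketch of that pattern. The function name require_optional and the extra name passed to it are hypothetical and do not refer to anything that exists in the project.

```python
import importlib


def require_optional(module_name, extra):
    """Import a serving-only dependency, or explain how to enable it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # 'extra' is a hypothetical packaging extra name used for the hint.
        raise ImportError(
            f"{module_name!r} is needed only for the optional REST adapter; "
            f"install the '{extra}' extra to enable it."
        ) from exc
```

An adapter entry point would call this helper for its web framework at the moment the server starts, so a base install fails only when someone actually tries to run the server.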

Deferred gRPC Path

Revisit gRPC only when a concrete deployment needs a capability that REST handles poorly:

  • binary tensor transport with lower serialization overhead,
  • streaming or bidirectional request patterns,
  • a language-neutral IDL that multiple maintained clients will consume, or
  • strict integration with an existing gRPC control plane.

Until then, a protobuf surface would create another compatibility contract before the project has enough evidence that it is worth maintaining.
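
The serialization-overhead point can be made concrete with a small size comparison. The sketch below assumes a JSON request body that carries one tensor as a list of numbers under a hypothetical "signal" key, and compares it with the same values packed as little-endian float32. It measures payload size only, not encode or decode time, and the exact ratio depends on the values and on how they are formatted.

```python
import json
import random
import struct


def compare_payload_sizes(n, seed=0):
    """Return (json_size, binary_size) in bytes for n float values."""
    rng = random.Random(seed)
    values = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # JSON writes each value as decimal text plus separators.
    json_size = len(json.dumps({"signal": values}).encode("utf-8"))
    # Packed float32 costs exactly 4 bytes per value, with no padding.
    binary_size = len(struct.pack(f"<{n}f", *values))
    return json_size, binary_size


json_size, binary_size = compare_payload_sizes(1024)
print(binary_size)              # 4096
print(json_size > binary_size)  # True
```

For small tensors and low request rates this difference rarely matters, which is why it is listed as a trigger for revisiting gRPC rather than a reason to adopt it now.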

Relationship To Inference Servers

This decision does not replace the later evaluation of Triton or a similar inference server. A minimal REST adapter is a convenience and interoperability path that serves one bundle per process. Triton-style serving should be evaluated separately for batching, multi-model hosting, GPU scheduling, metrics, and production deployment features.