ONNX Runtime: The Swiss Army Knife of Model Serving
Cross-platform inference, quantization workflows, graph optimization, and benchmarks across CPU, GPU, and edge devices.
Why ONNX
ONNX (Open Neural Network Exchange) is an open format for representing ML models. ONNX Runtime (ORT) is Microsoft's inference engine that runs ONNX models on virtually any hardware.
Why do we use it on almost every project? Three reasons:
- Framework agnostic. Train in PyTorch, TensorFlow, or scikit-learn. Deploy with one runtime.
- Hardware agnostic. Same model runs on CPU, NVIDIA GPU, Intel hardware, ARM, and edge devices.
- Automatic optimization. Graph-level optimizations that improve performance without manual tuning.
Converting Models to ONNX
From PyTorch
import torch
import torch.onnx

model = MyModel()
model.load_state_dict(torch.load("model.pt"))
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    opset_version=17
)
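After exporting, it's worth checking numerical parity between the original model and the ONNX graph before going further. A minimal sketch — the `verify_export` helper and its names are our convention, not part of the ORT API; in our experience a gap much above 1e-4 for an FP32 export usually signals a problem:

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def verify_export(onnx_path, torch_model, dummy_input, input_name="input"):
    """Run the same input through PyTorch and ONNX Runtime, return the gap.

    Imports are deferred so this helper can live in a module that is also
    imported on machines without torch/onnxruntime installed.
    """
    import torch
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    with torch.no_grad():
        torch_out = torch_model(dummy_input).numpy()
    ort_out = session.run(None, {input_name: dummy_input.numpy()})[0]
    return max_abs_diff(torch_out.ravel(), ort_out.ravel())
```

If the gap is large, the usual suspects are a model left in training mode, unsupported ops silently decomposed, or a mismatched opset.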
From scikit-learn
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# num_features: the number of input columns the model was trained on
initial_type = [("input", FloatTensorType([None, num_features]))]
onnx_model = convert_sklearn(sklearn_model, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
Graph Optimization
ORT's graph optimizer applies transformations that reduce computation without changing model behavior:
import onnxruntime as ort
sess_options = ort.SessionOptions()
# Level 1: Basic (constant folding, redundant node elimination)
# Level 2: Extended (attention fusion, layer normalization fusion)
# Level 99: All optimizations
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Save optimized model for inspection
sess_options.optimized_model_filepath = "model_optimized.onnx"
session = ort.InferenceSession("model.onnx", sess_options,
                               providers=["CPUExecutionProvider"])
Key optimizations:
- Operator fusion: Combines Conv + BN + ReLU into a single fused operator
- Constant folding: Pre-computes operations on constant tensors
- Shape inference: Eliminates dynamic shape computation where possible
- Memory planning: Optimizes tensor memory allocation and reuse
On transformer models, operator fusion alone can provide 20-40% speedup by fusing multi-head attention components.
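The gains are easy to measure directly: build one session per optimization level and time the same input through each. A sketch — the helper names, iteration counts, and CPU-only provider choice are ours, not ORT conventions:

```python
import statistics
import time

def time_fn(fn, warmup=10, iters=100):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def compare_opt_levels(model_path, input_feed):
    """Time session.run at each graph optimization level."""
    import onnxruntime as ort

    levels = {
        "disabled": ort.GraphOptimizationLevel.ORT_DISABLE_ALL,
        "basic": ort.GraphOptimizationLevel.ORT_ENABLE_BASIC,
        "extended": ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED,
        "all": ort.GraphOptimizationLevel.ORT_ENABLE_ALL,
    }
    results = {}
    for name, level in levels.items():
        opts = ort.SessionOptions()
        opts.graph_optimization_level = level
        sess = ort.InferenceSession(model_path, opts,
                                    providers=["CPUExecutionProvider"])
        results[name] = time_fn(lambda: sess.run(None, input_feed))
    return results
```

Warming up before timing matters: the first few runs pay one-off costs (memory arena growth, lazy kernel setup) that would skew the numbers.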
Quantization with ORT
ORT supports both dynamic and static quantization:
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_dynamic,
    quantize_static,
)

# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
# Static quantization (better accuracy, needs calibration)
class MyCalibrationReader(CalibrationDataReader):
    def __init__(self, calibration_data):
        self.data = iter(calibration_data)

    def get_next(self):
        try:
            return {"input": next(self.data)}
        except StopIteration:
            return None

quantize_static(
    "model.onnx",
    "model_int8_static.onnx",
    calibration_data_reader=MyCalibrationReader(cal_data)
)
Execution Providers
ORT's execution provider system lets you target different hardware with the same code:
# CPU
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# NVIDIA GPU
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
# TensorRT (maximum NVIDIA performance)
session = ort.InferenceSession("model.onnx", providers=["TensorrtExecutionProvider"])
# Intel OpenVINO
session = ort.InferenceSession("model.onnx", providers=["OpenVINOExecutionProvider"])
# Fallback chain: try GPU first, fall back to CPU
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
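One gotcha: requesting a provider that isn't compiled into your onnxruntime wheel fails or warns, depending on the version. We filter the wish list against `ort.get_available_providers()` before building the session — a sketch with hypothetical helper names:

```python
def pick_providers(preferred, available):
    """Keep preference order, drop providers this build doesn't have,
    and always leave CPU as the final fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

def make_session(model_path,
                 preferred=("TensorrtExecutionProvider",
                            "CUDAExecutionProvider")):
    """Build a session using the best providers actually available."""
    import onnxruntime as ort

    providers = pick_providers(list(preferred), ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

With this pattern the same deployment artifact runs on a GPU node, a CPU-only VM, or a developer laptop without code changes.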
Benchmarks From Our Deployments
Results from real-world models we've deployed:
Image Classification (ResNet-50, batch size 1)
| Platform | FP32 | INT8 | Speedup |
|----------|------|------|---------|
| CPU (Xeon 8375C) | 28ms | 11ms | 2.5x |
| GPU (T4) | 4.2ms | 2.1ms | 2.0x |
| Jetson Orin | 8.5ms | 3.8ms | 2.2x |
Object Detection (YOLOv8-s, 640x640)
| Platform | FP32 | INT8 | Speedup |
|----------|------|------|---------|
| CPU (Xeon 8375C) | 95ms | 38ms | 2.5x |
| GPU (T4) | 12ms | 5.8ms | 2.1x |
| Jetson Orin | 22ms | 9.5ms | 2.3x |
Speech Recognition (DistilHuBERT, 1s audio)
| Platform | FP32 | INT8 | Speedup |
|----------|------|------|---------|
| CPU (Xeon 8375C) | 85ms | 32ms | 2.7x |
| Jetson Orin | 45ms | 18ms | 2.5x |
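Single latency numbers like these hide tail behavior; they are easiest to trust when reported alongside percentiles. A small stand-alone summarizer one might use when reproducing such benchmarks (pure Python, nearest-rank percentiles — the helper is ours, not part of ORT):

```python
def latency_summary(samples_ms):
    """p50/p95/p99 of a list of per-request latencies in milliseconds,
    using nearest-rank percentile selection on the sorted samples."""
    ordered = sorted(samples_ms)

    def pct(q):
        k = min(len(ordered) - 1, int(round(q / 100 * (len(ordered) - 1))))
        return ordered[k]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
```

A p99 that is several multiples of p50 usually points to contention (thread oversubscription, GC pauses, noisy neighbors) rather than the model itself.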
Production Serving Pattern
Our standard ONNX serving setup:
from fastapi import FastAPI
import onnxruntime as ort
import numpy as np

app = FastAPI()

# Load model once at startup
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    sess_options=get_optimized_options()
)

@app.post("/predict")
async def predict(data: PredictionRequest):
    # PredictionRequest, preprocess, and postprocess are app-specific
    input_array = preprocess(data)
    outputs = session.run(None, {"input": input_array})
    return {"prediction": postprocess(outputs[0])}

@app.get("/health")
async def health():
    # DUMMY_INPUT: a fixed tensor matching the model's expected input shape
    test_output = session.run(None, {"input": DUMMY_INPUT})
    return {"status": "healthy"}
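The `get_optimized_options()` helper above isn't shown; a plausible definition, assuming the goals are full graph optimization and an explicitly bounded thread pool (the thread count of 4 is an assumption — tune it to the cores actually available to the serving container):

```python
def get_optimized_options():
    """Session options for serving: full graph optimization plus an
    explicit intra-op thread count instead of the default (all cores),
    which can oversubscribe a shared host."""
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    opts.intra_op_num_threads = 4  # assumption: tune per deployment host
    return opts
```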
When Not to Use ONNX
ONNX isn't always the answer:
- Custom operators that aren't in the ONNX spec require writing C++ extensions
- Dynamic control flow (if/else based on tensor values) has limited support
- Very large models (LLMs with 70B+ parameters) are better served with specialized frameworks like vLLM or TGI
- Training — ONNX Runtime Training exists but isn't as mature as PyTorch/TensorFlow for training
For everything else — especially edge deployment, cross-platform serving, and optimized inference — ONNX Runtime is our default choice. It's the Swiss Army knife that goes on every deployment.