Check 02 of 03 · Free implementation guide
Inference equivalence
Does your model produce the same outputs in both environments for identical inputs? Run this after confirming artifact parity. Takes 15–30 minutes.
Why this matters
Even with identical artifacts, different runtime environments can produce different outputs.
TF32 precision differences between GPU architectures. cuDNN algorithm selection that varies by hardware. BLAS backend differences between machines. Batch normalization in train mode instead of eval mode.
The way to find this: give both environments identical inputs and compare what comes out.
Step 01 — Collect golden inputs
You need 50–100 representative production inputs. Real inputs that your system actually processes. Not synthetic test cases — actual production data that represents the distribution your model sees.
Store them as a numpy array:
import numpy as np
import json

# If you have logged production inputs
golden_inputs = []
with open("production_logs.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 100:
            break
        log = json.loads(line)
        golden_inputs.append(np.array(log["feature_vector"]))

X = np.stack(golden_inputs)
np.save("golden_inputs.npy", X)

If you don't have production logs yet, use your evaluation test set as a proxy. It won't catch all runtime divergences, but it will catch the most common ones.
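A minimal sketch of that fallback, assuming your evaluation features are stored in eval_features.npy (an illustrative path; substitute your own test split):

import numpy as np

# Fallback: sample 100 rows from the evaluation set as stand-in golden inputs.
X_eval = np.load("eval_features.npy")
rng = np.random.default_rng(seed=0)  # fixed seed keeps the sample reproducible
idx = rng.choice(len(X_eval), size=min(100, len(X_eval)), replace=False)
np.save("golden_inputs.npy", X_eval[idx])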
Step 02 — Score in both environments
import numpy as np
import requests

X = np.load("golden_inputs.npy")

def score_endpoint(url, inputs):
    logits = []
    for row in inputs:
        response = requests.post(
            url,
            json={"features": row.tolist()},
            timeout=10,
        )
        response.raise_for_status()  # fail fast on HTTP errors instead of parsing a bad body
        logits.append(response.json()["logits"])
    return np.array(logits)

staging_logits = score_endpoint("http://model.staging/predict", X)
production_logits = score_endpoint("http://model.production/predict", X)
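If staging and production aren't reachable from the same machine, score each side separately and compare the saved logits offline. A minimal sketch, assuming you set an ENV variable per environment (the variable name and file paths are illustrative, not from this guide):

import os

env = os.environ.get("ENV", "staging")  # "staging" or "production"
logits = score_endpoint(f"http://model.{env}/predict", X)
np.save(f"logits_{env}.npy", logits)

# Later, on any one machine:
# staging_logits = np.load("logits_staging.npy")
# production_logits = np.load("logits_production.npy")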
Step 03 — Compare outputs
# How often the two environments agree on the predicted class
argmax_agreement = (
    staging_logits.argmax(axis=1) == production_logits.argmax(axis=1)
).mean()

# Maximum absolute difference in logit values
max_abs_diff = np.abs(staging_logits - production_logits).max()

# Mean absolute difference per class
mean_abs_diff = np.abs(staging_logits - production_logits).mean(axis=0)

# Inputs where the environments disagree on the top prediction
disagreement_indices = np.where(
    staging_logits.argmax(axis=1) != production_logits.argmax(axis=1)
)[0]

print(f"Argmax agreement: {argmax_agreement:.4f}")
print(f"Max absolute diff: {max_abs_diff:.8f}")
print(f"Mean absolute diff: {mean_abs_diff}")
print(f"Disagreements: {len(disagreement_indices)} of {len(X)}")

if len(disagreement_indices) > 0:
    print("\nFirst 5 disagreeing inputs:")
    for idx in disagreement_indices[:5]:
        print(f"  Input {idx}: staging={staging_logits[idx].argmax()}, "
              f"production={production_logits[idx].argmax()}")
Interpreting the results
Healthy system
Argmax agreement > 99.9%. Max absolute difference < 0.0001. Disagreements: 0 or 1 out of 100.
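If you run this check in CI, you can encode those thresholds as assertions so a regression fails loudly. A minimal sketch using the variables computed in Step 03:

# Codify the healthy-system thresholds above as hard pass/fail criteria.
assert argmax_agreement > 0.999, f"argmax agreement too low: {argmax_agreement:.4f}"
assert max_abs_diff < 1e-4, f"max logit diff too high: {max_abs_diff:.2e}"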
Significant divergence (argmax agreement < 99% or max diff > 0.01)
The runtime environment is producing different outputs. Most likely causes, in order (a combined fix sketch follows this list):
TF32 precision mismatch between GPU architectures. Fix: set TF32 flags to match the training environment (see Check 01).
cuDNN algorithm non-determinism. Fix: torch.backends.cudnn.deterministic = True.
model.eval() not called. Fix: add it before every inference call.
Library version difference affecting numerical computation. Fix: pin all numerical libraries to identical versions.
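A combined sketch of the first three fixes in PyTorch. Which flags you actually need depends on how the model was trained, so treat these values as a starting point rather than a prescription; model and batch stand in for your own serving objects:

import torch

# Match the training environment's TF32 behavior (see Check 01).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Force deterministic cuDNN kernel selection; disable autotuning.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model.eval()  # batch norm and dropout switch to inference behavior
with torch.no_grad():
    logits = model(batch)  # `model` and `batch` are placeholders for your serving code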
Logits are identical but your application-level predictions differ
The model is fine. The bug is in your post-processing — threshold logic, calibration, output formatting, or label mapping that differs between environments.
Compare your post-processing code between staging and production line by line.
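A toy illustration of how this looks (the scores and thresholds below are invented): the model scores match exactly, but a decision threshold drifted between deployments.

import numpy as np

scores = np.array([0.48, 0.505, 0.52])  # identical in both environments

staging_preds = (scores > 0.50).astype(int)     # staging threshold
production_preds = (scores > 0.51).astype(int)  # production threshold drifted

print(np.where(staging_preds != production_preds)[0])  # input 1 flips class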
Everything matches
The model and runtime environment are not the problem. The problem is in the features — what the model is receiving as input differs between training and serving. Move to Check 03.
What to do with your finding
If you found runtime divergence: Fix the environment configuration. Redeploy. Re-run this check to confirm the fix landed. If argmax agreement hits 99.9%+ — you may have found the entire root cause.
If everything matched: Move to Check 03. The failure is in the pipeline.
Next
Check 03 — Log-and-replay feature diff
The Ground Truth Framework
The complete diagnostic playbook — 125 pages.
Full research foundation (Sculley, Zhang, Breck, DoorDash post-mortem). Every check with runnable code and interpretation tables. Four detailed case studies across defense, medical, industrial, and SaaS. Vertical-specific guides. Bibliography.
Found something but not sure whether it's the root cause?
That's what the diagnosis is for.
48 hours · Moe personally · Full refund if fewer than 3 findings · No calls