Check 02 of 03 · Free implementation guide
Inference equivalence
Does your model produce the same outputs in both environments for identical inputs? Run this after confirming artifact parity. Takes 15–30 minutes.
Why this matters
Even with identical artifacts, different runtime environments can produce different outputs.
TF32 precision differences between GPU architectures. cuDNN algorithm selection that varies by hardware. BLAS backend differences between machines. Batch normalization in train mode instead of eval mode.
The way to find this: give both environments identical inputs and compare what comes out.
Step 01 — Collect golden inputs
You need 50–100 representative production inputs. Real inputs that your system actually processes. Not synthetic test cases — actual production data that represents the distribution your model sees.
Store them as a numpy array:
import numpy as np
import json

# If you have logged production inputs
golden_inputs = []
with open("production_logs.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 100:
            break
        log = json.loads(line)
        golden_inputs.append(np.array(log["feature_vector"]))

X = np.stack(golden_inputs)
np.save("golden_inputs.npy", X)

If you don't have production logs yet, use your evaluation test set as a proxy. It won't catch all runtime divergences, but it will catch the most common ones.
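A minimal sketch of that fallback, assuming your evaluation features are stored in eval_features.npy (an illustrative path; substitute your own test split):

import numpy as np

# Fallback: sample 100 rows from the evaluation set as stand-in golden inputs.
X_eval = np.load("eval_features.npy")
rng = np.random.default_rng(seed=0)  # fixed seed keeps the sample reproducible
idx = rng.choice(len(X_eval), size=min(100, len(X_eval)), replace=False)
np.save("golden_inputs.npy", X_eval[idx])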
Step 02 — Score in both environments
import numpy as np
import requests

X = np.load("golden_inputs.npy")

def score_endpoint(url, inputs):
    logits = []
    for row in inputs:
        response = requests.post(
            url,
            json={"features": row.tolist()},
            timeout=10,
        )
        response.raise_for_status()  # fail fast on HTTP errors instead of parsing a bad body
        logits.append(response.json()["logits"])
    return np.array(logits)

staging_logits = score_endpoint("http://model.staging/predict", X)
production_logits = score_endpoint("http://model.production/predict", X)
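If staging and production aren't reachable from the same machine, score each side separately and compare the saved logits offline. A minimal sketch, assuming you set an ENV variable per environment (the variable name and file paths are illustrative, not from this guide):

import os

env = os.environ.get("ENV", "staging")  # "staging" or "production"
logits = score_endpoint(f"http://model.{env}/predict", X)
np.save(f"logits_{env}.npy", logits)

# Later, on any one machine:
# staging_logits = np.load("logits_staging.npy")
# production_logits = np.load("logits_production.npy")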
Step 03 — Compare outputs
# How often the two environments agree on the predicted class
argmax_agreement = (
    staging_logits.argmax(axis=1) == production_logits.argmax(axis=1)
).mean()

# Maximum absolute difference in logit values
max_abs_diff = np.abs(staging_logits - production_logits).max()

# Mean absolute difference per class
mean_abs_diff = np.abs(staging_logits - production_logits).mean(axis=0)

# Inputs where the environments disagree on the top prediction
disagreement_indices = np.where(
    staging_logits.argmax(axis=1) != production_logits.argmax(axis=1)
)[0]

print(f"Argmax agreement: {argmax_agreement:.4f}")
print(f"Max absolute diff: {max_abs_diff:.8f}")
print(f"Mean absolute diff: {mean_abs_diff}")
print(f"Disagreements: {len(disagreement_indices)} of {len(X)}")

if len(disagreement_indices) > 0:
    print("\nFirst 5 disagreeing inputs:")
    for idx in disagreement_indices[:5]:
        print(f"  Input {idx}: staging={staging_logits[idx].argmax()}, "
              f"production={production_logits[idx].argmax()}")
Interpreting the results
Healthy system
Argmax agreement > 99.9%. Max absolute difference < 0.0001. Disagreements: 0 or 1 out of 100.
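If you run this check in CI, you can encode those thresholds as assertions so a regression fails loudly. A minimal sketch using the variables computed in Step 03:

# Codify the healthy-system thresholds above as hard pass/fail criteria.
assert argmax_agreement > 0.999, f"argmax agreement too low: {argmax_agreement:.4f}"
assert max_abs_diff < 1e-4, f"max logit diff too high: {max_abs_diff:.2e}"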
Significant divergence (argmax agreement < 99% or max diff > 0.01)
The runtime environment is producing different outputs. Most likely causes, in order (a combined fix sketch follows this list):
TF32 precision mismatch between GPU architectures. Fix: set TF32 flags to match the training environment (see Check 01).
cuDNN algorithm non-determinism. Fix: torch.backends.cudnn.deterministic = True.
model.eval() not called. Fix: add it before every inference call.
Library version difference affecting numerical computation. Fix: pin all numerical libraries to identical versions.
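A combined sketch of the first three fixes in PyTorch. Which flags you actually need depends on how the model was trained, so treat these values as a starting point rather than a prescription; model and batch stand in for your own serving objects:

import torch

# Match the training environment's TF32 behavior (see Check 01).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Force deterministic cuDNN kernel selection; disable autotuning.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model.eval()  # batch norm and dropout switch to inference behavior
with torch.no_grad():
    logits = model(batch)  # `model` and `batch` are placeholders for your serving code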
Logits are identical but your application-level predictions differ
The model is fine. The bug is in your post-processing — threshold logic, calibration, output formatting, or label mapping that differs between environments.
Compare your post-processing code between staging and production line by line.
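A toy illustration of how this looks (the scores and thresholds below are invented): the model scores match exactly, but a decision threshold drifted between deployments.

import numpy as np

scores = np.array([0.48, 0.505, 0.52])  # identical in both environments

staging_preds = (scores > 0.50).astype(int)     # staging threshold
production_preds = (scores > 0.51).astype(int)  # production threshold drifted

print(np.where(staging_preds != production_preds)[0])  # input 1 flips class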
Everything matches
The model and runtime environment are not the problem. The problem is in the features — what the model is receiving as input differs between training and serving. Move to Check 03.
What to do with your finding
If you found runtime divergence: Fix the environment configuration. Redeploy. Re-run this check to confirm the fix landed. If argmax agreement hits 99.9%+ — you may have found the entire root cause.
If everything matched: Move to Check 03. The failure is in the pipeline.
Next
Check 03 — Log-and-replay feature diff
The Ground Truth Framework
The complete diagnostic playbook — 125 pages.
Full research foundation (Sculley, Zhang, Breck, DoorDash post-mortem). Every check with runnable code and interpretation tables. Four detailed case studies across defense, medical, industrial, and SaaS. Vertical-specific guides. Bibliography.
Found something but not sure whether it's the root cause?
That's what the diagnosis is for.
48 hours · Moe personally · Full refund if fewer than 3 findings · No calls