Check 01 of 03 · Free implementation guide

Artifact and environment parity

Are staging and production actually running the same bytes? This check takes five minutes and catches the root cause behind an estimated 10–20% of production ML failures. Run it before anything else.

Why this matters

Partial rollouts, rebuilt container tags, and library version drift routinely leave production running a different preprocessor than staging — with the same model weights.

The outputs diverge. Nobody connects the dots because the model didn't change.

Five specific mechanisms cause this failure

Model or preprocessor checksum mismatch

A partial rollout or caching issue means production is running an older version of the model or preprocessor than staging. The outputs are different because the artifacts are different.

Container image digest mismatch

You deployed the same tag. But the underlying image was rebuilt — a base image update, a dependency upgrade. The digest changed. The behavior changed. Nobody checked the digest.

Library version drift

Your training environment pins scikit-learn 1.2.2. Your serving environment runs 1.3.x. Between those versions, OneHotEncoder's handling of unknown categories changed, StandardScaler's behavior with sparse matrices changed, and ColumnTransformer's output format changed. Your model is being served in a different world than the one it was trained in.

GPU precision differences

NVIDIA's Ampere and Hopper GPUs (A100, H100) default to TF32 precision for matrix multiplications — 10-bit mantissa instead of 23-bit. A model trained on a V100 (full float32) and served on an A100 (silently TF32) produces different numerical outputs for identical inputs. Small logit differences. Systematic prediction flips on edge cases.

model.eval() not called

A single forgotten line. If your model uses batch normalization and your inference code doesn't set evaluation mode, the normalization layers recompute batch statistics on every inference call. With batch size 1 — standard in real-time serving — the statistics are meaningless. The model behavior is unpredictable.
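
A minimal sketch of the failure, using a toy BatchNorm model rather than your actual serving code: the same input produces different outputs depending on whether evaluation mode was set.

python
import torch
import torch.nn as nn

torch.manual_seed(0)
# toy model with a BatchNorm layer; stand-in for your serving model
model = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU())
x = torch.randn(1, 3, 8, 8)   # batch size 1, as in real-time serving

out_train = model(x)          # eval() forgotten: batch stats recomputed from this one sample
model.eval()
out_eval = model(x)           # eval() set: frozen running statistics are used
print('identical outputs:', torch.allclose(out_train, out_eval))   # prints False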

How to run the check

Run every command below in both environments — staging and production. Compare every line of output. Any difference is a finding.
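
One way to make the comparison mechanical is to redirect each environment's output into a file and diff the two files. A minimal sketch, assuming you saved them as staging.txt and production.txt (both names are placeholders):

python
import difflib
import pathlib

# assumption: you redirected each environment's check output into these files
staging = pathlib.Path('staging.txt').read_text().splitlines()
production = pathlib.Path('production.txt').read_text().splitlines()

# any output at all is a finding
for line in difflib.unified_diff(staging, production, 'staging', 'production', lineterm=''):
    print(line)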

Model and preprocessor checksums

bash
sha256sum /models/model.pt
sha256sum /models/scaler.joblib
sha256sum /models/preprocessor.pkl
sha256sum /models/vocab.json
sha256sum /models/label_encoder.pkl

Run for every artifact your serving pipeline loads. They must be identical between environments.
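
If the pipeline loads many artifacts, a small script saves typing. A minimal sketch, assuming everything lives under /models; adjust the path to wherever your serving pipeline actually loads from:

python
import hashlib
import pathlib

# assumption: adjust to wherever your serving pipeline loads artifacts from
ARTIFACT_DIR = pathlib.Path('/models')

# one sorted line per file so the full output can be diffed between environments
for path in sorted(ARTIFACT_DIR.rglob('*')):
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        print(f'{digest}  {path}')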

Container image digest

bash
# Not the tag — the actual content digest
docker inspect --format='{{index .RepoDigests 0}}' your-image:tag

# If running in Kubernetes: the resolved image, including its digest
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].imageID}'

# The local image ID (a content hash of the image configuration)
docker inspect --format='{{.Id}}' your-image:tag

Tags get rebuilt. Digests don't lie. If the digest differs between environments, the image differs.

Runtime versions

python
python -c "
import sys, numpy as np
import torch

print('Python:', sys.version)
print('PyTorch:', torch.__version__)
print('CUDA:', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
print('NumPy:', np.__version__)

try:
    import sklearn
    print('scikit-learn:', sklearn.__version__)
except ImportError:
    pass

try:
    import pandas as pd
    print('pandas:', pd.__version__)
except ImportError:
    pass
"

GPU hardware and driver

bash
nvidia-smi --query-gpu=driver_version,name,compute_cap --format=csv,noheader
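
It can also be worth querying the hardware from inside the Python runtime that serves the model, since that is the view your code actually gets. A minimal sketch:

python
python -c "
import torch

print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
    print('Compute capability:', torch.cuda.get_device_capability(0))
"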

TF32 precision settings — critical for Ampere/Hopper GPUs

python
python -c "
import torch
print('TF32 matmul allowed:', torch.backends.cuda.matmul.allow_tf32)
print('TF32 cudnn allowed:', torch.backends.cudnn.allow_tf32)
print('cuDNN deterministic:', torch.backends.cudnn.deterministic)
print('cuDNN benchmark:', torch.backends.cudnn.benchmark)
"

model.eval() verification

Search your serving code for every inference code path:

bash
grep -r "model.predict\|model.forward\|model(" ./serving --include="*.py" | head -40
grep -r "model.eval()" ./serving --include="*.py"

Every code path that calls the model for inference should have model.eval() called before it. If you find inference calls without a corresponding model.eval(), that's your bug.
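
Beyond grep, you can make a missing eval() fail loudly at runtime. A minimal sketch of a guard around the inference entry point (the predict wrapper is hypothetical, not your existing code):

python
import torch

# hypothetical wrapper around your inference entry point
def predict(model: torch.nn.Module, input_tensor: torch.Tensor) -> torch.Tensor:
    # fail loudly instead of silently recomputing batch statistics
    assert not model.training, 'model is in training mode; call model.eval() at load time'
    with torch.no_grad():
        return model(input_tensor)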

Interpreting the results

Different model or preprocessor checksums

The artifacts in production are not what was evaluated. Confirm which version is correct and redeploy consistently.

Different container image digest with same tag

Your CI/CD is rebuilding images without versioning them properly. Move to digest-pinned deployments: reference the image as registry/image@sha256:<digest> instead of a mutable tag, so a rebuild cannot silently change what runs.

Different library versions

Rebuild both environments from the same dependency lockfile. Use pip freeze > requirements.txt from the environment that produced your best evaluation results, and pin every package to its exact version, including patch releases.
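
To verify that an environment actually matches the lockfile, a minimal sketch that compares installed versions against a ==-pinned requirements.txt (the file path is an assumption):

python
from importlib import metadata

# compare every pinned package in the lockfile against what is actually installed
with open('requirements.txt') as f:          # assumption: lockfile path
    for line in f:
        line = line.split(';')[0].strip()    # drop environment markers
        if not line or line.startswith('#') or '==' not in line:
            continue
        name, pinned = line.split('==', 1)
        name = name.split('[')[0].strip()    # drop extras like pkg[extra]
        pinned = pinned.strip()
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f'MISSING   {name}: pinned {pinned}, not installed')
            continue
        if installed != pinned:
            print(f'MISMATCH  {name}: pinned {pinned}, installed {installed}')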

TF32 enabled in one environment but not the other

Add this to every inference initialization and ensure it matches what was used during training:

python
import torch

# Set these to match your training environment exactly
torch.backends.cuda.matmul.allow_tf32 = False  # or True — must match training
torch.backends.cudnn.allow_tf32 = False          # must match training
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model.eval() missing

Add it to every inference code path. No exceptions.

python
model.eval()
with torch.no_grad():
    output = model(input_tensor)

What to do with your finding

If you found something: Fix it. Redeploy. Re-evaluate. This is often the entire root cause.

If everything matched: Good. The artifacts and environment are not the problem. Move to Check 02.

Next

Check 02 — Inference equivalence

📘 Free · Full technical playbook

The Ground Truth Framework
The complete diagnostic playbook — 125 pages.

125 pages · 27 figures & charts · 40+ code examples · 9 verticals covered
Full research foundation (Sculley, Zhang, Breck, DoorDash post-mortem). Every check with runnable code and interpretation tables. Four detailed case studies across defense, medical, industrial, and SaaS. Vertical-specific guides. Bibliography.

Unlock the playbook


Still stuck after fixing what you found?

That's what the diagnosis is for. Pattern recognition from 40+ systems.

Get diagnosed — $1,500

48 hours · Moe personally · Full refund if fewer than 3 findings · No calls