Check 03 of 03 · Free implementation guide
Log-and-replay feature diff
Are training and serving computing the same features from the same raw inputs? This is the check nobody runs. And it's where the problem almost always lives.
Why this matters
Your training pipeline and your serving pipeline started identical.
Then they drifted.
Someone fixed a bug in training and forgot serving. Someone added a feature in Python and reimplemented it slightly differently in Go. A library updated and changed a default. The data team changed an upstream schema and only one pipeline handled it correctly.
The result: the features your model sees at inference time are computed differently from the features it was trained on.
Your offline evaluation metrics look fine — because evaluation uses the training pipeline. Production performance is degraded — because production uses the serving pipeline.
The gap is invisible in your metrics. It's only visible when you directly compare what the two pipelines compute on identical raw inputs.
DoorDash found a 4.3 percentage-point AUC gap this way. One feature. One lookback window. Silent for months.
Step 01 — Add feature logging to production
If you don't have this — add it now. It's the most valuable instrumentation decision you can make.
```python
import json
import logging
import time
import uuid

prediction_logger = logging.getLogger("predictions")

def predict_with_logging(raw_input):
    request_id = str(uuid.uuid4())

    # Your serving featurizer
    features = serving_featurize(raw_input)

    # Your model
    prediction = model.predict(features)

    # Log everything
    prediction_logger.info(json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "raw_input": serialize_input(raw_input),
        "feature_vector": features.tolist(),
        "prediction": prediction.tolist()
    }))

    return prediction
```

Configure the logger to write to a file or a log-aggregation system. Collect at least 100 production requests before running the comparison; more is better.
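The logging snippet assumes a `serialize_input` helper that turns the raw request into JSON-safe types. A minimal sketch, assuming dict, array, dataclass, and NumPy scalar inputs (the helper is hypothetical; adapt it to your own payload format):

```python
import dataclasses

import numpy as np

def serialize_input(raw_input):
    """Recursively convert a raw request payload into JSON-safe types.

    Hypothetical helper; extend with whatever types your requests contain.
    """
    if isinstance(raw_input, np.ndarray):
        return raw_input.tolist()
    if dataclasses.is_dataclass(raw_input):
        return dataclasses.asdict(raw_input)
    if isinstance(raw_input, dict):
        return {k: serialize_input(v) for k, v in raw_input.items()}
    if isinstance(raw_input, (list, tuple)):
        return [serialize_input(v) for v in raw_input]
    if isinstance(raw_input, (np.integer, np.floating)):
        return raw_input.item()
    return raw_input
```

Whatever you choose, make sure the serialized form preserves enough fidelity that the training featurizer can be replayed on it byte-for-byte.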
Step 02 — Replay through training pipeline
```python
import json

import numpy as np
import pandas as pd

from training.pipeline import featurize as train_featurize

# Load production logs
logs = []
with open("production_sample.jsonl") as f:
    for line in f:
        logs.append(json.loads(line))

print(f"Loaded {len(logs)} production requests")

# Features as computed by the serving pipeline
serving_features = np.stack([
    np.array(r["feature_vector"]) for r in logs
])

# Features as the training pipeline would compute them
# from the same raw inputs
training_features = np.stack([
    train_featurize(r["raw_input"]) for r in logs
])

print(f"Serving features shape: {serving_features.shape}")
print(f"Training features shape: {training_features.shape}")
```

Step 03 — Compute per-feature disagreement
```python
# Per-feature disagreement rate:
# how often does serving compute a different value than training
# for the same raw input?
tolerance = 1e-6  # features should be identical to floating-point precision

disagreement = (
    np.abs(serving_features - training_features) > tolerance
).mean(axis=0)

# Absolute-difference statistics per feature
abs_diff = np.abs(serving_features - training_features)
mean_diff = abs_diff.mean(axis=0)
max_diff = abs_diff.max(axis=0)

# Build results dataframe
results = pd.DataFrame({
    "feature": FEATURE_NAMES,  # your feature names list
    "disagreement": disagreement,
    "mean_abs_diff": mean_diff,
    "max_abs_diff": max_diff
}).sort_values("disagreement", ascending=False)

print("\nTop diverging features:")
print(results[results["disagreement"] > 0.001].to_string(index=False))
```

Step 04 — Prioritize by impact
Not all divergence is equally harmful. A feature with an 80% disagreement rate that barely affects predictions matters less than one with a 15% disagreement rate that the model relies on heavily.
Cross-reference with feature importance:
```python
import shap

# Get SHAP values for feature importance.
# Use a sample of serving features.
explainer = shap.TreeExplainer(model)  # for tree models
# explainer = shap.DeepExplainer(model, background_data)  # for neural nets
shap_values = explainer.shap_values(serving_features[:100])
feature_importance = np.abs(shap_values).mean(axis=0)

# Impact score = importance × disagreement rate
results["importance"] = feature_importance
results["impact_score"] = results["importance"] * results["disagreement"]
results = results.sort_values("impact_score", ascending=False)

print("\nFix these first (highest impact × disagreement):")
print(results[["feature", "disagreement", "importance", "impact_score"]]
      .head(10).to_string(index=False))
```

Common divergences and their fixes
Rolling window mismatch
Training:

```python
df.groupby("user_id")["value"].rolling(7).mean()
```

Serving: rolling over 14 days because the spec was ambiguous.
Fix: make the window size an explicit constant shared between both pipelines. Test with known inputs.
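A minimal sketch of the shared-constant approach with a known-input test (the `LOOKBACK_DAYS` name and the shared module are assumptions; in practice both pipelines import the same constant rather than redefining it):

```python
import pandas as pd

# In practice this lives in a module imported by BOTH pipelines,
# e.g. shared_constants.py (hypothetical module name)
LOOKBACK_DAYS = 7

# Known-input test: one user with values 0..9
df = pd.DataFrame({"user_id": [1] * 10, "value": range(10)})
rolled = (
    df.groupby("user_id")["value"]
      .rolling(LOOKBACK_DAYS)
      .mean()
)
# The last window covers values 3..9, so its mean is 6.0
```

Run the same known inputs through the serving implementation and assert the outputs match; a mismatch here is exactly the kind of divergence Step 03 surfaces at scale.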
Timezone handling
Training:

```python
pd.to_datetime(ts, utc=True).dt.hour
```

Serving:

```python
datetime.fromtimestamp(ts).hour
```

Fix: always convert to UTC before any time-based feature computation. In both pipelines.
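A quick way to see the divergence, assuming `ts` is a Unix timestamp: `datetime.fromtimestamp(ts)` uses the server's local timezone, so the two pipelines only agree when the serving host happens to run in UTC.

```python
from datetime import datetime, timezone

ts = 1700000000  # example Unix timestamp (2023-11-14 22:13:20 UTC)

# Deterministic in both pipelines: hour in UTC
utc_hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour

# Server-dependent: hour in the host's local timezone
local_hour = datetime.fromtimestamp(ts).hour

print(utc_hour)    # always 22
print(local_hour)  # varies with the serving host's timezone
```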
Missing value handling
Training:

```python
df[col].fillna(df[col].mean())
```

Serving:

```python
df[col].fillna(0)
```

Fix: compute and save the training-set mean. Load it in serving. Use the same imputation value.
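A minimal sketch of persisting the training-set mean so serving imputes with the same value (the `impute_values.json` file name is an assumption; any versioned artifact store works):

```python
import json

import pandas as pd

# Training side: compute the imputation value once and persist it
train = pd.DataFrame({"income": [50.0, None, 70.0]})
impute_values = {"income": float(train["income"].mean())}  # 60.0
with open("impute_values.json", "w") as f:
    json.dump(impute_values, f)

# Serving side: load the saved value instead of hardcoding 0
with open("impute_values.json") as f:
    impute_values = json.load(f)

serve = pd.DataFrame({"income": [None, 80.0]})
serve["income"] = serve["income"].fillna(impute_values["income"])
```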
Categorical encoding
Training: scikit-learn LabelEncoder — alphabetical order. Serving: custom encoder — first-seen order.
Fix: save the fitted LabelEncoder from training. Load and use it in serving. Never reimplement from scratch.
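A sketch of the save/load round trip using `joblib` (the artifact file name is an assumption):

```python
import joblib
from sklearn.preprocessing import LabelEncoder

# Training side: fit once and save the fitted artifact
encoder = LabelEncoder().fit(["dog", "cat", "bird"])
joblib.dump(encoder, "label_encoder.joblib")

# Serving side: load the identical fitted encoder; never refit or reimplement
encoder = joblib.load("label_encoder.joblib")
codes = encoder.transform(["cat", "dog"])
# LabelEncoder sorts classes alphabetically: bird=0, cat=1, dog=2
```

A first-seen-order reimplementation would assign different integers to the same categories, which is exactly the silent skew this check catches.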
Scaling parameter staleness
Training: the StandardScaler was refitted at the latest retrain. Serving: still loads parameters saved 8 months ago, from an earlier model version.
Fix: refresh scaling parameters every time you retrain. Version the saved parameters alongside the model artifact.
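A sketch of refreshing and versioning the parameters at retrain time (the version tag and file-naming scheme are assumptions; the point is that scaler and model share one version):

```python
import json

import numpy as np

MODEL_VERSION = "v42"  # hypothetical version tag, shared with the model artifact

# Refit scaling parameters on the CURRENT training data at every retrain
train_col = np.array([10.0, 20.0, 30.0])
params = {"mean": float(train_col.mean()), "std": float(train_col.std())}

# Version the parameters alongside the model so serving can't mix versions
with open(f"scaler_params_{MODEL_VERSION}.json", "w") as f:
    json.dump(params, f)
```

Serving then resolves the parameter file from the deployed model's version tag, making a stale-parameter mismatch structurally impossible.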
Interpreting your results
Zero disagreement across all features
Pipelines are synchronized. The failure is in the data distribution — what's arriving in production is genuinely different from what the model was trained on. You need distribution analysis (PSI, KS-test) to quantify the gap and targeted data collection to close it.
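A minimal sketch of the two-sample KS test with `scipy` on one feature (the synthetic shifted samples stand in for your training and production values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)  # stand-in for a training-set feature
prod_feature = rng.normal(0.5, 1.0, 5000)   # stand-in: production, shifted by 0.5

statistic, p_value = stats.ks_2samp(train_feature, prod_feature)
# A tiny p-value means the two samples are very unlikely to come
# from the same distribution
print(statistic, p_value)
```

Run it per feature and rank by statistic to find where the distribution moved most.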
Disagreement on a small number of features at high rates (> 40%)
Classic pipeline drift. One or two specific transformations diverged. Fix the highest-impact features first. Rerun the comparison after each fix.
Disagreement on many features at low rates (5–15%)
Systemic drift — often a library version change or a cross-language reimplementation. Look for a single root cause that explains many features simultaneously.
Disagreement that correlates with specific raw input characteristics
Conditional divergence — the pipelines agree on most inputs but differ on edge cases. Look for branching logic, missing value handling, or out-of-vocabulary category handling that differs between pipelines.
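One way to surface conditional divergence is to condition the per-request disagreement rate on an input characteristic (the masks below are hypothetical placeholders; derive them from your own logged raw inputs):

```python
import numpy as np
import pandas as pd

# Hypothetical per-request data: did any feature diverge,
# and did the raw input have a missing field?
row_disagrees = np.array([False, False, True, True, False, True])
has_missing_field = np.array([False, False, True, True, False, False])

# Disagreement rate conditioned on the characteristic
rate = pd.Series(row_disagrees).groupby(has_missing_field).mean()
print(rate)
# A much higher rate in one slice means the divergence is conditional
```

Repeat for each candidate characteristic (missing fields, rare categories, extreme values) until one slice clearly carries the disagreement.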
The fix sequence
1. Fix the highest impact-score features first.
2. After each fix, rerun the full comparison. Confirm the disagreement rate dropped.
3. Watch for secondary divergences that were masked by the primary one. Fixing one feature sometimes reveals another that was hidden underneath.
4. When disagreement is zero across all features, run Check 02 again. Confirm inference equivalence still holds with the corrected features.
5. Re-evaluate your model on production data with the corrected pipeline. Measure the actual remaining performance gap.
After you've run all three checks
You now know one of three things:
You found and fixed the root cause. Production performance recovers. The system works.
You found something but you're not sure it's the whole story. Partial improvement. There may be a secondary failure mode the first fix revealed.
You found nothing across all three checks. The failure is subtler — a genuine distribution shift, a concept drift, or an architectural issue that these checks don't surface directly.
In the second and third cases — that's what the diagnosis is for. Pattern recognition from 40+ systems. Knowing which combination of failure modes you're dealing with, what the fix sequence looks like, and whether your deadline is achievable.
The Ground Truth Framework
The complete diagnostic playbook — 125 pages.
Full research foundation (Sculley, Zhang, Breck, DoorDash post-mortem). Every check with runnable code and interpretation tables. Four detailed case studies across defense, medical, industrial, and SaaS. Vertical-specific guides. Bibliography.
Found something and not sure what to do with it?
Fixed everything you found and still broken? Found nothing and still broken?
48 hours · Moe personally · Full refund if fewer than 3 findings · No calls