Check 03 of 03 · Free implementation guide

Log-and-replay feature diff

Are training and serving computing the same features from the same raw inputs? This is the check nobody runs. And it's where the problem almost always lives.

Why this matters

Your training pipeline and your serving pipeline started identical.

Then they drifted.

Someone fixed a bug in training and forgot serving. Someone added a feature in Python and reimplemented it slightly differently in Go. A library updated and changed a default. The data team changed an upstream schema and only one pipeline handled it correctly.

The result: the features your model sees at inference time are computed differently from the features it was trained on.

Your offline evaluation metrics look fine — because evaluation uses the training pipeline. Production performance is degraded — because production uses the serving pipeline.

The gap is invisible in your metrics. It's only visible when you directly compare what the two pipelines compute on identical raw inputs.

DoorDash found a 4.3 percentage-point AUC gap this way. One feature. One lookback window. Silent for months.

Step 01 — Add feature logging to production

If you don't have this — add it now. It's the most valuable instrumentation decision you can make.

python
import json
import logging
import time
import uuid

prediction_logger = logging.getLogger("predictions")

def predict_with_logging(raw_input):
    request_id = str(uuid.uuid4())

    # Your serving featurizer
    features = serving_featurize(raw_input)

    # Your model
    prediction = model.predict(features)

    # Log everything: raw input, computed features, and prediction
    prediction_logger.info(json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "raw_input": serialize_input(raw_input),  # your own JSON-safe serializer
        "feature_vector": features.tolist(),
        "prediction": prediction.tolist()
    }))

    return prediction

Configure the logger to write to a file or log aggregation system. Collect at least 100 production requests before running the comparison — more is better.
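
One minimal way to wire that up, assuming plain JSON lines in a local file are enough to start with (swap the FileHandler for your aggregation system's handler):

python
import logging

prediction_logger = logging.getLogger("predictions")
prediction_logger.setLevel(logging.INFO)
prediction_logger.propagate = False  # keep raw JSON lines out of the root logger

# Writes one JSON object per line, the format Step 02 reads back
handler = logging.FileHandler("production_sample.jsonl")
handler.setFormatter(logging.Formatter("%(message)s"))  # message is already JSON
prediction_logger.addHandler(handler)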

Step 02 — Replay through training pipeline

python
from training.pipeline import featurize as train_featurize
import numpy as np
import pandas as pd
import json

# Load production logs
logs = []
with open("production_sample.jsonl") as f:
    for line in f:
        logs.append(json.loads(line))

print(f"Loaded {len(logs)} production requests")

# Features as computed by the serving pipeline
serving_features = np.stack([
    np.array(r["feature_vector"]) for r in logs
])

# Features as the training pipeline would compute from the
# same raw inputs (deserialize first if serialize_input
# changed the format train_featurize expects)
training_features = np.stack([
    train_featurize(r["raw_input"]) for r in logs
])

print(f"Serving features shape:  {serving_features.shape}")
print(f"Training features shape: {training_features.shape}")

Step 03 — Compute per-feature disagreement

python
# Per-feature disagreement rate
# How often does serving compute a different value than training
# for the same raw input?

tolerance = 1e-6  # features should agree to within floating-point precision

disagreement = (
    np.abs(serving_features - training_features) > tolerance
).mean(axis=0)

# Absolute difference statistics per feature
abs_diff = np.abs(serving_features - training_features)
mean_diff = abs_diff.mean(axis=0)
max_diff  = abs_diff.max(axis=0)

# Build results dataframe
results = pd.DataFrame({
    "feature":        FEATURE_NAMES,  # your feature names list
    "disagreement":   disagreement,
    "mean_abs_diff":  mean_diff,
    "max_abs_diff":   max_diff
}).sort_values("disagreement", ascending=False)

print("\nTop diverging features:")
print(results[results["disagreement"] > 0.001].to_string(index=False))

Step 04 — Prioritize by impact

Not all divergence is equally harmful. A feature with an 80% disagreement rate that barely affects predictions matters less than a feature with a 15% disagreement rate that the model relies on heavily.

Cross-reference with feature importance:

python
import shap

# Get SHAP values for feature importance
# Use a sample of serving features
explainer = shap.TreeExplainer(model)      # for tree models
# explainer = shap.DeepExplainer(model, background_data)  # for neural nets

shap_values = explainer.shap_values(serving_features[:100])
# Note: for some model types shap_values is a list with one array per
# class; pick the relevant class (e.g. shap_values[1]) before aggregating
feature_importance = np.abs(shap_values).mean(axis=0)

# Impact score = importance × disagreement rate
# Map by feature name: `results` is already sorted, so assigning the
# raw array positionally would pair importances with the wrong rows
importance = pd.Series(feature_importance, index=FEATURE_NAMES)
results["importance"]   = results["feature"].map(importance)
results["impact_score"] = results["importance"] * results["disagreement"]
results = results.sort_values("impact_score", ascending=False)

print("\nFix these first (highest impact × disagreement):")
print(results[["feature", "disagreement", "importance", "impact_score"]]
      .head(10).to_string(index=False))

Common divergences and their fixes

Rolling window mismatch

Training:

python
df.groupby("user_id")["value"].rolling(7).mean()

Serving: rolling over 14 days because the spec was ambiguous.

Fix: make the window size an explicit constant shared between both pipelines. Test with known inputs.
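
One way to enforce this is a single shared constant plus a known-input test that both pipelines run; the sketch below assumes the constant lives in a module both can import:

python
import pandas as pd

ROLLING_WINDOW = 7  # define once in a shared module, import it in both pipelines

# Known-input test: both pipelines must reproduce this exact value
df = pd.DataFrame({"user_id": [1] * 7, "value": [1.0, 2, 3, 4, 5, 6, 7]})
rolled = df.groupby("user_id")["value"].rolling(ROLLING_WINDOW).mean()
assert rolled.iloc[-1] == 4.0  # mean of 1..7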

Timezone handling

Training:

python
pd.to_datetime(ts, utc=True).dt.hour

Serving:

python
datetime.fromtimestamp(ts).hour
— local server time.

Fix: always convert to UTC before any time-based feature computation. In both pipelines.
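
A sketch of what that shared helper can look like, assuming timestamps arrive as epoch seconds:

python
from datetime import datetime, timezone

def hour_utc(ts: float) -> int:
    # Epoch seconds to hour of day in UTC, regardless of server timezone
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

assert hour_utc(0) == 0  # 1970-01-01T00:00:00Z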

Missing value handling

Training:

python
df[col].fillna(df[col].mean())
— uses training set mean.

Serving:

python
df[col].fillna(0)
— uses zero.

Fix: compute and save the training set mean. Load it in serving. Use the same imputation value.
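
A sketch of that save/load handshake; train_df, serving_df, col, and the file name are placeholders for your own pipeline:

python
import json

# Training: persist the imputation value next to the model artifact
train_mean = float(train_df[col].mean())
with open("imputation_params.json", "w") as f:
    json.dump({col: train_mean}, f)

# Serving: load and apply the exact same value
with open("imputation_params.json") as f:
    params = json.load(f)
serving_df[col] = serving_df[col].fillna(params[col])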

Categorical encoding

Training: scikit-learn LabelEncoder — codes assigned in alphabetical order.

Serving: custom encoder — codes assigned in first-seen order.

Fix: save the fitted LabelEncoder from training. Load and use it in serving. Never reimplement from scratch.
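
For example, with joblib; train_df and serving_df stand in for your own data:

python
import joblib
from sklearn.preprocessing import LabelEncoder

# Training: fit once, save the fitted encoder alongside the model artifact
encoder = LabelEncoder().fit(train_df["category"])
joblib.dump(encoder, "label_encoder.joblib")

# Serving: load the same fitted object, never a reimplementation
encoder = joblib.load("label_encoder.joblib")
codes = encoder.transform(serving_df["category"])
# Note: transform() raises on categories unseen at fit time;
# decide your out-of-vocabulary handling explicitly, in both pipelines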

Scaling parameter staleness

Training: StandardScaler refitted on fresh data at the latest retrain. Serving: still loading the scaler parameters saved 8 months ago, from a previous training run.

Fix: refresh scaling parameters every time you retrain. Version the saved parameters alongside the model artifact.
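
One possible shape for that versioning; MODEL_VERSION, X_train, and X_serving are stand-ins:

python
import joblib
from sklearn.preprocessing import StandardScaler

MODEL_VERSION = "2025-01-15"  # stand-in: same tag as the deployed model artifact

# At every retrain: refit and save under the model's version
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, f"scaler_{MODEL_VERSION}.joblib")

# Serving: load the scaler that matches the deployed model version
scaler = joblib.load(f"scaler_{MODEL_VERSION}.joblib")
X_scaled = scaler.transform(X_serving)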

Interpreting your results

Zero disagreement across all features

Pipelines are synchronized. The failure is in the data distribution — what's arriving in production is genuinely different from what the model was trained on. You need distribution analysis (PSI, KS-test) to quantify the gap and targeted data collection to close it.
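
A minimal sketch of both tests for a single feature; train_values and prod_values are assumed 1-D arrays of that feature's values from the training set and the production logs:

python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    # Population Stability Index between two samples of one feature
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a - e) * np.log(a / e)))

print(f"PSI: {psi(train_values, prod_values):.3f}")  # > 0.2 is a common rule of thumb for significant shift
print(f"KS:  {ks_2samp(train_values, prod_values)}")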

Disagreement on a small number of features at high rates (> 40%)

Classic pipeline drift. One or two specific transformations diverged. Fix the highest-impact features first. Rerun the comparison after each fix.

Disagreement on many features at low rates (5–15%)

Systemic drift — often a library version change or a cross-language reimplementation. Look for a single root cause that explains many features simultaneously.
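
One quick way to hunt for that single root cause is to print the dependency versions each pipeline actually imports and diff the output:

python
import numpy, pandas, sklearn

# Run in BOTH environments and diff; a version skew in any shared
# dependency can explain low-rate divergence across many features
for lib in (numpy, pandas, sklearn):
    print(lib.__name__, lib.__version__)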

Disagreement that correlates with specific raw input characteristics

Conditional divergence — the pipelines agree on most inputs but differ on edge cases. Look for branching logic, missing value handling, or out-of-vocabulary category handling that differs between pipelines.
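
To find these, slice the per-row disagreement from Step 03 by a raw-input characteristic; the device field below is hypothetical:

python
# Per-row flag: did any feature disagree for this request?
row_disagrees = (
    np.abs(serving_features - training_features) > tolerance
).any(axis=1)

# Slice by a raw-input characteristic ("device" is a hypothetical field)
for segment in ("ios", "android", "web"):
    mask = np.array([r["raw_input"].get("device") == segment for r in logs])
    if mask.any():
        print(f"{segment:8s} disagreement rate: {row_disagrees[mask].mean():.1%}")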

The fix sequence

  1. Fix the highest impact-score features first.
  2. After each fix, rerun the full comparison. Confirm the disagreement rate dropped.
  3. Watch for secondary divergences that were masked by the primary one. Fixing one feature sometimes reveals another that was hidden underneath.
  4. When disagreement is zero across all features — run Check 02 again. Confirm inference equivalence still holds with the corrected features.
  5. Re-evaluate your model on production data with the corrected pipeline. Measure the actual remaining performance gap.

After you've run all three checks

You now know one of three things:

You found and fixed the root cause. Production performance recovers. The system works.

You found something but you're not sure it's the whole story. Partial improvement. There may be a secondary failure mode the first fix revealed.

You found nothing across all three checks. The failure is subtler — a genuine distribution shift, concept drift, or an architectural issue that these checks don't surface directly.

In the second and third cases — that's what the diagnosis is for. Pattern recognition from 40+ systems. Knowing which combination of failure modes you're dealing with, what the fix sequence looks like, and whether your deadline is achievable.

📘 Free · Full technical playbook

The Ground Truth Framework
The complete diagnostic playbook — 125 pages.

125 pages · 27 figures & charts · 40+ code examples · 9 verticals covered

Full research foundation (Sculley, Zhang, Breck, DoorDash post-mortem). Every check with runnable code and interpretation tables. Four detailed case studies across defense, medical, industrial, and SaaS. Vertical-specific guides. Bibliography.

Unlock the playbook

No spam. One email when the guide gets major updates.

Found something and not sure what to do with it?

Fixed everything you found and still broken? Found nothing and still broken?

Get diagnosed — $1,500

48 hours · Moe personally · Full refund if fewer than 3 findings · No calls