Case studies
What finding the gap actually looks like
Four systems. Four verticals. Four different failure modes. Same diagnostic sequence. Every root cause found in 48 hours or less.
These are composite case studies drawn from real engagements. Specific details have been generalized to protect client confidentiality.
Defense
Situation
Defense tech startup with an autonomous UAS classification system. Series A funded. DoD demo in 18 days. Field accuracy: 67% (lab: 94%). Two contractors over 4 months had failed to close the gap. Needed 85% to proceed.
What two contractors looked at
The model. Both rebuilt it; neither audited the environment it ran in.
What we found
V100 (training) → A100 (production). The A100's default TF32 precision, a mode the V100 lacks, was silently flipping edge-case classifications.
Normalization parameters were 8 months stale. Color space was BGR (OpenCV capture) vs RGB (PyTorch DataLoader). Every input frame had reversed channels.
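What those fixes look like in code, as a minimal sketch: a PyTorch model fed by OpenCV capture, with the function name and normalization values as illustrative assumptions, not the client's code.

```python
import cv2
import torch

# A100 (Ampere) supports TF32; V100 (Volta) does not. Forcing full FP32
# reproduces the training hardware's numerics on the new GPU.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False

# Stale normalization is a data fix, not a code fix: recompute from current data.
MEAN, STD = 0.471, 0.212  # placeholder values; regenerate from the live dataset

def load_frame(path: str) -> torch.Tensor:
    """Load a capture frame the way the training DataLoader saw it."""
    frame = cv2.imread(path)                         # OpenCV decodes to BGR
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # match the RGB training data
    tensor = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
    return (tensor - MEAN) / STD
```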
Fix sequence
Matched A100 precision to training (TF32 off). Regenerated the stale normalization parameters. Added BGR→RGB conversion to the capture path.
Outcome
Demoed at 91% field accuracy. Series A extension completed. DoD program proceeded to next phase.
What this illustrates
Two contractors rebuilt the model twice. The model didn't need rebuilding. The environment needed 2 config fixes and 1 pipeline correction. Total time: 6 days.
Medical
Situation
Medical AI company with an AI-assisted diagnostic tool in FDA De Novo submission. 22% error rate on portable field devices (4% on hospital equipment). 14 months into the submission process. Delay cost: $50K–$200K/month.
What contractors looked at
What we found
Training data sourced from 3 academic centers with high-end imaging equipment. The portable device had fundamentally different sensor characteristics, compression artifacts, and dynamic range. PSI > 0.4 across key features; anything above 0.25 is conventionally treated as significant shift.
Training/serving pipeline: validated, no divergence found. The pipeline was clean.
Fix sequence
Outcome
Submission timeline recovered. 2 months of targeted work replaced what would have been a 6-month delay. FDA pathway continued.
What this illustrates
The model was fine. The data was fine — for hospital equipment. Nobody had measured the distribution shift between device types. A single PSI calculation would have surfaced this on day one.
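For reference, a minimal sketch of that day-one check, with the bin count and variable names as assumptions:

```python
import numpy as np

def psi(reference: np.ndarray, candidate: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D feature samples."""
    # Bin edges come from the reference (training) distribution.
    edges = np.percentile(reference, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cand_pct = np.histogram(candidate, edges)[0] / len(candidate)
    # Clip empty bins to avoid log(0) and division by zero.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cand_pct = np.clip(cand_pct, 1e-6, None)
    return float(np.sum((cand_pct - ref_pct) * np.log(cand_pct / ref_pct)))
```

Run it per feature, hospital-sourced training data against portable-device captures; by the usual convention, anything above 0.25 flags significant shift, and here key features exceeded 0.4.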
Industrial
Situation
ML-powered quality inspection system on an automotive production line. 31% false positive rate, 12% false negative rate. Client threatening contract termination. One month to resolve. Previous contractor blamed sensor noise, added vibration dampening hardware. No improvement.
What the contractor looked at
The hardware. Sensor noise took the blame; the software pipeline was never compared across environments.
What we found
Rolling window size: 30 seconds (training) vs 60 seconds (serving) due to an ambiguous specification. 63–79% feature disagreement between environments.
Python 3.9 (training) vs 3.11 (serving). Data augmentation transforms were incorrectly included in the serving pipeline — production inputs were being randomly augmented before inference.
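A sketch of the corrected split, assuming a torchvision pipeline; the specific transforms and the shared window constant are illustrative:

```python
import torch
from torchvision import transforms

WINDOW_SECONDS = 60  # one shared constant, so training and serving can't drift apart

# Augmentation belongs to training only.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

# Serving must be deterministic: no flips, no rotations.
serve_transform = transforms.Compose([
    transforms.ToTensor(),
])

def classify(model: torch.nn.Module, image) -> torch.Tensor:
    model.eval()                  # inference mode: no dropout, frozen batch-norm stats
    with torch.no_grad():
        return model(serve_transform(image).unsqueeze(0))
```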
Fix sequence
Standardized the rolling window across training and serving. Removed augmentation transforms from the serving pipeline. Aligned the Python versions.
Outcome
Contract termination retracted. Extension signed 3 months later for two additional production lines.
What this illustrates
The contractor added hardware. The problem was software. Augmentation in serving is a silent killer — the system was randomly flipping and rotating production images before classifying them.
SaaS / Fintech
Situation
AI-powered fraud detection system. 91% offline precision vs 58% in production — a 33-point gap. Two contractors over 6 months hadn't closed it. Series B due diligence audit approaching.
What two contractors tried
Model optimization. 6 months of retraining and tuning between them; the gap didn't close.
What we found
LabelEncoder (alphabetical ordering) in training vs custom FastAPI encoder (first-seen ordering) in serving. ~40% of categorical values had incorrect codes.
UTC timestamps in training data. US-East local time in serving pipeline. A 4–5 hour offset (EST/EDT) corrupting every time-based feature.
Minor distribution shift. PSI 0.14 — present but not the primary cause.
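A minimal reproduction of both mismatches, assuming pandas and scikit-learn; the category values and timestamp are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = ["wire", "ach", "card", "wire"]

# Training: LabelEncoder sorts its classes, so codes are alphabetical.
train_codes = LabelEncoder().fit_transform(values)  # ach=0, card=1, wire=2

# Serving: a first-seen encoder assigns codes in arrival order.
seen: dict[str, int] = {}
serve_codes = [seen.setdefault(v, len(seen)) for v in values]  # wire=0, ach=1, card=2
# Identical inputs, different codes: the model reads a scrambled feature.

# Timestamps: localize to the source zone, then convert to UTC before
# deriving any time-based feature.
ts = (pd.Timestamp("2024-03-01 09:30")
        .tz_localize("America/New_York")
        .tz_convert("UTC"))
```

The durable fix is to serialize the fitted training encoder and load the same artifact in serving, so the two paths cannot disagree.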
Fix sequence
Replaced the custom serving encoder with the training encoder's mapping. Normalized all serving timestamps to UTC. Flagged the residual distribution shift for monitoring.
Outcome
58% → 87% in 2 working days. Series B technical audit passed. Distribution shift flagged for ongoing monitoring.
What this illustrates
Two contractors spent 6 months optimizing the model. The model was never the problem. Two pipeline mismatches, encoding order and timezone, accounted for 29 points of the 33-point gap.
Recognize your system in one of these?
The failure modes in these case studies are the same failure modes present in every broken production ML system we've examined.