Case studies

What finding the gap actually looks like

Four systems. Four verticals. Four different failure modes. Same diagnostic sequence. Every root cause found in 48 hours or less.

These are composite case studies drawn from real engagements. Specific details have been generalized to protect client confidentiality.

01 · Defense / UAS · Autonomy system

Situation

Defense tech startup with an autonomous UAS classification system. Series A funded. DoD demo in 18 days. Field accuracy: 67% (lab: 94%). Two contractors over 4 months had failed to close the gap. Needed 85% to proceed.

What two contractors looked at

1. Architecture review → rebuilt model from scratch. Accuracy unchanged.
2. More training data (6 weeks of collection) → accuracy unchanged.

What we found

FM01 · Environment

V100 (training) → A100 (production). TF32 default precision on A100 was silently flipping edge-case classifications.
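The correction itself is small. A minimal sketch, assuming a PyTorch stack (the flags are PyTorch's real TF32 switches; everything around them is illustrative):

```python
import torch

# Force full FP32 matmuls and convolutions on Ampere GPUs (A100).
# With TF32 enabled, matmuls keep ~10 mantissa bits instead of 23,
# which is enough to flip classifications near a decision boundary.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```

Whether TF32 is on by default depends on the PyTorch version, which is exactly why it belongs in pinned configuration rather than left to the default.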

FM02 · Pipeline

Normalization parameters were 8 months stale. Color space was BGR (OpenCV capture) vs RGB (PyTorch DataLoader). Every input frame had reversed channels.
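The input-side fix, as a sketch. We're assuming an OpenCV capture feeding a PyTorch model; MEAN and STD stand in for the refreshed statistics (the placeholder values below are ImageNet's, not the client's):

```python
import cv2
import numpy as np
import torch

# Refreshed per-channel normalization statistics (placeholder values;
# the 8-month-stale constants were half of the bug).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(frame_bgr: np.ndarray) -> torch.Tensor:
    # OpenCV decodes frames as BGR; the model was trained on RGB.
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    x = frame_rgb.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    # HWC -> CHW, add a batch dimension.
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)
```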

Fix sequence

Day 1 · TF32 config correction
67% → 74%
Day 2 · Normalization refresh + color space fix
74% → 89%
Day 3 · Full validation sweep
Cleared 85% threshold

Outcome

Demoed at 91% field accuracy. Series A extension completed. DoD program proceeded to next phase.

What this illustrates

Two contractors rebuilt the model twice. The model didn't need rebuilding. The environment needed 2 config fixes and 1 pipeline correction. Total time: 6 days.

02 · Medical AI · FDA De Novo · Diagnostic imaging

Situation

Medical AI company with an AI-assisted diagnostic tool in FDA De Novo submission. 22% error rate on portable field devices (4% on hospital equipment). 14 months into the submission process. Delay cost: $50K–$200K/month.

What contractors looked at

1. Architecture review → architecture was sound; blamed data quality instead.
2. Re-ran field validation with stricter protocols → error rate dropped to 19%. Still failing.

What we found

FM01 · Distribution

Training data sourced from 3 academic centers with high-end imaging equipment. Portable device had fundamentally different sensor characteristics, compression artifacts, and dynamic range. PSI > 0.4 across key features.

FM02 · Pipeline

Validated. No divergence found. Pipeline was clean.

Fix sequence

Week 1–2 · Collected 800 images from the portable device
Representative distribution captured
Week 3 · Fine-tuned model on blended dataset (not a full retrain; sketch below)
Error rate → 9%
Week 4 · Second round: 400 additional images + validation
Error rate → 5.8%
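The Week 3 "fine-tune, not retrain" step, sketched under assumptions: a ResNet-style classifier with a single `fc` head, and `hospital_ds` / `portable_ds` as hypothetical dataset objects. Blending keeps hospital performance while adapting to the portable sensor:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

# model, hospital_ds, portable_ds are defined elsewhere (hypothetical
# names). portable_ds holds the 800 newly collected field images.
blended = ConcatDataset([hospital_ds, portable_ds])
loader = DataLoader(blended, batch_size=32, shuffle=True)

# Freeze the backbone; adapt only the classification head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():  # assumes a ResNet-style head
    p.requires_grad = True

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
model.train()
for x, y in loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```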

Outcome

Submission timeline recovered. 2 months of targeted work replaced what would have been a 6-month delay. FDA pathway continued.

What this illustrates

The model was fine. The data was fine — for hospital equipment. Nobody had measured the distribution shift between device types. A single PSI calculation would have surfaced this on day one.
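That calculation is a few lines of numpy. A sketch (the bin count and the usual 0.1 / 0.25 thresholds are conventions, not standards):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time feature
    (expected) and its production counterpart (actual)."""
    # Bin edges from the training distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Clip empty bins to avoid log(0) and division by zero.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 act now.
```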

03 · Industrial automation · Quality inspection · Automotive

Situation

ML-powered quality inspection system on an automotive production line. 31% false positive rate, 12% false negative rate. Client threatening contract termination. One month to resolve. Previous contractor blamed sensor noise and added vibration damping hardware. No improvement.

What the contractor looked at

1. Blamed sensor noise and vibration.
2. Added vibration damping to mounting hardware. Accuracy unchanged.

What we found

FM01 · Pipeline

Rolling window size: 30 seconds (training) vs 60 seconds (serving) due to an ambiguous specification. 63–79% feature disagreement between environments.
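This class of bug has one reliable defense: a single constant that both the training feature builder and the serving pipeline import. A sketch with pandas and hypothetical column names:

```python
import pandas as pd

# One source of truth, imported by BOTH training and serving. The
# original bug: training read "30s" from one spec, serving read "60s"
# from another.
WINDOW = "30s"

def rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    # Expects a DatetimeIndex; 'sensor' is a hypothetical column.
    out = pd.DataFrame(index=df.index)
    out["sensor_mean"] = df["sensor"].rolling(WINDOW).mean()
    out["sensor_std"] = df["sensor"].rolling(WINDOW).std()
    return out
```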

FM02 · Artifact

Python 3.9 (training) vs 3.11 (serving). Data augmentation transforms were incorrectly included in the serving pipeline — production inputs were being randomly augmented before inference.
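The fix is to define two transform stacks and never let the training one near production. A sketch with torchvision (the specific augmentations are illustrative):

```python
from torchvision import transforms

# Training only: random augmentation improves generalization.
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])

# Serving: deterministic preprocessing only. The bug was reusing
# train_tf here, so production images were randomly flipped and
# rotated before inference.
serve_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
```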

Fix sequence

Day 1 · Removed augmentation from serving pipeline
31% → 18% FP
Day 2 · Corrected window size + locked Python version
4% FP, 2% FN
Day 3 · Validation across production runs
Results confirmed

Outcome

Contract termination retracted. Extension signed 3 months later for two additional production lines.

What this illustrates

The contractor added hardware. The problem was software. Augmentation in serving is a silent killer — the system was randomly flipping and rotating production images before classifying them.

04 · B2B SaaS / Fintech · Fraud detection · Series A

Situation

AI-powered fraud detection system. 91% offline precision vs 58% in production — a 33-point gap. Two contractors over 6 months hadn't closed it. Series B due diligence audit approaching.

What two contractors tried

1. First contractor → feature selection and regularization → 93% offline, 59% production.
2. Second contractor → class imbalance correction and threshold tuning → 61% production.

What we found

FM01 · Pipeline

LabelEncoder (alphabetical ordering) in training vs custom FastAPI encoder (first-seen ordering) in serving. ~40% of categorical values had incorrect codes.
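The fix pattern: fit the encoder once at training time, persist the fitted artifact, and load that exact artifact in serving. A sketch with scikit-learn and joblib (column, path, and variable names are hypothetical):

```python
import joblib
from sklearn.preprocessing import LabelEncoder

# Training: LabelEncoder assigns integer codes in sorted
# (alphabetical) order of the values it sees.
enc = LabelEncoder().fit(train_df["merchant_category"])
joblib.dump(enc, "artifacts/label_encoder.joblib")

# Serving: load the SAME fitted encoder. The bug was a hand-rolled
# encoder assigning codes in first-seen order, so ~40% of categorical
# values mapped to integers the model had never associated with them.
enc = joblib.load("artifacts/label_encoder.joblib")
codes = enc.transform(incoming_values)  # raises on unseen categories
```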

FM02 · Pipeline

UTC timestamps in training data. US-East local time in serving pipeline. 5–6 hour offset corrupting every time-based feature.
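And the timestamp fix: normalize to UTC at the ingestion boundary, before any time-based feature is computed. A pandas sketch with a hypothetical column name:

```python
import pandas as pd

# Serving received naive US-East local timestamps; training features
# were built from UTC. Localize, then convert, at ingestion.
ts = pd.to_datetime(df["event_time"])
df["event_time"] = (
    ts.dt.tz_localize("America/New_York").dt.tz_convert("UTC")
)
```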

FM03 · Distribution

Minor distribution shift. PSI 0.14 — present but not the primary cause.

Fix sequence

Day 1 · Replaced serving encoder with training LabelEncoder
58% → 73%
Day 2 · UTC conversion in serving pipeline
73% → 87%
Day 3 · Validation + monitoring setup
Results stable

Outcome

58% → 87% in 2 working days. Series B technical audit passed. Distribution shift flagged for ongoing monitoring.

What this illustrates

Two contractors spent 6 months optimizing the model. The model was never the problem. Two pipeline mismatches — encoding order and timezone — accounted for 29 of the 33-point gap.

Recognize your system in one of these?

The failure modes in these case studies are the same failure modes present in every broken production ML system we've examined.