The CTO's Guide to Not Getting Burned by ML Vendors
Hard-won lessons from evaluating dozens of ML vendors and consultancies. What to look for, what to run from, and how to structure engagements that actually deliver production systems instead of impressive demos.
A demo is not a deployment. A notebook is not a pipeline. A vendor who cannot explain their failure modes does not understand their own system.
The Demo Trap
I have sat through more ML vendor demos than I care to count. The pattern is always the same. Polished slides. A Jupyter notebook running on cherry-picked data. Accuracy numbers that would make a Kaggle grandmaster weep with joy. A timeline that promises production deployment in eight weeks. A price tag that seems reasonable for what they are promising.
Then reality hits.
The demo dataset was 50,000 rows of clean, balanced data. Your actual data has 14 million rows with 30% missing values, six different date formats, and a class imbalance ratio of 200:1. The "eight weeks to production" estimate assumed you had a feature store, a CI/CD pipeline for ML, a monitoring stack, and a team that understood model versioning. You have none of those things.
This is not hypothetical. I have seen this exact scenario play out at three different companies in the past two years. The vendor collects their first milestone payment, delivers a notebook that performs well on test data, and then the project stalls for months as everyone realizes that going from notebook to production is where 90% of the actual work lives.
Red Flags I Have Learned to Spot
After years of both being a vendor and evaluating vendors, here are the signals that tell me an engagement is headed for trouble before a single line of code is written.
They Lead With Model Architecture, Not Problem Definition
If the first meeting is about transformers, diffusion models, or whatever architecture is trending on arXiv this month, be cautious. The best ML engineers I know spend 60% of a project on problem definition, data understanding, and pipeline architecture. The model itself is often the simplest part.
When a vendor shows up with a pre-selected architecture before understanding your data, your infrastructure, your latency requirements, your team's capabilities, and your actual business problem, they are selling a solution looking for a problem. Good consultancies will push back on your initial framing. They will ask uncomfortable questions about data quality. They will tell you when ML is not the right solution.
They Cannot Explain Their MLOps Story
Ask this question in every vendor evaluation: "Walk me through what happens after the model is trained. How does it get to production? How do you monitor it? What happens when it starts degrading? How do you retrain?"
If the answer is vague, if they defer to "your team will handle deployment," or if they look confused by the question, they are a research shop, not an engineering shop. Research shops produce notebooks. Engineering shops produce systems.
The MLOps story should cover:
- Model versioning and artifact management. Where do trained models live? How do you roll back to a previous version? What metadata is tracked?
- Feature pipeline architecture. How are features computed in production? Is there training-serving skew? How do you ensure the features computed at inference time match what the model saw during training?
- Monitoring and alerting. What metrics are tracked? What thresholds trigger alerts? Who gets paged?
- Retraining triggers and automation. Is retraining manual or automated? What triggers it? How do you validate a new model before promoting it?
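The versioning and rollback item can be made concrete. Here is a minimal sketch, assuming a simple file-based registry (a real stack would use MLflow, a cloud model registry, or similar); the property worth testing for is that promotion and rollback are pointer moves over immutable, metadata-tagged artifacts:

```python
import json
from pathlib import Path

class ModelRegistry:
    """Illustrative file-based model registry: every promoted model gets an
    immutable version entry, and rollback is just moving a pointer."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index = self.root / "index.json"
        if not self.index.exists():
            self.index.write_text(json.dumps({"current": None, "versions": []}))

    def _load(self):
        return json.loads(self.index.read_text())

    def register(self, version, artifact, metadata):
        """Store the artifact, record its metadata, and promote it."""
        state = self._load()
        (self.root / f"{version}.bin").write_bytes(artifact)
        state["versions"].append({"version": version, **metadata})
        state["current"] = version  # promote on register
        self.index.write_text(json.dumps(state))

    def rollback(self, version):
        """Point 'current' back at a previously registered version."""
        state = self._load()
        known = {v["version"] for v in state["versions"]}
        if version not in known:
            raise ValueError(f"unknown version {version}")
        state["current"] = version
        self.index.write_text(json.dumps(state))

    def current(self):
        return self._load()["current"]
```

If a vendor cannot describe their system at least at this level of specificity, with real tooling names attached, treat it as a gap.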
They Promise Accuracy Numbers Before Seeing Your Data
Any vendor who quotes accuracy metrics before conducting a thorough data audit is either lying or incompetent. Both are dangerous.
Model performance is a function of data quality, label quality, feature engineering, class distribution, and a dozen other factors that cannot be assessed from a sales call. A responsible vendor will insist on a paid discovery phase before committing to performance targets. If someone guarantees 95% accuracy on your fraud detection system before seeing a single transaction, walk away.
Their Team Structure Is Wrong
Look at who they are proposing to staff on your project. A team of five PhDs with no production engineering experience will produce beautiful research that never ships. A team of software engineers with no ML experience will build a robust pipeline around a mediocre model.
The right team for a production ML engagement includes:
- An ML engineer who understands both modeling and systems. Someone who can train a model and also write the Dockerfile to serve it.
- A data engineer who can build and maintain the feature pipelines.
- A platform or DevOps engineer who understands your infrastructure and can integrate the ML system into it.
- A technical lead who has shipped ML systems to production before and can anticipate the problems before they become blockers.
If the vendor is proposing to send a single data scientist to work embedded with your team, you are paying consulting rates for a staff augmentation engagement. That is a different value proposition and should be priced accordingly.
How to Structure the Engagement
The single most important structural decision is breaking the engagement into phases with clear exit criteria. Never sign a single contract for "build an ML system" with one final deliverable. Here is how I structure engagements at Opulion:
Phase 1: Discovery and Data Audit (2-4 weeks)
The deliverable is a technical assessment document that covers:
- Data quality analysis with specific metrics on completeness, consistency, and relevance
- Feature engineering feasibility assessment
- Infrastructure gap analysis
- Realistic performance targets with confidence intervals
- Architecture recommendation with alternatives considered and rejected
- Risk register with mitigation strategies
- Go/no-go recommendation with justification
This phase should be priced as a standalone engagement. Both sides should have the option to walk away after discovery without further obligation. If a vendor will not agree to this structure, they are not confident in their ability to deliver honest assessments.
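To make the data quality deliverable concrete, here is a hypothetical first-pass audit sketch in plain Python. The `audit` function and its report fields are illustrative only; a real discovery audit would also cover label quality, consistency across sources, and temporal leakage:

```python
from collections import Counter

def audit(rows, label_key):
    """Crude first-pass data audit over a list of record dicts:
    per-column completeness and the class imbalance ratio. Numbers
    like these belong in the Phase 1 assessment document."""
    n = len(rows)
    columns = rows[0].keys()
    completeness = {
        col: sum(1 for r in rows if r.get(col) not in (None, "", "NULL")) / n
        for col in columns
    }
    labels = Counter(r[label_key] for r in rows)
    return {
        "rows": n,
        "completeness": completeness,
        "imbalance_ratio": max(labels.values()) / min(labels.values()),
    }
```

On the 14-million-row dataset from the demo-trap example, a report like this is exactly what surfaces the 30% missing values and the 200:1 imbalance before anyone commits to an accuracy target.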
Phase 2: Proof of Concept (4-8 weeks)
The PoC should run on your actual data, in an environment that approximates production constraints. Not a Jupyter notebook on a vendor's machine. The exit criteria should include:
- Model performance on a held-out test set that both sides agreed on before training started
- Inference latency measurements on target hardware
- A documented feature pipeline that can be reproduced
- A clear path from PoC to production with identified gaps
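The latency exit criterion is easy to verify mechanically. A sketch, assuming a synchronous `predict` callable; real measurement on target hardware would also control for warm-up, batch size, and concurrent load:

```python
import time

def latency_percentiles(predict, inputs, percentiles=(50, 95, 99)):
    """Time each inference call and report latency percentiles in ms.
    Run on the target hardware: P99 on a GPU dev box says nothing
    about P99 on the serving fleet."""
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        p: samples[min(len(samples) - 1, int(len(samples) * p / 100))]
        for p in percentiles
    }
```

Agree on the percentile targets in writing before the PoC starts, the same way you agree on the held-out test set.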
Phase 3: Production Build (8-16 weeks)
This is where the actual engineering happens. The deliverable is a system, not a model. That system includes training pipelines, feature pipelines, serving infrastructure, monitoring, alerting, documentation, and runbooks.
Phase 4: Hardening and Handoff (4-8 weeks)
Load testing. Failure mode testing. Documentation review. Knowledge transfer sessions. Shadow mode deployment where the ML system runs alongside the existing process without making decisions. Gradual traffic migration.
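The shadow mode piece deserves emphasis because it is cheap to build and catches problems nothing else will. A minimal sketch, assuming both the legacy process and the model expose a decision function; the key invariant is that the model's output is only logged, never acted on:

```python
import logging

log = logging.getLogger("shadow")

def shadow_decision(legacy_fn, model_fn, request):
    """Shadow mode: the model runs on live traffic but only the legacy
    process makes the decision. The logged disagreement rate over a few
    weeks tells you whether gradual traffic migration is safe."""
    decision = legacy_fn(request)
    try:
        shadow = model_fn(request)
        if shadow != decision:
            log.info("disagreement request=%r legacy=%r model=%r",
                     request, decision, shadow)
    except Exception:
        # A shadow failure must never affect the live path.
        log.exception("shadow model failed")
    return decision
```

Note the bare except: in shadow mode, even a crashing model must be invisible to users.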
The Questions You Must Ask
Here is a checklist I give to every CTO evaluating ML vendors. Ask these in the first technical meeting:
On data handling:
- How do you handle training-serving skew?
- What is your approach to feature stores?
- How do you version datasets?
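The training-serving skew question has a concrete follow-up: ask how they would prove the two paths compute identical features. One hypothetical approach (the `feature_fingerprint` helper is illustrative, not any particular library's API) is to hash the feature vector in both pipelines on the same raw records:

```python
import hashlib
import json

def feature_fingerprint(compute_features, raw_record):
    """Hash the feature vector a feature function produces. Running this
    in both the training pipeline and the serving path on identical raw
    records catches skew: differing hashes mean the two paths are not
    computing the same features."""
    features = compute_features(raw_record)
    canonical = json.dumps(features, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

A vendor with a real answer here will talk about shared feature code or a feature store; a vendor without one will talk about being careful.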
On production readiness:
- What is your model serving stack?
- What is your P99 inference latency on comparable workloads?
- How do you handle model rollbacks?
- What does your CI/CD pipeline for ML look like?
On monitoring:
- What metrics do you track in production?
- How do you detect data drift?
- How do you detect concept drift?
- What is your mean time to detection for model degradation?
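A good answer to the data drift question names a concrete statistic. One common choice is the Population Stability Index; a rough sketch, with the usual rule of thumb that PSI below 0.1 is stable, 0.1 to 0.25 warrants investigation, and above 0.25 means the feature has drifted:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time feature
    sample ('expected') and a production sample ('actual'), using
    equal-width bins derived from the training range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def distribution(values):
        counts = [0] * bins
        for v in values:
            idx = min(bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # Smooth empty bins so the log term is defined.
        return [max(c, 1) / len(values) for c in counts]

    e, a = distribution(expected), distribution(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Whatever statistic the vendor picks matters less than whether it is computed automatically, per feature, with alert thresholds someone actually owns.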
On failure modes:
- What happens when the model receives out-of-distribution inputs?
- What is your fallback strategy when the model is unavailable?
- How do you handle adversarial inputs?
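The out-of-distribution and fallback questions have a simple structural answer worth looking for: a guard around the model that routes anything suspicious, and any model failure, to a deterministic rule. A sketch, assuming per-feature training ranges are available (`bounds` and `rule_fn` are illustrative names):

```python
def predict_with_fallback(model_fn, rule_fn, features, bounds):
    """Guarded inference: out-of-distribution inputs and model failures
    both fall back to a deterministic rule. 'bounds' maps feature name
    to the (min, max) seen during training; anything outside is OOD.
    Returns (decision, source) so the fallback rate can be monitored."""
    for name, (lo, hi) in bounds.items():
        value = features.get(name)
        if value is None or not (lo <= value <= hi):
            return rule_fn(features), "fallback:ood"
    try:
        return model_fn(features), "model"
    except Exception:
        return rule_fn(features), "fallback:error"
```

Returning the source alongside the decision matters: a rising fallback rate is itself a degradation signal that should feed the monitoring stack.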
On team and process:
- Who specifically will work on this project?
- What is their production ML experience?
- Can I talk to a reference from a similar engagement?
- What does your code review process look like?
Pricing Models and Their Incentives
How a vendor prices their work reveals how they think about delivery.
Time and materials aligns incentives toward thoroughness but creates open-ended cost risk. It works best when the problem is genuinely exploratory and the scope cannot be fixed upfront. Insist on weekly burn reports and a not-to-exceed cap.
Fixed price aligns incentives toward shipping but creates pressure to cut corners. It works best when the scope is well-defined and both sides have done enough discovery to understand the work. The risk is that the vendor delivers the letter of the contract rather than the spirit.
Outcome-based pricing sounds appealing but is almost impossible to implement fairly for ML projects. Model performance depends on data quality, which is usually the client's responsibility. Tying vendor compensation to accuracy metrics creates perverse incentives around test set construction and metric selection.
The approach I recommend is time and materials for discovery, fixed price for PoC with clear exit criteria, and time and materials with a cap for production build. This gives both sides appropriate risk exposure at each phase.
The Build vs. Buy Decision
Sometimes the right answer is not to hire a vendor at all. Build internally when:
- ML is a core competency you need to develop and retain
- The problem domain requires deep institutional knowledge that is hard to transfer
- You have the team and the patience for a longer timeline
- The system will need continuous iteration that makes ongoing vendor relationships expensive
Hire a vendor when:
- You need to move fast and do not have the team yet
- The problem is well-defined and the vendor has solved similar problems before
- You want to de-risk a technology bet before committing internal resources
- You need to augment a small internal team with specialized expertise
The worst outcome is hiring a vendor to build a system that your team cannot maintain after the engagement ends. If you go the vendor route, knowledge transfer is not optional. It is the primary deliverable.
What Good Looks Like
The best vendor engagement I have been part of ended with the client's team confidently operating, monitoring, and iterating on the system without needing us. We worked ourselves out of a job. That is the point.
The system had comprehensive documentation. The team had been embedded with us throughout the build. The monitoring dashboards were in their Grafana instance, not ours. The runbooks were written in their format, stored in their wiki. The code was in their repository, following their conventions.
We got a call six months later. Not because something broke, but because they wanted to extend the system to a new use case and wanted our input on the architecture. That is the right relationship. Trusted advisor, not permanent dependency.
The ML vendor landscape is full of brilliant people building impressive technology. But brilliance does not ship systems. Engineering discipline ships systems. When you are evaluating vendors, look for the ones who are more excited about your data pipeline than your model architecture. Look for the ones who ask about your on-call rotation before they ask about your GPU budget. Look for the ones who tell you what they cannot do.
Those are the ones who will not burn you.
Discussion (4)
Just went through vendor evaluation for our fraud detection system. Talked to 8 vendors. 6 of them couldn't answer basic questions about class imbalance handling or explain what happens when their model encounters a distribution shift. The 'ask them to explain a failure' test from this article would have saved us 3 months.
The failure question is the single best filter. Anyone can show you a demo with clean data. Ask them: 'Tell me about a time your model failed in production and what you did about it.' If they can't answer with specifics — exact error rates, root cause, fix timeline — they haven't shipped production ML. It's that simple.
We evaluated a well-known AI consultancy (Big 4 spinoff). Their proposal was $450K for 6 months. When I asked about edge deployment constraints (ITAR, air-gapped networks, power budgets), the room went quiet. They'd never deployed outside a cloud environment. This checklist should be mandatory reading before any vendor conversation.
Edge + defense is a completely different world from cloud ML. If a vendor can't immediately talk about model quantization tradeoffs, ONNX vs TensorRT for your specific hardware, and how they handle OTA updates on air-gapped systems — they're going to learn on your dime. $450K to fund someone's education is a tough sell.