ENGINEERING
10 min read
March 3, 2026

The 51-Point Production ML Checklist

A comprehensive checklist covering every critical aspect of deploying ML systems to production — from data validation and model testing to monitoring, security, and organizational readiness. Forged from real deployment failures.

Every item on this checklist exists because we learned it the hard way.

Why This Checklist Exists

Over four years of deploying ML systems into production across manufacturing, oil and gas, defense, biotech, and SaaS companies, I have accumulated a list of everything that can go wrong. Every item on this checklist represents a real production incident, a delayed deployment, or a 2 AM page that could have been prevented.

Most "production ML" resources focus on the model. The model is maybe 5% of the production system. This checklist covers the other 95%: the data pipelines, the serving infrastructure, the monitoring, the security, the organizational processes, and the documentation that determine whether your ML system actually works when real users depend on it.

Use this as a pre-flight checklist before any production deployment. Not every item applies to every project. But every item you skip should be a conscious decision, not an oversight.


Data Pipeline

  • [ ] 1. Schema validation is enforced on all data inputs. Every field has a defined type, range, and nullability constraint. Schema violations are caught before they reach the feature pipeline.

  • [ ] 2. Data freshness monitoring is active. You have alerts for when expected data arrives late or stops arriving entirely. A stale feature is often worse than a missing feature.

  • [ ] 3. Feature distributions are monitored against training baselines. You are tracking statistical distance (KL divergence, PSI, or similar) between production feature distributions and the distributions the model was trained on.

  • [ ] 4. Point-in-time correctness is verified. Your feature pipeline produces the same features at inference time as it would have during training for any given historical timestamp. No future data leaks into the feature set.

  • [ ] 5. Null handling is explicit and documented. Every feature has a documented strategy for null values: impute with median, impute with zero, drop the row, use a missing indicator, or reject the request.

  • [ ] 6. Data pipeline is idempotent. Running the same pipeline on the same input data produces the same output. No side effects, no dependency on execution order.

  • [ ] 7. Raw data is preserved immutably. You never overwrite raw data. Every transformation produces a new versioned artifact. You can always go back to the source.

  • [ ] 8. Feature computation code is shared between training and serving. There is a single implementation of each feature transformation, not a Python version for training and a Java version for serving.

  • [ ] 9. Data quality metrics are logged and dashboarded. Null rates, cardinality, value ranges, and distribution statistics are computed continuously and visible to the team.

  • [ ] 10. Upstream data source changes have a notification mechanism. You know when a team that owns your upstream data changes their schema, semantics, or delivery schedule.
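
To make item 3 concrete, here is a minimal sketch of a PSI (Population Stability Index) check against a training baseline. The function name and smoothing constant are illustrative, not a specific library's API; a common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a major shift, though the right thresholds depend on your features.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training baseline (`expected`) and a production
    sample (`actual`). Bin edges come from the training distribution
    so the comparison is stable across runs."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover production outliers
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # Smooth empty buckets so the log term stays finite.
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

The same function works for KL-style checks on any numeric feature; run it per feature on a schedule and alert on the threshold breach.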

Model Development

  • [ ] 11. Training data is versioned. You can reproduce any training run using the exact dataset it was trained on, not "approximately the same data."

  • [ ] 12. The training-serving skew test passes. You have a test that feeds the same raw inputs through both the training pipeline and the serving pipeline and verifies the features match.

  • [ ] 13. Model performance is evaluated on a held-out test set that reflects production distribution. Not a random split from six months ago. A test set that represents current production data characteristics.

  • [ ] 14. Performance is evaluated across relevant subgroups. Aggregate metrics hide subgroup failures. If your model serves different segments (product types, geographies, customer tiers), evaluate each one independently.

  • [ ] 15. The model baseline is documented. You know what performance the current system (rules-based, human, or previous model) achieves, and the new model demonstrably improves on it.

  • [ ] 16. Model size and complexity are justified. If a logistic regression achieves 95% of the performance of a 1B parameter transformer, you should be deploying the logistic regression unless you have a documented reason for the complexity.

  • [ ] 17. Random seeds are fixed and documented. All stochastic processes (data shuffling, weight initialization, dropout) use recorded seeds. You can reproduce training results exactly.

  • [ ] 18. Hyperparameter choices are documented with rationale. Not just the final values -- the search process, the validation results, and the reasoning for the choices.
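
Item 17 in code: a small, hypothetical `set_global_seed` helper that pins the stochastic sources a typical Python training job touches. Framework-specific seeding (PyTorch, TensorFlow) would go in the commented slot.

```python
import os
import random
import numpy as np

def set_global_seed(seed: int) -> int:
    """Fix every stochastic source we control and return the seed so
    the training run can log it alongside its other metadata."""
    # PYTHONHASHSEED only affects subprocesses, but set it anyway so
    # spawned workers inherit deterministic hashing.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # If a DL framework is in play, seed it here too, e.g.:
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    return seed
```

Call it once at the top of the training entry point and record the returned seed with the run's artifacts.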

Model Testing

  • [ ] 19. Unit tests exist for all feature transformations. Each feature computation function has tests that verify correct output for known inputs, including edge cases.

  • [ ] 20. Integration tests verify the full pipeline. From raw input to final prediction, there is an automated test that exercises the entire inference path.

  • [ ] 21. Model performance regression tests are automated. A CI check that evaluates model performance on a benchmark dataset and fails if metrics drop below thresholds.

  • [ ] 22. Latency benchmarks are measured and gated. You know the p50, p95, and p99 inference latency under expected load, and CI fails if latency exceeds SLA.

  • [ ] 23. Load testing has been performed. You have verified that the serving infrastructure handles expected peak traffic with acceptable latency and error rates.

  • [ ] 24. Edge case inputs are explicitly tested. Empty inputs, maximum-length inputs, adversarial inputs, inputs with all nulls, inputs from distribution tails. Each has a test case.

  • [ ] 25. The model fails gracefully on unexpected inputs. Invalid inputs produce clear error messages, not stack traces, NaN outputs, or silent wrong answers.
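
A sketch of the regression gate in item 21. The metric names and thresholds below are invented for illustration; in practice the thresholds would live in version-controlled config, and the metrics would come from evaluating the candidate model on the benchmark dataset inside CI.

```python
# Illustrative thresholds -- in a real setup these come from config.
THRESHOLDS = {"accuracy": 0.92, "p95_latency_ms": 50.0}

def check_regression(metrics: dict, thresholds: dict) -> list:
    """Return human-readable failures; an empty list means the gate
    passes. The `_ms` suffix convention (lower is better for latency,
    higher is better for everything else) is an assumption here."""
    failures = []
    for name, bound in thresholds.items():
        if name.endswith("_ms"):
            if metrics[name] > bound:
                failures.append(f"{name}={metrics[name]} exceeds {bound}")
        elif metrics[name] < bound:
            failures.append(f"{name}={metrics[name]} below {bound}")
    return failures
```

Wire the returned list into a CI step that fails the build when it is non-empty, printing each failure so the regression is obvious from the log.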

Serving Infrastructure

  • [ ] 26. Model serving has horizontal scaling configured. You can add serving replicas under load without manual intervention.

  • [ ] 27. Health check endpoints exist and are monitored. A /health endpoint that verifies the model is loaded and the feature pipeline is connected, not just that the HTTP server is running.

  • [ ] 28. Request/response logging is active. Every prediction request and response is logged with a unique request ID, timestamp, and latency. Logging does not impact serving latency.

  • [ ] 29. A rollback mechanism exists and has been tested. You can revert to the previous model version within minutes, and you have actually tested the rollback procedure -- not just assumed it works.

  • [ ] 30. Resource limits are configured. CPU, memory, and GPU limits are set on serving containers. One misbehaving request cannot consume all available resources.

  • [ ] 31. Timeout and retry logic is configured. Downstream consumers have appropriate timeouts for model inference calls, and retry logic handles transient failures without creating thundering herd problems.

  • [ ] 32. The circuit breaker fallback is implemented and tested. When the model is unavailable, a fallback mechanism (cached prediction, simple model, rule-based default) handles requests.

  • [ ] 33. Shadow deployment has been validated. The new model has run in shadow mode receiving production traffic for a defined period, and its predictions have been compared against the current production model.
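
Item 32 can be sketched as a small wrapper: after a configurable number of consecutive failures the circuit opens and requests go straight to the fallback until a cooldown expires. This is a minimal single-process sketch under assumed names, not a substitute for a hardened library implementation.

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, route traffic to the
    fallback for `reset_after` seconds instead of calling the model."""

    def __init__(self, predict, fallback, max_failures=3, reset_after=30.0):
        self.predict, self.fallback = predict, fallback
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, features):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(features)  # circuit open: fail fast
            self.opened_at, self.failures = None, 0  # half-open: retry
        try:
            result = self.predict(features)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback(features)
```

The fallback can be a cached prediction, a simple model, or a rule-based default, per item 32; the important part is that it is exercised in tests, not just written.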

Monitoring and Alerting

  • [ ] 34. Input data quality alerts are configured. Alerts fire when input data violates schema, shifts distribution, or arrives late. These alerts have defined owners and response procedures.

  • [ ] 35. Prediction distribution is monitored. You are tracking the distribution of model outputs over time. A sudden shift in prediction distribution (even without ground truth) indicates a problem.

  • [ ] 36. Model performance metrics are computed when ground truth is available. As labels arrive (immediately, hours later, or weeks later), actual performance metrics are computed and tracked.

  • [ ] 37. System health metrics are monitored. CPU utilization, memory usage, GPU utilization, disk usage, network latency, error rates. These have alerts with defined thresholds.

  • [ ] 38. Alert fatigue has been addressed. Alerts are tuned so that every alert requires action. If an alert fires daily and nobody responds, it is either tuned wrong or monitoring the wrong thing.

  • [ ] 39. Dashboards exist for each stakeholder. ML engineers see model metrics. Platform engineers see infrastructure metrics. Business stakeholders see outcome metrics. Each dashboard shows what that audience needs.

  • [ ] 40. An on-call rotation exists. Someone is responsible for responding to production ML alerts at all times. That person has the context and access to diagnose and resolve issues.
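
As a minimal illustration of item 35, the sketch below watches a sliding window of model outputs and flags when the window mean drifts too many standard errors from a baseline recorded at deployment. Real systems often compare full histograms instead, but the shape of the check is the same; all names and thresholds here are assumptions.

```python
import math
from collections import deque

class PredictionMonitor:
    """Alert when the sliding-window mean of model outputs drifts more
    than `z_threshold` standard errors from the deployment baseline."""

    def __init__(self, baseline_mean, baseline_std, window=1000, z_threshold=4.0):
        self.mean, self.std = baseline_mean, baseline_std
        self.window = deque(maxlen=window)
        self.z_threshold = z_threshold

    def record(self, prediction: float) -> bool:
        """Record one prediction; return True if an alert should fire."""
        self.window.append(prediction)
        n = len(self.window)
        if n < self.window.maxlen:
            return False  # wait for a full window before judging
        window_mean = sum(self.window) / n
        z = abs(window_mean - self.mean) / (self.std / math.sqrt(n))
        return z > self.z_threshold
```

This catches the "silent drift" failure mode: prediction distribution shifts are visible immediately, long before ground-truth labels arrive.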

Security and Compliance

  • [ ] 41. Model artifacts are stored in access-controlled storage. Model weights, training data, and configuration files are not publicly accessible and access is audited.

  • [ ] 42. Inference endpoints require authentication. Your model API is not open to the internet without authentication and rate limiting.

  • [ ] 43. PII handling is compliant with applicable regulations. If your model processes personal data, GDPR, CCPA, HIPAA, or sector-specific regulations are satisfied. Data retention and deletion policies are implemented.

  • [ ] 44. Input validation prevents injection attacks. If your model accepts text or structured input, inputs are validated and sanitized before processing.

  • [ ] 45. Model provenance is auditable. For any model in production, you can trace back to the exact training data, code version, and configuration that produced it.
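
A hedged sketch of item 44 for a text-input model: validate types, bound lengths, and strip control characters before anything reaches the feature pipeline. The field name and limits are illustrative, not a real schema.

```python
MAX_TEXT_LEN = 10_000  # illustrative limit

def validate_request(payload: dict) -> dict:
    """Reject malformed requests with a clear error message instead of
    letting them reach the feature pipeline (see item 25)."""
    text = payload.get("text")
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    if not text.strip():
        raise ValueError("'text' must be non-empty")
    if len(text) > MAX_TEXT_LEN:
        raise ValueError(f"'text' exceeds {MAX_TEXT_LEN} characters")
    # Strip control characters that commonly ride along with injection
    # attempts or copy-paste artifacts; keep normal whitespace.
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return {"text": cleaned}
```

In a real service this sits behind the authenticated endpoint (item 42), so every request is validated before the model sees it.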

Organizational Readiness

  • [ ] 46. Runbooks exist for common failure modes. The on-call engineer has written procedures for: model performance degradation, data pipeline failure, serving infrastructure outage, and rollback.

  • [ ] 47. Stakeholders are aligned on success metrics. The ML team, product team, and business stakeholders agree on what metrics define success and what thresholds trigger action.

  • [ ] 48. A model retraining schedule is defined. You know when and why the model will be retrained. Retraining is triggered by performance degradation, data drift, or a calendar schedule -- not by panic.

  • [ ] 49. The handoff from development to operations is documented. If the ML engineer who built the model leaves, someone else can maintain, retrain, and troubleshoot the production system.

  • [ ] 50. Incident response procedures exist. When the model produces a bad outcome, there is a defined process for: identifying the root cause, communicating to stakeholders, implementing a fix, and conducting a post-mortem.

  • [ ] 51. The deprecation plan exists. You know the conditions under which this model will be retired, what replaces it, and how to decommission the infrastructure.


How to Use This Checklist

Do not try to check every box before your first deployment. That way lies analysis paralysis and shipping nothing.

For an MVP or first production deployment: Items 1-9 (data pipeline), 11-13 (model development), 19-20 (basic testing), 26-29 (serving basics), 34-37 (core monitoring), and 42 (security). That is 23 items. Get these right and you have a production system you can sleep on.

For a mature production deployment: All 51 items should be addressed. Not all will apply to every project, but each should be explicitly evaluated and either implemented or documented as not applicable with reasoning.

For regulated industries (healthcare, defense, finance): Items 41-45 and 49-51 are non-negotiable from day one. Regulatory compliance is not something you bolt on after launch.

The checklist is a living document. Every production incident should trigger a review: is there a checklist item that would have caught this? If not, add one. Our internal version is at 67 items and growing.

The goal is not perfection. The goal is conscious decision-making about which risks you are accepting and which you have mitigated. An ML system where the team explicitly decided to skip load testing (item 23) because traffic is low and predictable is in much better shape than an ML system where nobody thought about load testing at all.

Ship it. Monitor it. Improve it. But check the boxes first.

Discussion (2)

ml_lead_logistics · ML Team Lead, Logistics · 1 week ago

Printed this out and taped it to the wall. We hit 23 out of 51 on our route optimization system. The monitoring section alone revealed 4 blind spots we didn't know we had. Already caught a silent data drift issue that was degrading predictions by 12% over the last month.

Mostafa Dhouib (Author) · 1 week ago

Silent drift is the worst — your model is wrong but nobody knows until a human notices downstream. 12% degradation over a month means your retraining trigger threshold was set wrong or didn't exist. The fix: set up a prediction distribution monitor that alerts when the output distribution diverges from baseline by more than a configurable threshold. Simple to implement, saves you from these slow bleeds.

Mostafa Dhouib · Founder & ML Engineer at Opulion