November 10, 2025

Docker for ML Engineers: Beyond 'It Works on My Machine'

Multi-stage builds for ML, GPU passthrough, model artifact management, and the anti-patterns that bloat your images to 15GB.


Why ML Docker Is Different

Standard Docker best practices get you 80% of the way for ML systems. The other 20% will cost you days of debugging if you don't know the gotchas.

ML containers have unique challenges: large model artifacts (100MB-10GB), GPU driver compatibility, CUDA version pinning, Python dependency hell, and the need for reproducible training environments.

The Multi-Stage Build Pattern

The biggest anti-pattern in ML Docker: one massive image with training dependencies, inference dependencies, Jupyter, development tools, and every Python package ever pip-installed. These images routinely hit 10-15GB.

Use multi-stage builds to separate concerns:

# Stage 1: Training image (large, has everything)
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS trainer
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip install torch torchvision transformers wandb
COPY training/ /app/training/
RUN python3 /app/training/train.py --output /models/

# Stage 2: Inference image (small, production-only)
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04 AS inference
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip install torch --index-url https://download.pytorch.org/whl/cu122
RUN pip install onnxruntime-gpu fastapi uvicorn
COPY --from=trainer /models/ /models/
COPY serving/ /app/serving/
WORKDIR /app
EXPOSE 8080
CMD ["uvicorn", "serving.main:app", "--host", "0.0.0.0", "--port", "8080"]

The training image might be 12GB. The inference image is 3GB. Your production cluster only ever pulls the 3GB image.

GPU Passthrough

For NVIDIA GPUs, you need the NVIDIA Container Toolkit installed on the host, and your base image must match the host's CUDA driver compatibility.

# Install NVIDIA Container Toolkit (apt-key is deprecated; use a keyring)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update && apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker && systemctl restart docker

# Run with GPU
docker run --gpus all -p 8080:8080 my-model:latest

Critical rule: The CUDA version in your container must be compatible with the NVIDIA driver on the host. Container CUDA ≤ host driver's max supported CUDA. Check with nvidia-smi on the host.
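To make the rule concrete, here is a minimal sketch (not from the article) that extracts the driver's maximum supported CUDA version from nvidia-smi's header line and compares it against the container's toolkit version. The function names and parsing regex are assumptions; on a real host you would feed in the live output of `nvidia-smi`.

```python
import re

def driver_max_cuda(smi_output: str) -> tuple:
    """Parse the max CUDA version the host driver supports from nvidia-smi's header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if not match:
        raise ValueError("could not find CUDA version in nvidia-smi output")
    return (int(match.group(1)), int(match.group(2)))

def container_is_compatible(container_cuda: str, smi_output: str) -> bool:
    """Container CUDA toolkit version must be <= the driver's max supported version."""
    major, minor = (int(p) for p in container_cuda.split(".")[:2])
    return (major, minor) <= driver_max_cuda(smi_output)

if __name__ == "__main__":
    # On a real host: smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    smi = "| NVIDIA-SMI 535.104.05    Driver Version: 535.104.05    CUDA Version: 12.2 |"
    print(container_is_compatible("12.2.0", smi))
```

A check like this belongs in CI or a container entrypoint, so a CUDA/driver mismatch fails loudly at startup instead of surfacing as a cryptic runtime error.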

Model Artifact Management

Don't bake large models into the Docker image. This makes every image rebuild download gigabytes, slows CI/CD, and wastes storage.

Instead:

# Bad: model in image (rebuilt every time)
COPY models/bert-large.onnx /models/

# Good: download at startup or mount as volume
ENV MODEL_PATH=/models/bert-large.onnx
CMD ["sh", "-c", "python download_model.py && uvicorn main:app --host 0.0.0.0 --port 8080"]

For production, use a model registry (MLflow, Weights & Biases, or S3) and download at container startup. This decouples model versions from container versions.
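As a sketch of what a download_model.py might look like, here is a stdlib-only version that caches the artifact and verifies a checksum. The `MODEL_URL` and `MODEL_SHA256` environment variables are hypothetical names; in practice the download call would be a boto3 or registry-client call instead of urllib.

```python
import hashlib
import os
import urllib.request

# Hypothetical env vars; in practice these point at your registry or bucket.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/bert-large.onnx")
MODEL_URL = os.environ.get("MODEL_URL", "")
MODEL_SHA256 = os.environ.get("MODEL_SHA256", "")  # optional integrity check

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def ensure_model(path: str = MODEL_PATH, url: str = MODEL_URL, sha256: str = MODEL_SHA256) -> str:
    """Download the model once; later container restarts reuse the cached copy."""
    if not os.path.exists(path):
        if os.path.dirname(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)  # swap in boto3 / registry client here
    if sha256 and sha256_of(path) != sha256:
        raise RuntimeError(f"checksum mismatch for {path}")
    return path
```

The checksum step matters more than it looks: a registry entry plus a pinned digest gives you the same reproducibility guarantee for model weights that an image digest gives you for code.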

Reproducible Training Environments

Pin everything. Not just Python packages — pin the base image digest, CUDA version, and system libraries.

# Pin base image by digest, not tag
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04@sha256:abc123...

# Pin Python version
RUN apt-get update && apt-get install -y python3.11

# Use a lock file, not requirements.txt
COPY requirements.lock /app/
RUN pip install --no-deps -r /app/requirements.lock

Generate the lock file with pip freeze or pip-compile in a known-good environment. Never use >= version specifiers in production Dockerfiles.

Common Anti-Patterns

1. Installing Jupyter in production images. Your production inference server doesn't need Jupyter. Keep development and production images separate.

2. Running as root. Create a non-root user. This isn't just a security practice — it prevents accidental writes to system directories that can corrupt your environment.

RUN useradd -m -s /bin/bash mluser
USER mluser

3. Not using .dockerignore. Without it, Docker copies your entire project directory into the build context — including .git/, data/, notebooks/, and that 5GB dataset you forgot about.

# .dockerignore
.git
data/
notebooks/
*.pyc
__pycache__
.env
wandb/

4. Pip installing in one giant RUN command. Split your requirements into layers — system packages, base ML frameworks, and application-specific packages. This maximizes Docker layer caching.

# Layer 1: System packages (rarely changes)
RUN apt-get update && apt-get install -y libgl1-mesa-glx libglib2.0-0

# Layer 2: ML framework (changes occasionally)
COPY requirements-base.txt .
RUN pip install -r requirements-base.txt

# Layer 3: Application packages (changes frequently)
COPY requirements-app.txt .
RUN pip install -r requirements-app.txt

Health Checks

Always include a health check in your ML container (note that curl must actually exist in the image; slim bases often omit it):

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

The health endpoint should verify the model is loaded and can perform inference — not just that the HTTP server is running.

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health():
    try:
        # Actually run a dummy inference, not just return 200
        model.predict(DUMMY_INPUT)
        return {"status": "healthy", "model_loaded": True}
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "unhealthy", "error": str(e)})

The Production Checklist for ML Docker

  1. Multi-stage build separates training from inference
  2. Base image pinned by digest
  3. All Python packages pinned to exact versions
  4. GPU driver compatibility verified
  5. Non-root user configured
  6. Health check endpoint implemented
  7. Model artifacts loaded at runtime, not baked in
  8. .dockerignore excludes data, notebooks, and .git
  9. Logs go to stdout/stderr (not files)
  10. Resource limits (memory, GPU) configured in deployment
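Item 9 is worth a sketch, since file-based logging inside containers is a common slip. A minimal setup, assuming a JSON-lines format and a logger name of `serving` (both assumptions, not from the article), looks like this:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log collectors can parse fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def setup_logging() -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)  # stdout, never a file in the container
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("serving")
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    return logger

logger = setup_logging()
logger.info("model loaded")
```

With logs on stdout, `docker logs` and your cluster's log shipper both work with zero extra configuration, and nothing fills up the writable layer.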

Docker for ML isn't hard. It's just different enough from standard Docker to trip you up if you don't know the gotchas. Get the fundamentals right, and your deployment pipeline becomes boring — which is exactly what you want.

Mostafa Dhouib, Founder & ML Engineer at Opulion