ENGINEERING
5 min read
February 10, 2026

The Art of ML Model Compression: From 1.5GB to 2.8MB

Practical walkthrough of quantization, pruning, distillation, and architecture search for deploying models on edge hardware.


Why Compression Matters

Your ResNet-50 is 97MB. Your BERT model is 420MB. Your custom vision model is 1.5GB. And your target device has 2GB of RAM, 8GB of storage, and a 5W power budget.

This is the reality of edge AI deployment. The models that work brilliantly in the cloud need to shrink by 100-500x to run on industrial hardware. Here's how we do it.

The Compression Toolkit

1. Quantization

The highest-impact, lowest-effort technique. Convert 32-bit floating point weights to 8-bit integers. Immediate 4x size reduction with typically less than 1% accuracy loss.

import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (simplest). Note: PyTorch's dynamic quantization
# only supports Linear/LSTM-style layers — Conv2d would be silently skipped.
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Static quantization (better accuracy, requires calibration)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_quantized = torch.quantization.convert(model_prepared)
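
The 4x figure falls straight out of the bit widths. A back-of-envelope check (the parameter count is approximate, and real checkpoints carry some metadata overhead, which is why measured sizes differ slightly from the raw-weight math):

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate serialized weight size in megabytes."""
    return num_params * bits_per_param / 8 / 1e6

resnet50_params = 25_600_000                   # ResNet-50 has ~25.6M parameters
fp32_mb = model_size_mb(resnet50_params, 32)   # ~102 MB of raw FP32 weights
int8_mb = model_size_mb(resnet50_params, 8)    # ~26 MB after INT8 quantization
```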

Results from our deployments:

  • ResNet-50: 97MB → 24MB (4x), accuracy drop: 0.3%
  • DistilHuBERT: 95MB → 24MB (4x), accuracy drop: 0.8%
  • YOLOv8-s: 22MB → 5.6MB (4x), mAP drop: 1.2%

For INT4 quantization (8x reduction), accuracy loss increases to 2-5% but can be acceptable for many use cases.

2. Structured Pruning

Remove entire channels or layers that contribute least to the output. Unlike unstructured pruning (which zeros individual weights), structured pruning actually reduces computation and model size.

import torch
import torch.nn.utils.prune as prune

def structured_prune(model, amount=0.3):
    """Zero out the fraction `amount` of output channels with the lowest L1 norm."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.ln_structured(module, name='weight', amount=amount, n=1, dim=0)
            prune.remove(module, 'weight')  # make the zeroed weights permanent
    # Note: this zeroes channels in place; a follow-up surgery step that
    # physically drops them is needed to realize the size and compute savings.
    return model

After pruning, fine-tune for 10-20% of original training epochs to recover accuracy. We typically see:

  • 30% pruning: 2x speedup, under 1% accuracy loss after fine-tuning
  • 50% pruning: 3x speedup, 1-3% accuracy loss after fine-tuning
  • 70% pruning: 5x speedup, 3-8% accuracy loss after fine-tuning
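
The recovery fine-tune is just a short continuation of normal training at a reduced learning rate. A minimal sketch — the loader, learning rate, and epoch count are placeholders to adapt to your setup:

```python
import torch

def recovery_finetune(model, train_loader, epochs, lr=1e-4):
    """Short fine-tune after structured pruning to recover lost accuracy."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model
```

Keep the learning rate low: after `prune.remove` the zeroed weights are ordinary parameters again, and an aggressive schedule can undo the sparsity pattern.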

3. Knowledge Distillation

Train a small "student" model to mimic a large "teacher" model. The student learns from the teacher's soft probability distributions, which contain more information than hard labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    """Combined distillation and classification loss."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)

    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

In our Edge Voice project, we distilled a full HuBERT model (95M params) into DistilHuBERT (23M params) with fine-tuned distillation on our domain-specific audio dataset. The student retained 96% of the teacher's accuracy at 25% of the size.
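
In training, a loss of this shape is wrapped in a step that keeps the teacher frozen. A sketch of one such step — `distill_step`, the optimizer setup, and the hyperparameter defaults are illustrative, not our production code:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, inputs, labels, optimizer, T=4.0, alpha=0.7):
    """One optimization step: teacher runs frozen, student learns from both losses."""
    teacher.eval()
    with torch.no_grad():               # never backprop through the teacher
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```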

4. ONNX Runtime Optimization

After compression, ONNX Runtime's graph optimizer applies additional optimizations: operator fusion, constant folding, and memory planning.

import torch
import onnxruntime as ort

# Convert PyTorch → ONNX
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)

# Optimize with ONNX Runtime
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"

# Creating the session runs the optimizer and writes the optimized graph to disk
session = ort.InferenceSession("model.onnx", sess_options)

5. TensorRT for NVIDIA Hardware

For Jetson deployments, TensorRT provides the best inference performance through layer fusion, kernel auto-tuning, and precision calibration.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build a TensorRT engine from the ONNX export
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16
config.set_flag(trt.BuilderFlag.INT8)  # Enable INT8
config.int8_calibrator = calibrator    # Requires a calibration dataset

# build_serialized_network replaces the deprecated build_engine in TensorRT 8+
engine_bytes = builder.build_serialized_network(network, config)

The Compression Pipeline

Our standard pipeline for edge deployment:

  1. Train full-precision model to best accuracy
  2. Prune 30-50% of channels (structured)
  3. Fine-tune pruned model for 10-20 epochs
  4. Distill if size is still too large
  5. Quantize to INT8
  6. Export to ONNX
  7. Optimize with TensorRT (NVIDIA) or ONNX Runtime
  8. Benchmark on target hardware
  9. Validate accuracy meets requirements
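
Because each stage degrades accuracy slightly and the ordering matters, we find it useful to express the pipeline as an explicit, auditable sequence of stages. A minimal scaffold — the stage names and helper functions in the usage comment are illustrative:

```python
def run_compression_pipeline(model, stages):
    """Apply compression stages in order, recording each stage name for audit."""
    applied = []
    for name, stage in stages:
        model = stage(model)
        applied.append(name)
    return model, applied

# Hypothetical usage — each helper wraps one of the techniques above:
# stages = [
#     ("prune", lambda m: structured_prune(m, amount=0.5)),
#     ("finetune", lambda m: recovery_finetune(m, loader, epochs=15)),
#     ("quantize", quantize_int8),
# ]
# model, applied = run_compression_pipeline(model, stages)
```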

Real-World Results

Our best compression result: a custom vision model for industrial defect detection.

| Stage | Size | Latency (Jetson Orin) | Accuracy |
|-------|------|----------------------|----------|
| Original (FP32) | 1.5GB | 890ms | 97.2% |
| Pruned (50%) | 380MB | 310ms | 96.8% |
| Distilled | 45MB | 85ms | 96.1% |
| Quantized (INT8) | 12MB | 34ms | 95.7% |
| TensorRT optimized | 2.8MB | 12ms | 95.4% |

From 1.5GB/890ms to 2.8MB/12ms. That's a 535x size reduction and 74x speedup with only 1.8% accuracy loss.

The model now runs at 83 FPS on a Jetson Orin Nano, well within the 130 units/hour throughput requirement for the production line.

Common Pitfalls

Don't quantize before pruning. Quantization adds noise. Pruning removes parameters. Pruning a quantized model amplifies quantization errors. Always prune first, then quantize.

Calibrate quantization on representative data. Using random data or a small unrepresentative sample for calibration will shift the quantization ranges and destroy accuracy.

Test on edge hardware, not your dev machine. A model that runs perfectly on your RTX 4090 may behave differently on a Jetson. Memory layout, CUDA compute capability, and driver versions all matter.

Profile before optimizing. Measure where the bottleneck actually is. Sometimes the model inference is fast but data preprocessing is slow. Compressing the model won't help if the bottleneck is elsewhere.
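
A crude wall-clock harness is often enough to find the real bottleneck. A sketch — the stage functions in the usage comment are placeholders for your own preprocessing and inference callables:

```python
import time

def time_stage(fn, *args, warmup=3, runs=20):
    """Average wall-clock time of one pipeline stage, in milliseconds."""
    for _ in range(warmup):          # discard cold-start runs (JIT, cache warmup)
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0

# Hypothetical usage: time preprocessing and inference separately
# print(f"preprocess: {time_stage(preprocess, raw_frame):.1f} ms")
# print(f"inference:  {time_stage(session.run, None, inputs):.1f} ms")
```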

The goal isn't the smallest model — it's the smallest model that meets your accuracy and latency requirements on your target hardware. Sometimes INT8 quantization alone gets you there. Sometimes you need the full pipeline. Let the requirements drive the compression strategy.

Discussion (2)

eng_manager_tech · Engineering Manager, Technology · 1 week ago

Solid technical depth. This is the kind of content that makes me actually trust a vendor — they clearly know what they're talking about because nobody writes at this level of specificity without real experience.

Mostafa Dhouib (Author) · 1 week ago

That's the goal — we write about what we've actually done, not what we've read about. Every article is based on real deployment experience, real numbers, real failures. Thanks for reading.

Mostafa Dhouib, Founder & ML Engineer at Opulion

Facing a similar challenge?

Tell us about your problem. We'll respond with an honest technical assessment within 24 hours.