Edge AI: When Milliseconds Matter More Than Megabytes
How we built an industrial voice command system that achieves 87% accuracy in 95dB factory noise on a 4-watt power budget. A deep dive into Conformer-Tiny, DSP pipelines, and edge deployment.
The cloud is a luxury. The edge is a constraint. Constraints breed better engineering.
The Problem Nobody Warns You About
When a client came to us with a request for a voice-activated control system for industrial machinery, the initial reaction from most vendors was to propose a cloud-based speech recognition pipeline. Capture audio on the factory floor, stream it to AWS, run it through a large speech model, send the command back. Simple. Proven. Wrong.
Here is why: the factory floor had 95dB ambient noise, the machinery operators wore ear protection, the WiFi infrastructure was unreliable due to electromagnetic interference from welding stations, and the latency requirement was under 200 milliseconds from utterance to machine response. A round trip to the cloud would take 300-800ms on a good day, assuming connectivity existed at all.
This is the edge AI problem in its purest form. When milliseconds matter more than megabytes, when connectivity is a luxury you cannot assume, and when your power budget is 4 watts because the device runs on battery during shift transitions, the entire architecture has to be rethought from first principles.
Why Cloud Will Not Work Here
Let me quantify the problem. We measured network conditions over two weeks on the factory floor:
- Average round-trip latency to nearest AWS region: 180ms (when connected)
- Connection dropout rate: 12% of 10-second windows had no connectivity
- Jitter: 40-200ms standard deviation
- Available bandwidth during peak shift: 2-8 Mbps shared across 40 devices
Even if we could tolerate the latency, the dropout rate makes cloud-dependent systems unusable for safety-critical commands. An operator issuing an emergency stop command cannot wait for the WiFi to reconnect. The system must work locally, every time, without exception.
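To see why a 12% window dropout rate is disqualifying, a quick back-of-envelope helps. Assuming (simplistically) that 10-second windows fail independently, the chance of hitting at least one dead window during a multi-window interaction compounds fast:

```python
# Probability of hitting at least one dead window, assuming independent
# 10-second windows at the measured 12% dropout rate (independence is an
# optimistic simplification; real dropouts cluster)
p_drop = 0.12
for windows in (1, 3, 6):
    p_any = 1 - (1 - p_drop) ** windows
    print(f"{windows * 10:>2d}s session: {p_any:.0%} chance of a dead window")
```

A 30-second interaction already has roughly a one-in-three chance of losing connectivity at some point, which is unacceptable for safety-critical commands.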
The power constraint adds another dimension. The target device was a ruggedized tablet mounted to machinery with a battery backup during shift changeovers. A Jetson Orin running at full TDP would drain the battery in 45 minutes. We needed to stay under 4 watts average while maintaining real-time inference.
Architecture Decisions
We evaluated three architectures:
Option A: Full Conformer model (300M parameters). State-of-the-art accuracy. Requires server-grade GPU. 15W minimum. Eliminated immediately.
Option B: Whisper Tiny (39M parameters). Decent accuracy. Still requires ~8W on Jetson for continuous inference. Not designed for keyword spotting or command recognition. Too general-purpose.
Option C: Custom Conformer-Tiny (2.1M parameters). Purpose-built for a restricted vocabulary of 47 commands. Under 4W on Jetson Nano. This is what we built.
The key insight is that industrial voice command is not general speech recognition. We do not need to transcribe arbitrary English. We need to distinguish between 47 specific commands in a specific noise environment. This is a closed-vocabulary classification problem, not an open-vocabulary transcription problem. That distinction changes everything.
The DSP Pipeline
Before any audio reaches the neural network, it passes through a carefully tuned digital signal processing pipeline. This is where most of the noise rejection happens. The model does not need to learn to ignore factory noise if the DSP pipeline has already removed it.
```python
import numpy as np


class IndustrialDSPPipeline:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.frame_length = int(0.025 * sample_rate)  # 25ms frames
        self.frame_step = int(0.010 * sample_rate)    # 10ms step
        self.n_mels = 80

    def process(self, raw_audio):
        # Stage 1: High-pass filter to remove machinery rumble (<200Hz)
        audio = self.highpass_filter(raw_audio, cutoff=200, order=6)

        # Stage 2: Adaptive noise gate.
        # Factory noise is relatively stationary, so we estimate
        # the noise floor from the first 500ms of each recording.
        noise_profile = self.estimate_noise_profile(audio[:self.sample_rate // 2])
        audio = self.spectral_subtraction(audio, noise_profile)

        # Stage 3: Voice Activity Detection.
        # Only process frames where speech energy exceeds the noise floor by 6dB.
        vad_mask = self.energy_vad(audio, threshold_db=6)

        # Stage 4: Log-mel spectrogram
        mel_spec = self.compute_log_mel(audio, vad_mask)

        # Stage 5: Cepstral Mean Variance Normalization
        mel_spec = self.cmvn(mel_spec)
        return mel_spec, vad_mask

    def spectral_subtraction(self, audio, noise_profile, alpha=2.0):
        """
        Berouti-style spectral subtraction.
        alpha controls oversubtraction to prevent musical noise.
        """
        stft = np.fft.rfft(self.frame_signal(audio), n=self.frame_length)
        magnitude = np.abs(stft)
        phase = np.angle(stft)
        # Subtract the estimated noise power spectrum, with a spectral
        # floor so no bin is zeroed out entirely
        clean_magnitude = np.maximum(
            magnitude ** 2 - alpha * noise_profile ** 2,
            0.01 * magnitude ** 2  # Spectral floor
        ) ** 0.5
        clean_stft = clean_magnitude * np.exp(1j * phase)
        return self.overlap_add(np.fft.irfft(clean_stft, n=self.frame_length))
```
The spectral subtraction is critical. Factory noise has a characteristic spectral profile: heavy energy below 500Hz from motors and compressors, broadband noise from pneumatic systems, and impulsive noise from stamping presses. By estimating this profile and subtracting it, we boost the effective SNR by 8-12dB before the audio ever reaches the model.
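The `estimate_noise_profile` helper referenced above is not shown; a minimal sketch, assuming it simply averages windowed magnitude spectra over the speech-free leading 500ms, looks like this (the framing parameters match the 25ms/10ms setup at 16kHz):

```python
import numpy as np

def estimate_noise_profile(noise_audio, frame_length=400, frame_step=160):
    """Average magnitude spectrum over frames assumed to contain only noise."""
    num_frames = 1 + (len(noise_audio) - frame_length) // frame_step
    frames = np.stack([
        noise_audio[i * frame_step : i * frame_step + frame_length]
        for i in range(num_frames)
    ])
    frames *= np.hanning(frame_length)  # taper to reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(frames, n=frame_length, axis=1))
    return magnitudes.mean(axis=0)      # one value per frequency bin

# 500ms of synthetic noise at 16kHz
rng = np.random.default_rng(0)
profile = estimate_noise_profile(rng.normal(size=8000))
print(profile.shape)  # (201,)
```

Because factory noise is quasi-stationary, this averaged profile stays valid across the whole utterance, which is what makes the subtraction effective.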
The voice activity detection stage saves compute. We only run the neural network on frames that contain speech. On a typical 3-second audio window, only 0.5-1.5 seconds contain actual speech. Skipping the rest reduces inference compute by 50-70%.
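The `energy_vad` stage is equally simple in principle. A sketch, under the assumption that it compares per-frame energy in dB against a noise-floor estimate taken from the quietest frames (the real implementation may track the floor adaptively):

```python
import numpy as np

def energy_vad(audio, frame_length=400, frame_step=160, threshold_db=6.0):
    """Flag frames whose energy exceeds the estimated noise floor by threshold_db."""
    num_frames = 1 + (len(audio) - frame_length) // frame_step
    frames = np.stack([
        audio[i * frame_step : i * frame_step + frame_length]
        for i in range(num_frames)
    ])
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    noise_floor_db = np.percentile(energy_db, 10)  # quietest frames ~ noise
    return energy_db > noise_floor_db + threshold_db

# quiet background with a loud burst in the middle
rng = np.random.default_rng(1)
audio = rng.normal(scale=0.01, size=16000)
audio[6000:10000] += rng.normal(scale=0.5, size=4000)
mask = energy_vad(audio)  # True only around the burst
```

Gating inference on this boolean mask is what yields the 50-70% compute saving quoted above.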
Conformer-Tiny Architecture
The Conformer architecture combines convolution and self-attention, which makes it excellent at capturing both local acoustic patterns and global temporal dependencies. Our Tiny variant strips it down to the essentials.
```python
import torch.nn as nn


class ConformerTiny(nn.Module):
    """
    2.1M parameter Conformer for 47-class command recognition.

    Architecture:
    - 2 Conformer blocks (vs 12-17 in full Conformer)
    - 144 hidden dimension (vs 512)
    - 4 attention heads (vs 8)
    - 15 convolution kernel (vs 31)
    - Single-layer MLP classifier
    """
    def __init__(self):
        super().__init__()
        self.encoder = ConformerEncoder(
            num_blocks=2,
            d_model=144,
            num_heads=4,
            conv_kernel_size=15,
            ff_expansion_factor=4,
            dropout=0.1
        )
        self.pooling = AttentiveStatisticsPooling(144)
        self.classifier = nn.Sequential(
            nn.Linear(288, 128),  # 288 = mean + std from attentive pooling
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 47)    # 47 commands
        )

    def forward(self, mel_features, lengths):
        # mel_features: [batch, time, 80]
        encoded = self.encoder(mel_features, lengths)  # [batch, time, 144]
        pooled = self.pooling(encoded, lengths)        # [batch, 288]
        logits = self.classifier(pooled)               # [batch, 47]
        return logits
```
Two design decisions deserve explanation.
First, attentive statistics pooling instead of simple mean pooling. In noisy environments, some frames contain more reliable information than others. Attentive pooling learns to weight frames by their reliability, automatically downweighting frames where factory noise overwhelms the speech signal.
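The pooling step can be sketched in plain NumPy to show the mechanics. The scoring MLP parameters `w`, `b`, `v` are illustrative stand-ins for the learned weights inside `AttentiveStatisticsPooling`; shapes follow the 144-dim encoder above:

```python
import numpy as np

def attentive_statistics_pooling(frames, w, b, v):
    """
    frames: [time, d] encoder outputs.
    A small MLP scores each frame, scores are softmax-normalized over time,
    and the weighted mean and std are concatenated into a [2*d] vector.
    """
    scores = np.tanh(frames @ w + b) @ v  # [time] per-frame reliability scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over time
    mean = (weights[:, None] * frames).sum(axis=0)
    var = (weights[:, None] * (frames - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(var + 1e-8)])

rng = np.random.default_rng(2)
d = 144
frames = rng.normal(size=(120, d))  # 120 frames of 144-dim features
pooled = attentive_statistics_pooling(
    frames, rng.normal(size=(d, 64)), np.zeros(64), rng.normal(size=64)
)
print(pooled.shape)  # (288,)
```

The 288-dim output is exactly what the classifier's first `Linear(288, 128)` layer expects.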
Second, convolution kernel size of 15 instead of the standard 31. The restricted vocabulary means we do not need to capture long-range acoustic dependencies within a single convolution. Each command is at most 3-4 syllables. A kernel of 15 frames (150ms at our frame rate) captures an entire syllable, which is sufficient for our task.
Training Strategy
We collected 12,000 utterances covering the 47 commands from 35 speakers with varying accents, recorded on the actual factory floor during normal operations. We augmented this with:
- Noise augmentation: Mixing in recorded factory noise at SNRs from -5dB to +15dB
- Room impulse responses: Convolution with impulse responses measured at different positions on the factory floor
- Speed perturbation: 0.9x to 1.1x playback speed to handle speaking rate variation
- SpecAugment: Time and frequency masking on mel spectrograms
The noise augmentation at negative SNR values is important. At -5dB SNR, the noise is louder than the speech. Training on these extreme conditions forces the model to extract the most robust features.
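Mixing at a target SNR is a one-liner worth getting right; the scaling below follows directly from the power-ratio definition of SNR (function name is illustrative, not from our codebase):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10*log10(P_speech / P_noise_scaled)  =>  solve for the scale
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.normal(scale=0.1, size=16000)
noise = rng.normal(scale=0.3, size=16000)
noisy = mix_at_snr(speech, noise, snr_db=-5)  # noise 5dB louder than speech
```

At training time the target SNR is drawn uniformly from the [-5, 15] range in the config below.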
```python
training_config = {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "weight_decay": 0.01,
    "scheduler": "cosine_with_warmup",
    "warmup_steps": 1000,
    "max_steps": 50000,
    "batch_size": 64,
    "noise_snr_range": [-5, 15],
    "speed_perturb_range": [0.9, 1.1],
    "specaugment": {
        "freq_mask_param": 15,
        "time_mask_param": 20,
        "num_freq_masks": 2,
        "num_time_masks": 2
    },
    "mixup_alpha": 0.3,
    "label_smoothing": 0.1
}
```
Model Compression for Deployment
The trained model at FP32 is 8.4MB. Still too large for fast inference at 4W. We applied three compression stages.
Stage 1: Structured pruning. We removed 30% of channels based on L1-norm importance scores. This reduced the model to 1.5M parameters with only 0.3% accuracy drop.
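The channel-selection step can be sketched as follows; this is a simplified illustration of L1-norm ranking on a single conv weight (a real pass must also remove the matching input channels in downstream layers and fine-tune afterwards):

```python
import numpy as np

def rank_channels_by_l1(weight):
    """
    weight: [out_channels, in_channels, kernel] conv weight.
    Returns channel indices sorted from least to most important,
    scored by the L1 norm of each output channel's filter.
    """
    importance = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)
    return np.argsort(importance)

def prune_channels(weight, fraction=0.3):
    """Drop the lowest-importance fraction of output channels."""
    order = rank_channels_by_l1(weight)
    keep = np.sort(order[int(fraction * weight.shape[0]):])
    return weight[keep]

rng = np.random.default_rng(4)
w = rng.normal(size=(144, 144, 15))   # shapes match the Conformer-Tiny conv
pruned = prune_channels(w, fraction=0.3)
print(pruned.shape)  # (101, 144, 15)
```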
Stage 2: INT8 quantization via TensorRT. Post-training quantization with a calibration dataset of 5,000 representative samples.
```python
import tensorrt as trt


def build_int8_engine(onnx_path, calibration_data):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, 'rb') as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = EntropyCalibrator(calibration_data)
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 28)  # 256MB

    engine = builder.build_serialized_network(network, config)
    return engine
```
Stage 3: Weight sharing. We applied k-means clustering to weight matrices, sharing weights within clusters. This provided an additional 2x compression for storage.
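A toy version of the weight-sharing step, using a hand-rolled 1-D k-means over the scalar weights (in practice you would cluster per-layer and fine-tune the centroids; names here are illustrative):

```python
import numpy as np

def kmeans_share_weights(weights, k=16, iters=20):
    """Cluster scalar weights into k centroids; replace each weight with its
    centroid so only k distinct values (plus small indices) need storing."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)  # linear init
    for _ in range(iters):
        # assign each weight to its nearest centroid, then re-center
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = flat[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids[assign].reshape(weights.shape), assign

rng = np.random.default_rng(5)
w = rng.normal(size=(128, 64))
shared, codes = kmeans_share_weights(w, k=16)
print(np.unique(shared).size)  # at most 16 distinct values
```

With 16 clusters, each weight needs only a 4-bit index plus the shared codebook, which is where the ~2x storage saving comes from.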
Final model size: 1.8MB. Inference time on Jetson Nano: 18ms per 3-second audio window. Power consumption during inference: 3.2W.
Deployment on Jetson
The deployment architecture runs as a lightweight C++ application with the TensorRT engine. The DSP pipeline is implemented in CUDA for GPU acceleration of the spectral subtraction step.
```
┌─────────────────────────────────────────────┐
│ Audio Capture (ALSA)                        │
│ 16kHz, 16-bit, mono                         │
├─────────────────────────────────────────────┤
│ Ring Buffer (3 seconds)                     │
├─────────────────────────────────────────────┤
│ DSP Pipeline (CUDA)                         │
│ Highpass → Spectral Sub → VAD → Mel → CMVN  │
├─────────────────────────────────────────────┤
│ Conformer-Tiny (TensorRT INT8)              │
│ 18ms inference                              │
├─────────────────────────────────────────────┤
│ Command Post-Processing                     │
│ Confidence threshold → Debounce → Command   │
├─────────────────────────────────────────────┤
│ Machine Interface (Modbus/TCP)              │
│ < 5ms command transmission                  │
└─────────────────────────────────────────────┘
```
The post-processing layer is important. We apply a confidence threshold of 0.85 to prevent false activations. A debounce window of 500ms prevents the same command from being recognized twice. And a "wake word" detection stage (a simple 200KB model) gates the full command recognition pipeline, so the larger model only runs when someone is actually speaking to the machine.
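The threshold-plus-debounce logic is small enough to show in full. This is a minimal sketch of the idea (the production version is C++; class and field names here are illustrative):

```python
import time

class CommandPostProcessor:
    """Confidence threshold plus debounce, as described above."""
    def __init__(self, threshold=0.85, debounce_s=0.5):
        self.threshold = threshold
        self.debounce_s = debounce_s
        self.last_command = None
        self.last_time = float("-inf")

    def accept(self, command, confidence, now=None):
        now = time.monotonic() if now is None else now
        if confidence < self.threshold:
            return False  # low confidence: ignore to avoid false activations
        if command == self.last_command and now - self.last_time < self.debounce_s:
            return False  # same command inside the debounce window: duplicate
        self.last_command, self.last_time = command, now
        return True

pp = CommandPostProcessor()
print(pp.accept("STOP", 0.95, now=0.0))   # True
print(pp.accept("STOP", 0.93, now=0.2))   # False: within 500ms debounce
print(pp.accept("STOP", 0.93, now=0.8))   # True: debounce expired
print(pp.accept("START", 0.60, now=0.9))  # False: below 0.85 threshold
```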
Power Management
```c
// Adaptive power management based on activity
typedef enum {
    POWER_SLEEP,      // 0.5W - wake word detection only
    POWER_LISTEN,     // 2.1W - DSP pipeline active
    POWER_INFERENCE,  // 3.2W - full model running
} PowerState;

void update_power_state(AudioState* state) {
    if (!state->wake_word_detected) {
        set_gpu_clock(CLOCK_MIN);  // 76.8 MHz
        state->power = POWER_SLEEP;
    } else if (!state->speech_detected) {
        set_gpu_clock(CLOCK_MED);  // 307.2 MHz
        state->power = POWER_LISTEN;
    } else {
        set_gpu_clock(CLOCK_MAX);  // 921.6 MHz
        state->power = POWER_INFERENCE;
    }
}
```
The device spends 85% of its time in sleep mode, 12% in listen mode, and only 3% in full inference mode. The average power consumption over an 8-hour shift is 0.9W. The battery lasts 6 hours during shift transitions, well above the 2-hour requirement.
Results
After 3 months of deployment across 12 machines:
| Metric | Target | Achieved |
|--------|--------|----------|
| Command accuracy (clean) | 90% | 94.2% |
| Command accuracy (95dB noise) | 80% | 87.1% |
| False activation rate | < 1/hour | 0.3/hour |
| End-to-end latency | < 200ms | 62ms |
| Average power consumption | < 4W | 0.9W |
| Battery life during transitions | > 2 hours | 6.1 hours |
| Model size | < 5MB | 1.8MB |
The 87% accuracy in 95dB noise exceeded expectations by 7 percentage points. The key factors were the aggressive noise augmentation during training (especially the negative SNR conditions) and the spectral subtraction DSP pipeline.
The latency of 62ms total (including audio capture, DSP, inference, and machine command) meant that operators perceived commands as instantaneous. Multiple operators reported that the system felt faster than pressing physical buttons, because voice commands eliminated the hand movement required to reach the control panel.
Lessons Learned
The DSP pipeline is more important than the model. Roughly 60% of our accuracy gain came from the signal processing, not from model architecture. If your input is garbage, no amount of neural network complexity will save you.
Closed vocabulary changes the problem fundamentally. Do not use a general-purpose speech model for a specific-purpose task. The constraint of 47 commands allowed us to build a model roughly 20x smaller than Whisper Tiny by parameter count, with better accuracy on our specific task.
Power management is a feature, not an afterthought. The three-stage power management system gave us 6x better battery life than running the full pipeline continuously. Design for the power budget from day one, not as an optimization pass at the end.
Collect training data in the deployment environment. We recorded all training data on the actual factory floor during normal operations. Teams that record training data in quiet offices and then deploy in noisy environments are setting themselves up for failure.
Edge AI is not about making cloud models smaller. It is about rethinking the problem from the constraints up. The constraints are not limitations. They are the design parameters that lead to better solutions.