Production Systems Engineering

The hard part was never building it. It was making it work where it has to run.

We design, build, and operate the production systems that mission-critical operations depend on. Bare-metal fleet platforms managing tens of thousands of servers. Machine learning on hardware at the edge, air-gapped and latency-bound. The systems where the field looks nothing like the lab.

Most of what we build is powered by AI. Almost none of it is only AI. The model, when there is one, is the small part. The system around it is where the failures live.

The 95 percent

The system is never where you are looking.

In 2015, Google engineers published "Hidden Technical Debt in Machine Learning Systems", a finding the industry still has not absorbed. In a mature production machine learning system, the core, the thing everyone fixates on, is only a small fraction of the work, commonly put at roughly five percent. The other ninety-five percent is everything around it. The pipelines. The protocols. The infrastructure. The boundary between where the system was built and where it runs.

Every team reaches for that five percent anyway. They review the architecture, tune the model, read the metrics. The system keeps failing, because the failure was never in the part they were looking at. That ninety-five percent is the entire job. It is where we work.

composition of a production system
the model · 5%
everything around it · 95%
pipelines · protocols · drivers · configuration · the build-to-deploy boundary
Failure, repeated

The systems are different. The reason they fail is the same.

01

The boundary between built and deployed

Every system is constructed in one place and runs in another. The two environments start identical and drift apart, quietly, until the gap surfaces as a failure no one can explain. This is the single most common place serious systems break.

02

Environment divergence

The same code produces different behavior on different ground. Two hardware generations return different numbers for identical inputs. A precision default changes the math without a line of code changing.
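
To make the precision point concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not a reproduction of any client system: the helper names (to_fp16, dot_fp64, dot_fp16) and the input values are hypothetical, and it assumes the worst case where every product and partial sum is rounded to FP16. Real accelerators often accumulate in a wider type, so the gap on actual hardware may be smaller.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dot_fp64(a, b):
    """Dot product in Python's native double precision."""
    return sum(x * y for x, y in zip(a, b))

def dot_fp16(a, b):
    """Same dot product, rounding every product and partial sum to FP16.

    Assumption: accumulation also happens in FP16 (worst case for drift).
    """
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc

# Hypothetical inputs: identical "weights" and "inputs" for both paths.
weights = [0.1] * 1000
inputs = [1.0] * 1000

full = dot_fp64(weights, inputs)
half = dot_fp16(weights, inputs)
print(f"fp64: {full:.4f}  fp16: {half:.4f}  gap: {abs(full - half):.4f}")
```

The function bodies are the same arithmetic. Only the rounding applied at each step differs, which is the kind of change a precision default makes without a line of application code changing.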

03

The scale cliff

A system that is correct at a hundred nodes is a different system at twenty-five thousand. Assumptions that held at small numbers collapse, and the failure does not appear until the scale does.

04

Silent drift

Pipelines diverge. Fleets grow. Inputs shift. Nothing throws an error. The metrics that should catch it are computed on the wrong side of the gap, so they stay green while the system degrades underneath them.

05

Invisible from the inside

Every pattern above survives months of effort for one reason. Everyone working on the system stands on the same side of the gap. What is wrong is only visible from outside.

the boundary, measured

Two pipelines that read as one line, until they drift. The gap is the failure. We run the comparison nobody ran, find the skew, and close it.
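
The comparison described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our tooling: training_features, serving_features, and feature_skew are hypothetical names, the records are invented, and the deliberate bug (natural log offline, base-10 log online) stands in for whatever divergence a real pair of pipelines has picked up.

```python
import math

def training_features(record):
    """Hypothetical offline pipeline: log-scales the amount with natural log."""
    return {
        "log_amount": math.log1p(record["amount"]),
        "hour": record["ts"] // 3600 % 24,
    }

def serving_features(record):
    """Hypothetical online pipeline: same intent, but uses log base 10."""
    return {
        "log_amount": math.log10(1 + record["amount"]),
        "hour": record["ts"] // 3600 % 24,
    }

def feature_skew(records, offline, online, tol=1e-6):
    """Per-feature fraction of records where the two pipelines disagree."""
    if not records:
        return {}
    mismatches = {}
    for record in records:
        a, b = offline(record), online(record)
        for name in a.keys() | b.keys():
            va, vb = a.get(name), b.get(name)
            # A feature missing on one side counts as a disagreement.
            differs = va is None or vb is None or abs(va - vb) > tol
            mismatches[name] = mismatches.get(name, 0) + int(differs)
    return {name: count / len(records) for name, count in mismatches.items()}

# Invented raw records, fed through BOTH pipelines.
records = [
    {"amount": 10.0, "ts": 7200},
    {"amount": 250.0, "ts": 86400},
    {"amount": 0.0, "ts": 3600},
]
skew = feature_skew(records, training_features, serving_features)
for name in sorted(skew):
    print(f"{name}: {skew[name]:.0%} of records disagree")
```

The design choice that matters is feeding the same raw records through both pipelines and comparing feature by feature. Metrics computed on only one side cannot see this, and neither pipeline throws an error on its own.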

What we do

We work the full life of a system. You can enter at any stage.

01 / 05

Diagnose

A system is failing and the people closest to it cannot see why. We can, because we come from outside the boundary they are all standing inside.

The systems we work on

Not all of them are AI systems. Calling them that would be the convenient lie. Here is the honest range.

Infrastructure platforms

Server fleet management. GPU-fleet telemetry. Systems that discover, monitor, configure, and control physical hardware across tens of thousands of nodes over Redfish, IPMI, SNMP, and SSH. The platform is the product.

Redfish · IPMI · SNMP · SSH

Machine learning on hardware

Inference running on physical devices at the edge. Air-gapped, latency-bound, with no cloud to fall back on. Where the hardware architecture, the precision mode, and the millisecond budget are as much the problem as the model.

Edge · Air-gapped · FP16

Production machine learning

Fraud, risk, recommendation, diagnostic imaging, search. Models that pass every test and fail in the field. The cause is almost never the model. It is the boundary between training and serving.

Training · Serving · Skew

Automation and infrastructure

Orchestration, protocol integration, telemetry, fleet monitoring, with no model anywhere in it. Sometimes the right system has no AI in it at all. We will tell you when that is the case.

Orchestration · Telemetry
The whole stack

We work the whole stack, from the model to the metal.

Almost everyone in this market works at a single altitude. They know the model. Ask them about the precision mode on the hardware, the Redfish call to the baseboard controller, the service moving telemetry off twenty-five thousand servers, and the conversation ends.

That range is rare, and it is the reason we find what others miss. The failure almost always lives in a layer they never thought to open.

the model (5 percent)
the decision layer everyone fixates on
pipelines
where training and serving quietly drift apart
protocols
Redfish, IPMI, SNMP, treated as the hardware behaves
drivers
the code that talks to the baseboard controller
firmware
varying across a fleet that was meant to be uniform
silicon
the metal, where the millisecond budget is decided
Where we work

We work where the consequence of failure is real. Not a number on a dashboard. A lost contract, a delayed clearance, a stopped line, a regulatory finding.

Selected work

The work, in the only terms that matter: what was at stake, and what changed.

Infrastructure & Fleet Platforms · Architect + Build

A full-lifecycle platform to discover, monitor, and control server fleets at up to 25,000 nodes across a mixed NVIDIA and AMD GPU estate.

25,000 nodes, mixed GPU estate
Industrial Automation · Architect + Build

A predictive-maintenance system tuned to the cost of being wrong, raising warnings the crew believes, with enough lead time to act.

cost-weighted · actionable lead time · calibrated for trust
LLM Systems · Diagnose + Harden

A retrieval system rebuilt to know when it is wrong: grounded, cited, willing to abstain, with a score on every stage.

grounded · faithfulness-checked · abstains when unsure
How we engage

You tell us the situation.
We tell you what we see.

We take on a small number of engagements at a time. The work is deep and it is led personally. No pitch deck. No sales call to survive. If we are the wrong firm for the problem, we will tell you that too.

mostafa@opulion.dev · Response within 24 hours · By inquiry