CASE STUDY
10 min read
November 20, 2025

Anomaly Detection in Financial Services: Rules vs ML

A practical comparison of rule-based and ML-based anomaly detection for financial transactions. When to use each, how to combine them, and why the answer is almost never pure ML.

Rules catch the fraudsters you know about. ML catches the ones you do not. You need both.

The False Dichotomy

Every conversation I have with financial services clients about anomaly detection eventually arrives at the same question: should we use rules or ML? The framing implies a choice. It is the wrong framing.

Rules and ML are not competing approaches. They are complementary layers in a detection system, each with distinct strengths that cover the other's weaknesses. The real engineering challenge is not picking one over the other. It is designing the interface between them so they reinforce each other instead of creating blind spots.

I have built anomaly detection systems for three financial services clients over the past four years. In every case, the production system used both rules and ML. In every case, the initial client expectation was that ML would replace their rules. In every case, we had to walk them back from that expectation and explain why the rules were not just legacy baggage but a critical component of the system.

Here is why.

What Rules Do Well

Rule-based detection systems have been the backbone of financial fraud detection for decades. They are simple to understand, simple to audit, and simple to explain to regulators. A rule that says "flag any wire transfer over $50,000 to a country on the sanctions list" is unambiguous. Everyone, from the compliance officer to the regulator to the judge, understands exactly what it does and why.

Rules excel in several specific areas:

Known patterns with clear thresholds. Structuring transactions to stay below reporting thresholds (smurfing) follows predictable patterns. A rule that detects multiple transactions from the same account that sum to just below $10,000 within a 24-hour window catches the bulk of basic structuring attempts. No ML needed.

Regulatory requirements. Many detection requirements are mandated by regulation with specific criteria. OFAC screening, for example, requires checking names against a sanctions list. This is a lookup, not a prediction. Trying to frame regulatory requirements as ML problems adds complexity without benefit.

Velocity checks. Simple thresholds on transaction frequency catch a surprising amount of fraud. "More than 5 card-not-present transactions in 10 minutes" is a rule that fires on compromised cards being tested by fraudsters. It is fast to compute, trivial to implement, and catches a real attack pattern.

Business logic constraints. Some anomalies are definitionally anomalous based on business rules. A dormant account suddenly initiating high-value international transfers is suspicious by policy definition, not by statistical inference. Rules encode these policy definitions directly.

Here is a simplified rule engine architecture we have used:

import logging
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable

log = logging.getLogger(__name__)

@dataclass
class Rule:
    name: str
    description: str
    evaluate: Callable
    severity: str  # "block", "review", "flag"
    regulatory_reference: str | None = None

class RuleEngine:
    def __init__(self):
        self.rules: list[Rule] = []

    def add_rule(self, rule: Rule):
        self.rules.append(rule)

    def evaluate(self, transaction, context) -> list[dict]:
        alerts = []
        for rule in self.rules:
            try:
                triggered, details = rule.evaluate(transaction, context)
                if triggered:
                    alerts.append({
                        "rule_name": rule.name,
                        "severity": rule.severity,
                        "details": details,
                        "regulatory_ref": rule.regulatory_reference,
                    })
            except Exception as e:
                # Rules must never crash the pipeline
                log.error(f"Rule {rule.name} failed: {e}")
                # `metrics` is an application-provided stats client (e.g. StatsD)
                metrics.increment("rule_evaluation_errors",
                                  tags={"rule": rule.name})
        return alerts

# Example rules
def structuring_rule(txn, ctx):
    """Flag sets of sub-threshold transactions summing near the
    $10,000 reporting line (thresholds here are illustrative)."""
    window = timedelta(hours=24)
    recent_txns = ctx.get_transactions(
        txn.account_id, since=txn.timestamp - window
    )
    total = sum(t.amount for t in recent_txns) + txn.amount
    # Every transaction individually stays below the threshold...
    individual_under = all(
        t.amount < 9500 for t in recent_txns
    ) and txn.amount < 9500
    # ...while the 24-hour total lands in a band around $10,000.
    # Require at least two transactions in the window: one
    # sub-threshold deposit on its own is not structuring.
    triggered = (
        len(recent_txns) >= 1
        and 9000 < total < 11000
        and individual_under
    )
    return triggered, {"total_amount": total, "txn_count": len(recent_txns) + 1}

def velocity_rule(txn, ctx):
    """Flag card testing: a burst of transactions on a single card."""
    window = timedelta(minutes=10)
    recent = ctx.get_transactions(
        txn.card_id, since=txn.timestamp - window
    )
    # `recent` excludes the current transaction, so 5 prior
    # transactions means more than 5 in total within the window.
    triggered = len(recent) >= 5
    return triggered, {"txn_count": len(recent) + 1, "window_minutes": 10}
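To make the thresholds concrete, here is a self-contained run of the structuring check against three same-day deposits. `Txn` and `InMemoryContext` are illustrative stand-ins for the production transaction store, and the rule body is a compact copy of the structuring check so the example runs on its own:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Txn:
    account_id: str
    amount: float
    timestamp: datetime

class InMemoryContext:
    """Illustrative stand-in for the production transaction store."""
    def __init__(self, txns):
        self._txns = txns

    def get_transactions(self, account_id, since):
        return [t for t in self._txns
                if t.account_id == account_id and t.timestamp >= since]

def structuring_rule(txn, ctx):
    # Compact copy of the structuring check so this snippet runs standalone
    window = timedelta(hours=24)
    recent_txns = ctx.get_transactions(
        txn.account_id, since=txn.timestamp - window
    )
    total = sum(t.amount for t in recent_txns) + txn.amount
    individual_under = all(
        t.amount < 9500 for t in recent_txns
    ) and txn.amount < 9500
    triggered = 9000 < total < 11000 and individual_under
    return triggered, {"total_amount": total, "txn_count": len(recent_txns) + 1}

# Three same-day deposits of $3,400: each unremarkable alone,
# together just over the $10,000 reporting threshold.
base = datetime(2025, 11, 20, 9, 0)
history = [
    Txn("acct-1", 3400.0, base),
    Txn("acct-1", 3400.0, base + timedelta(hours=3)),
]
current = Txn("acct-1", 3400.0, base + timedelta(hours=6))
triggered, details = structuring_rule(current, InMemoryContext(history))
# triggered is True; details reports the $10,200 total across 3 transactions
```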

Where Rules Break Down

Rules fail in two predictable ways:

Combinatorial explosion. As fraud patterns evolve, organizations add more rules. After a few years, you have 3,000 rules, many of which interact in unexpected ways. Two rules that are individually reasonable can produce absurd results when combined. I audited a system with 2,800 rules where 340 of them had never fired in production. Another 180 had fired but were overridden 98% of the time by analysts. The rule base had become unmaintainable.

Novel patterns. Rules only catch what you have seen before. When fraudsters develop a new technique, there is no rule for it until someone identifies the pattern, designs a rule, tests it, and deploys it. That cycle takes weeks to months. During that window, the new pattern is invisible.

Subtle statistical anomalies. Some anomalies are not about individual transactions crossing thresholds. They are about patterns across many transactions that are individually normal. A customer who gradually shifts their transaction profile over six months, with each individual change being small and unremarkable, will never trigger threshold-based rules. But the cumulative shift may be highly anomalous.

This is where ML earns its place.

What ML Does Well

ML-based anomaly detection excels at finding patterns that are difficult or impossible to express as explicit rules. The three approaches I have deployed most frequently in financial services are:

Autoencoder-Based Anomaly Detection

Train an autoencoder to reconstruct normal transaction patterns. Transactions that the model struggles to reconstruct are anomalous by definition: they do not look like anything the model learned from normal data.

import torch
import torch.nn as nn

class TransactionAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 42, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(32, input_dim),
        )

    def forward(self, x):
        latent = self.encoder(x)
        reconstructed = self.decoder(latent)
        return reconstructed

    @torch.no_grad()
    def anomaly_score(self, x):
        # Call model.eval() before scoring so dropout is disabled
        reconstructed = self.forward(x)
        mse = torch.mean((x - reconstructed) ** 2, dim=1)
        return mse

The advantage of autoencoders is that they do not require labeled fraud data. You train on normal transactions and anything that deviates from learned normal patterns gets a high anomaly score. This catches novel fraud patterns that no rule anticipates.

The disadvantage is interpretability. When the model flags a transaction, explaining why requires additional analysis of which features contributed most to the reconstruction error. This matters enormously in financial services where regulators expect explainable decisions.
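A minimal training and thresholding loop, assuming full-batch training for brevity. The epoch count, learning rate, and percentile below are illustrative choices, not tuned values:

```python
import torch
import torch.nn as nn

def anomaly_scores(model, x):
    """Per-row reconstruction error; higher = more anomalous."""
    model.eval()                      # disable dropout for scoring
    with torch.no_grad():
        reconstructed = model(x)
        return torch.mean((x - reconstructed) ** 2, dim=1)

def train_autoencoder(model, normal_txns, epochs=20, lr=1e-3):
    """Train on normal transactions only; no fraud labels required."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(normal_txns), normal_txns)
        loss.backward()
        opt.step()
    return model

def fit_threshold(model, held_out_normal, percentile=99.5):
    """Set the alert threshold so that roughly (100 - percentile)%
    of held-out normal transactions would be flagged."""
    scores = anomaly_scores(model, held_out_normal)
    return torch.quantile(scores, percentile / 100).item()
```

In production you would train in mini-batches and re-fit the threshold on a rolling window of recent traffic, since "normal" itself drifts.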

Isolation Forest for Outlier Detection

For tabular transaction features, Isolation Forest provides a fast, interpretable anomaly detection method. It works by randomly partitioning the feature space and measuring how many partitions are needed to isolate each data point. Anomalies are isolated quickly because they sit in sparse regions of the feature space.

from sklearn.ensemble import IsolationForest
import numpy as np

class TransactionOutlierDetector:
    def __init__(self, contamination: float = 0.001):
        self.model = IsolationForest(
            n_estimators=200,
            contamination=contamination,
            random_state=42,
            n_jobs=-1,
        )
        self.feature_names = None

    def fit(self, features, feature_names: list[str]):
        self.feature_names = feature_names
        self.model.fit(features)

    def score(self, features) -> np.ndarray:
        # Returns anomaly scores; more negative = more anomalous
        return self.model.decision_function(features)

    def explain(self, features, idx: int) -> dict:
        """
        Approximate feature importance for a single prediction
        by measuring score change when each feature is replaced
        with its median value.
        """
        baseline_score = self.model.decision_function(
            features[idx:idx+1]
        )[0]
        importances = {}
        for i, name in enumerate(self.feature_names):
            perturbed = features[idx:idx+1].copy()
            perturbed[0, i] = np.median(features[:, i])
            new_score = self.model.decision_function(perturbed)[0]
            importances[name] = baseline_score - new_score
        return dict(sorted(
            importances.items(),
            key=lambda x: abs(x[1]),
            reverse=True
        ))
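To see the isolation property in action, here is a quick check on fabricated data with five injected outliers far from the normal cluster:

```python
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(5000, 4))   # bulk of traffic
outliers = rng.normal(8.0, 1.0, size=(5, 4))    # sparse, far-away region
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=200, contamination=0.001,
                        random_state=42, n_jobs=-1)
model.fit(X)
scores = model.decision_function(X)             # more negative = more anomalous

# The injected outliers (indices 5000-5004) should rank most anomalous
worst_five = np.argsort(scores)[:5]
```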

Temporal Pattern Models

For detecting gradual behavioral shifts, we use sequence models that learn customer transaction patterns over time. An LSTM or Transformer model trained on sequences of customer transactions learns what "normal" looks like for each customer segment. When a customer's recent behavior diverges from their historical pattern, the model assigns a high anomaly score.

This catches the slow-drift scenario that rules miss entirely: a customer whose transaction profile changes gradually over months in a way that is individually unremarkable but collectively suspicious.
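One way to sketch the idea: an LSTM that predicts the next transaction's feature vector from the preceding window, where sustained prediction error signals drift. The architecture and dimensions here are illustrative, not the production model:

```python
import torch
import torch.nn as nn

class SequenceAnomalyModel(nn.Module):
    """Predicts the next transaction's feature vector from the
    preceding window; sustained prediction error signals drift."""

    def __init__(self, n_features: int = 16, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, seq):
        # seq: (batch, time, n_features)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1, :])   # prediction for the next step

    @torch.no_grad()
    def drift_score(self, seq, next_txn):
        """Squared error between predicted and observed next step.
        Averaged over a rolling window, a rising score means the
        customer's behavior is leaving its learned regime."""
        pred = self.forward(seq)
        return torch.mean((pred - next_txn) ** 2, dim=1)
```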

The Hybrid Architecture

In production, every system I have built uses a layered architecture where rules and ML operate in sequence:

Layer 1: Hard rules (blocking). Sanctions screening, velocity limits, regulatory thresholds. These run first and can block transactions immediately. No ML override. Zero tolerance for false negatives on regulatory requirements.

Layer 2: ML models (scoring). The autoencoder, isolation forest, and temporal models each produce an anomaly score. These scores are combined into a composite risk score using a calibrated ensemble.

Layer 3: Soft rules (enrichment). Additional rules that add context to the ML score. "This account was opened less than 30 days ago" is not a reason to block a transaction, but it raises the risk context for analyst review.

Layer 4: Decision logic (routing). Based on the combined rule alerts and ML scores, route the transaction to: approve, block, or queue for human review.
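The Layer 4 routing can reduce to threshold bands on the composite score, with soft alerts nudging the bar for review. A sketch with illustrative thresholds:

```python
def route(ml_score: float, soft_alert_count: int,
          review_threshold: float = 0.7,
          block_threshold: float = 0.95) -> str:
    """Map a composite risk score in [0, 1] plus soft-rule context
    to an action. Soft alerts lower the bar for human review but
    never cause an automated block on their own; blocking stays
    with Layer 1 hard rules and very high ML confidence."""
    if ml_score >= block_threshold:
        return "block"
    # Each soft alert lowers the review bar a little
    effective_review = review_threshold - 0.05 * soft_alert_count
    if ml_score >= effective_review:
        return "review"
    return "approve"
```

The key property is that soft rules can only escalate toward review: automated blocking remains gated on the hard rules and the ML score alone.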

class HybridDetectionPipeline:
    def __init__(self, rule_engine, ml_scorer, decision_router):
        self.rule_engine = rule_engine
        self.ml_scorer = ml_scorer
        self.decision_router = decision_router

    def evaluate(self, transaction, context) -> dict:
        # Layer 1: Hard rules
        rule_alerts = self.rule_engine.evaluate(transaction, context)
        blocking_alerts = [a for a in rule_alerts if a["severity"] == "block"]
        if blocking_alerts:
            return {
                "decision": "block",
                "reason": "rule_block",
                "alerts": blocking_alerts,
            }

        # Layer 2: ML scoring (feature extraction details elided)
        features = self.extract_features(transaction, context)
        ml_scores = self.ml_scorer.score(features)

        # Layer 3: Soft rules for context enrichment
        soft_alerts = [a for a in rule_alerts if a["severity"] == "flag"]

        # Layer 4: Decision routing
        decision = self.decision_router.route(
            ml_scores=ml_scores,
            rule_alerts=rule_alerts,
            soft_alerts=soft_alerts,
            transaction=transaction,
        )

        return decision
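The "calibrated ensemble" from Layer 2 can be as simple as mapping each detector's raw score to a percentile rank against a reference sample before weighting. This sketch assumes every raw score is oriented so that higher means more anomalous (flip the isolation-forest sign first), and the weights are illustrative:

```python
import numpy as np

class CalibratedEnsemble:
    """Combine heterogeneous anomaly scores on a common [0, 1] scale.

    Each detector's raw score is mapped to its percentile rank within
    a reference sample of recent scores, so an autoencoder MSE and an
    isolation-forest value become comparable before weighting."""

    def __init__(self, weights: dict[str, float]):
        self.weights = weights
        self.reference: dict[str, np.ndarray] = {}

    def fit(self, reference_scores: dict[str, np.ndarray]):
        # Sort once so each percentile lookup is a binary search
        self.reference = {k: np.sort(v) for k, v in reference_scores.items()}

    def combine(self, scores: dict[str, float]) -> float:
        total, weight_sum = 0.0, 0.0
        for name, raw in scores.items():
            ref = self.reference[name]
            pct = np.searchsorted(ref, raw) / len(ref)  # percentile rank
            total += self.weights[name] * pct
            weight_sum += self.weights[name]
        return total / weight_sum
```

Percentile calibration also makes thresholds stable when a model is retrained: the composite score keeps its meaning even if a new autoencoder shifts the raw MSE scale.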

The Feedback Loop Problem

The most underappreciated challenge in financial anomaly detection is the feedback loop. You train your ML model on historical data where fraud is labeled based on what your existing system detected. But your existing system has blind spots. The fraud it missed is labeled as legitimate in your training data, because nobody flagged it.

This means your ML model inherits the blind spots of the system that generated its labels. It learns to detect the same fraud patterns that rules already catch, while remaining blind to the novel patterns that rules also missed.

Breaking this cycle requires:

  • Unsupervised anomaly detection that does not depend on fraud labels
  • Random sampling of approved transactions for manual review to discover undetected fraud
  • External fraud reports (chargebacks, law enforcement notifications) fed back into the training data
  • Regular red-team exercises where internal teams attempt to design fraud patterns that evade the current system
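Of these, random review sampling is the cheapest to implement. One sketch hashes the transaction ID so the audit decision is deterministic and replay-safe; the 0.1% rate is illustrative:

```python
import hashlib

def selected_for_review(txn_id: str, sample_rate: float = 0.001) -> bool:
    """Deterministically sample approved transactions for manual review.

    Hashing the transaction ID (rather than calling random()) makes the
    decision reproducible: the same transaction is always in or out of
    the audit sample regardless of retries or pipeline replays."""
    digest = hashlib.sha256(txn_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < sample_rate
```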

Practical Recommendations

After three deployments in this space, here is what I tell every financial services client:

Do not rip out your rules. They are battle-tested, explainable, and regulators understand them. Layer ML on top.

Start with unsupervised methods. Autoencoders and isolation forests do not need fraud labels. They find things that look different from normal. This sidesteps the feedback loop problem for the initial deployment.

Invest in feature engineering. The features matter more than the model. Transaction velocity, amount deviation from customer baseline, time-of-day patterns, merchant category patterns, geographic patterns. Good features make simple models powerful.
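As one concrete example from that list, "amount deviation from customer baseline" can be a robust z-score. Using median and MAD instead of mean and standard deviation is a deliberate choice, so a few prior large transactions do not inflate the baseline and mask the next outlier:

```python
import numpy as np

def amount_deviation(amount: float, history: np.ndarray) -> float:
    """Robust z-score of a transaction amount against the customer's
    transaction history (median / MAD rather than mean / std)."""
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    if mad == 0:
        # Degenerate history (all identical amounts)
        return 0.0 if amount == median else float("inf")
    # 1.4826 scales MAD to be comparable to a standard deviation
    return abs(amount - median) / (1.4826 * mad)
```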

Build the explanation layer from day one. Analysts will not act on a score without understanding why it was assigned. Every anomaly score must come with a human-readable explanation of the contributing factors.

Measure false positive rate obsessively. In financial services, false positives have real costs: blocked legitimate transactions, angry customers, analyst time spent reviewing non-issues. A model with 99% detection rate and 5% false positive rate on a million daily transactions generates 50,000 false alerts per day. That is not a detection system. It is a noise generator.

The goal is not maximum detection. It is optimal detection: the point where you catch the most fraud with the fewest false positives and the clearest explanations. That point is almost always a hybrid system where rules and ML each handle what they do best.

Discussion (2)

risk_head_bank · Head of Risk, Banking · 1 week ago

Our rule-based system has 1,200+ rules and generates 600 alerts/day. 94% are false positives. The compliance team won't let us remove any rules because 'what if we miss something.' Meanwhile, our analysts spend 80% of their time chasing ghosts. The hybrid approach you describe — rules for compliance, ML for detection — is the first architecture I've seen that might actually get past our compliance committee.

Mostafa Dhouib (Author) · 1 week ago

The 'can't remove rules' problem is universal in regulated industries. The hybrid approach works because you're not removing rules — you're adding a layer that re-ranks the alerts by ML confidence score. Rules still fire (compliance happy), but analysts see the ML-ranked queue first (efficiency up 5-10x). The SHAP explainability layer is what sells this to compliance: for every ML flag, the system shows exactly which features drove the score. Regulators love this because it's more auditable than 'rule 847 fired because threshold X was exceeded.'

Mostafa Dhouib · Founder & ML Engineer at Opulion
