Technical Deep-Dive

Building Hyperion-X: A 50,000 TPS Payment Switch for Africa

Datacraft Engineering

June 12, 2026 · 12 min read

A payment made in Nairobi to a recipient in Kampala, two cities 500 kilometres apart, both in East Africa, both served by the same mobile money networks in many cases, has a reasonable chance of routing through a correspondent bank in London before it arrives. The transaction leaves Kenya, crosses to the UK, and comes back. Round-trip latency: 1–3 business days. Round-trip cost: 4–8% of transaction value, depending on how many correspondent hops are in the chain.

I want to be precise about what this means economically. A Kenyan supplier invoicing a Ugandan buyer for $10,000 USD equivalent loses $400–800 to infrastructure that adds zero value to the transaction. The money did not need to go to London. There is no commercial reason for it. It goes to London because the correspondent banking rails are the path of least resistance for institutions that have not built direct bilateral settlement relationships, and building those relationships is expensive. So the cost lands on the businesses instead.

That absurdity, the geographic nonsense of Nairobi→London→Kampala, is the founding premise of Hyperion-X. We built a payment switch designed from the ground up for the African protocol landscape, the African regulatory environment, and the African transaction economics. This post is about what that actually took. The honest version, not the press release version.

The Protocol Decision: ISO 8583, ISO 20022, and the Translator We Had to Build

We started with ISO 8583. That was not a controversial choice: if you are building payment infrastructure that needs to talk to African banks, ISO 8583 is the language they speak. It has been the dominant card and interbank messaging protocol since the 1980s. Binary, field-bitmap, fast, and deeply embedded in every core banking system on the continent. Our initial Hyperion-X implementation was ISO 8583 native, and for the first 18 months that was fine.
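For readers who have not worked with the format, "field-bitmap" means each message carries a bitmap saying which numbered data elements follow. The sketch below decodes such a bitmap under the usual convention (bits numbered from 1, most significant bit first, bit 1 signalling a secondary bitmap for fields 65 to 128). The function name `present_fields` is illustrative and is not taken from the Hyperion-X codebase.

```rust
/// Return the data-element numbers flagged as present in an ISO 8583 bitmap.
/// Bits are numbered from 1, most significant bit first. Bit 1 of the primary
/// bitmap signals that a secondary bitmap (fields 65 to 128) follows.
/// Returns None if the input is shorter than the bitmap it announces.
pub fn present_fields(bitmap: &[u8]) -> Option<Vec<u8>> {
    if bitmap.len() < 8 {
        return None;
    }
    let has_secondary = bitmap[0] & 0x80u8 != 0;
    let len = if has_secondary { 16 } else { 8 };
    if bitmap.len() < len {
        return None;
    }
    let mut fields = Vec::new();
    for (byte_idx, byte) in bitmap[..len].iter().enumerate() {
        for bit in 0..8usize {
            if *byte & (0x80u8 >> bit) != 0 {
                let field = (byte_idx * 8 + bit + 1) as u8;
                // Bit 1 is the secondary-bitmap indicator, not a data element.
                if field != 1 {
                    fields.push(field);
                }
            }
        }
    }
    Some(fields)
}
```

A primary bitmap whose first byte is 0x70 and whose remaining bytes are zero flags fields 2, 3, and 4 (primary account number, processing code, transaction amount).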

Then the Central Bank of Kenya (CBK) started talking seriously about ISO 20022 for the PAPSS integration. ISO 20022 is the XML/JSON-based messaging standard that SWIFT, the BIS, and most central banks are converging on for the next generation of payment infrastructure. PAPSS, the Pan-African Payment and Settlement System, mandates ISO 20022 for its participants. Any commercial bank that wants to connect to PAPSS has to speak ISO 20022. Any payment switch sitting in the middle needs to speak both.

We made the decision to implement both and build an in-flight translator between them. At the time, I estimated it at 6 weeks of work. It took four months and consumed two of our best engineers for most of that period. Not because the translation is technically hard (the field mappings between 8583 and 20022 are well-documented), but because the semantic mismatch is real and subtle.

ISO 8583 has a concept of a transaction amount and a currency code. ISO 20022 has a richer model: instructed amount, settlement amount, equivalent amount, and the currency of each, with separate fields for FX rate, FX rate type, exchange rate contract reference, and the agreed rate date. When you translate a cross-border ISO 8583 transaction into ISO 20022, you have to make decisions about where the FX happened, at what rate, and under what contract, information that the original 8583 message may not contain. We had to define canonical defaults and build fallback logic for the cases where the source data was ambiguous. That is unglamorous work and there is no shortcut through it.
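To make the "canonical defaults and fallback logic" concrete, here is a minimal sketch of one possible policy. Everything in it is an assumption for illustration: the type names, the `FxProvenance` labels, and the choice to treat a missing settlement amount as "no conversion" are hypothetical, not the rules Hyperion-X actually applies. The point is that every ambiguous case gets an explicit, recorded decision rather than a silent guess.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Money {
    pub minor_units: u64,
    pub currency: String, // ISO 4217 alphabetic code, e.g. "KES"
}

/// Amount data available on the ISO 8583 side (hypothetical shape).
pub struct Iso8583Amounts {
    pub transaction: Money,              // DE 4 with DE 49
    pub settlement: Option<Money>,       // DE 5 with DE 50, frequently absent
    pub conversion_rate: Option<String>, // DE 9, kept as raw field text
}

/// How the FX information on the ISO 20022 side was obtained.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FxProvenance {
    NoConversion,        // both currencies present and equal
    CarriedFromSource,   // rate and settlement amount came from the 8583 message
    AssumedNoConversion, // settlement data missing: default applied
    RateUnknown,         // currencies differ but no rate supplied: needs enrichment
}

#[derive(Debug, PartialEq)]
pub struct Iso20022Amounts {
    pub instructed: Money,
    pub settlement: Money,
    pub exchange_rate: Option<String>,
    pub provenance: FxProvenance,
}

/// Translate amounts, recording which default (if any) was applied.
pub fn translate_amounts(src: &Iso8583Amounts) -> Iso20022Amounts {
    let instructed = src.transaction.clone();
    match &src.settlement {
        // Fallback: nothing says a conversion happened, so assume none.
        None => Iso20022Amounts {
            settlement: instructed.clone(),
            instructed,
            exchange_rate: None,
            provenance: FxProvenance::AssumedNoConversion,
        },
        Some(s) if s.currency == instructed.currency => Iso20022Amounts {
            instructed,
            settlement: s.clone(),
            exchange_rate: None,
            provenance: FxProvenance::NoConversion,
        },
        Some(s) => {
            let provenance = if src.conversion_rate.is_some() {
                FxProvenance::CarriedFromSource
            } else {
                FxProvenance::RateUnknown
            };
            Iso20022Amounts {
                instructed,
                settlement: s.clone(),
                exchange_rate: src.conversion_rate.clone(),
                provenance,
            }
        }
    }
}
```

Carrying the provenance alongside the translated amounts means a later reconciliation step can tell a rate that arrived in the source message from one that was defaulted.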

The alternative, two separate switches with a routing layer between them, would have been architecturally cleaner in some ways but operationally worse. A single authoritative transaction store is worth a lot when you are debugging settlement discrepancies at 2am. We made the right call, but I will not pretend the timeline hit was painless.

Post-Quantum Cryptography: We Made the Call Early and I Would Make It Again

Hyperion-X implements CRYSTALS-Kyber for key encapsulation and CRYSTALS-Dilithium for digital signatures. We made that decision in late 2023, before NIST had finalized the standards (NIST FIPS 203 and 204 were published in August 2024). At the time, several people, including people I respect, thought we were being paranoid.

The argument for paranoia is simple: harvest-now, decrypt-later. An adversary with sufficient resources does not need a quantum computer today to threaten the confidentiality of data encrypted today. They need to record encrypted traffic now and decrypt it when the hardware exists. For payment data (account numbers, transaction histories, counterparty relationships), the privacy horizon is long. A transaction made today could be relevant for compliance, litigation, or fraud investigation in 2032. If a state actor or a well-resourced criminal organization is harvesting encrypted payment traffic now, they could be decrypting it in 6–10 years.

The financial services security community calls the day that hardware arrives "Q-Day", and there is legitimate disagreement about when it comes; estimates range from 8 to 20 years. We were not willing to bet customer data on "probably not in the next decade." The cost of implementing post-quantum crypto before it is strictly necessary is some engineering time and a modest performance overhead. The cost of not implementing it, if Q-Day arrives sooner than the optimists expect, is catastrophic and irreversible.

The integration into our Rust core looks like this:

use pqcrypto_kyber::kyber768;
use pqcrypto_dilithium::dilithium3;
use pqcrypto_traits::kem::{PublicKey, SecretKey, SharedSecret};
use pqcrypto_traits::sign::{SignedMessage, VerificationError};

pub struct HyperionKeyPair {
    pub kem_pk:  kyber768::PublicKey,
    pub kem_sk:  kyber768::SecretKey,
    pub sig_pk:  dilithium3::PublicKey,
    pub sig_sk:  dilithium3::SecretKey,
}

impl HyperionKeyPair {
    pub fn generate() -> Self {
        let (kem_pk, kem_sk) = kyber768::keypair();
        let (sig_pk, sig_sk) = dilithium3::keypair();
        Self { kem_pk, kem_sk, sig_pk, sig_sk }
    }

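    /// Encapsulate a session key for a remote participant.
    /// Returns (ciphertext, shared_secret): ciphertext goes on the wire,
    /// shared_secret is used as the symmetric session key.
    pub fn encapsulate_for(remote_pk: &kyber768::PublicKey)
        -> (kyber768::Ciphertext, kyber768::SharedSecret)
    {
        // pqcrypto's encapsulate returns (shared_secret, ciphertext);
        // swap the pair to match the order documented above.
        let (shared_secret, ciphertext) = kyber768::encapsulate(remote_pk);
        (ciphertext, shared_secret)
    }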

    /// Sign an authorization message. We sign the canonical ISO 20022
    /// message digest, not the raw XML, to avoid canonicalization attacks.
    pub fn sign_auth(&self, msg_digest: &[u8]) -> dilithium3::SignedMessage {
        dilithium3::sign(msg_digest, &self.sig_sk)
    }
}

The performance overhead of Kyber-768 key encapsulation versus RSA-2048 is negligible at our transaction volumes; Kyber is actually faster. Dilithium signatures are much larger than ECDSA (about 3.3 KB at the Dilithium3 parameter set versus ~72 bytes), which adds some wire overhead on high-frequency channels, but it is well within our latency budget. We use hybrid mode for the transition period: Dilithium signature plus ECDSA signature on each auth message, so any counterparty that cannot yet verify Dilithium still has a valid classical signature to check.
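The hybrid rule can be sketched as a small verification policy. This is an illustration only: `HybridSignature`, `Capability`, and `verify_hybrid` are hypothetical names, and the two verifier closures stand in for real ECDSA and Dilithium verification, which are not implemented here. The assumed policy is that a counterparty able to verify both signatures requires both to pass, while a classical-only counterparty checks just the ECDSA signature.

```rust
/// Both signatures carried on one authorization message (hypothetical shape).
pub struct HybridSignature {
    pub classical: Vec<u8>,    // e.g. an ECDSA signature
    pub post_quantum: Vec<u8>, // e.g. a Dilithium signature
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Capability {
    ClassicalOnly,
    Hybrid,
}

/// Apply the hybrid policy. The closures are placeholders for real
/// signature verification and return true when a signature is valid.
pub fn verify_hybrid<C, P>(
    msg: &[u8],
    sig: &HybridSignature,
    capability: Capability,
    verify_classical: C,
    verify_pq: P,
) -> bool
where
    C: Fn(&[u8], &[u8]) -> bool,
    P: Fn(&[u8], &[u8]) -> bool,
{
    let classical_ok = verify_classical(msg, sig.classical.as_slice());
    match capability {
        // Counterparty cannot check Dilithium yet: classical signature only.
        Capability::ClassicalOnly => classical_ok,
        // Counterparty can check both: a failure of either rejects the message.
        Capability::Hybrid => classical_ok && verify_pq(msg, sig.post_quantum.as_slice()),
    }
}
```

Requiring both signatures whenever both can be checked means a forged message has to defeat the classical and the post-quantum scheme at once.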

On the harvest-now, decrypt-later threat: Financial transaction records have a 7–10 year regulatory retention requirement under Kenya's Banking Act and equivalent legislation across the continent. Any adversary recording ciphertext today gets a second attempt at every transaction in that archive when sufficiently capable quantum hardware arrives. "Q-Day is far away" is not a defense when the attack vector is already active and the data has a decade-long exposure window.
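One common way to frame this exposure window is Mosca's inequality: data is at risk if the time it must stay confidential plus the time needed to migrate to quantum-safe cryptography exceeds the time until a capable quantum computer exists. The sketch below plugs in the figures quoted in this post (up to 10 years of retention, Q-Day estimates of 8 to 20 years); the migration time is a made-up input, not a Hyperion-X figure.

```rust
/// Mosca's inequality: at risk if shelf_life + migration_time > time_to_q_day.
/// All inputs are in years; `migration_years` is an assumed planning input.
pub fn exposed_to_harvest_now(
    shelf_life_years: u32,
    migration_years: u32,
    years_to_q_day: u32,
) -> bool {
    shelf_life_years + migration_years > years_to_q_day
}
```

With a 10-year retention window and a hypothetical 2-year migration, the inequality holds for the pessimistic 8-year Q-Day estimate and fails for the optimistic 20-year one, which is why the decision turns on how much weight to give the pessimistic end of the range.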

Formal Verification with TLA+: The Bug That Changed Our Process

We had a bug in settlement reconciliation. It cost us three days to diagnose, and it was the kind of bug that is invisible until it isn't: a race condition in the authorization state machine where a timeout response from the downstream bank and a delayed approval could arrive in the wrong order, causing the transaction to be marked declined in our ledger while the bank had already deducted the funds.

In practice it was caught by our end-of-day reconciliation before any customer was affected. But the fact that it existed, that we had shipped code with a race that our test suite did not catch, was sufficiently alarming that we changed our process. We wrote the entire authorization state machine in TLA+.

For readers unfamiliar: TLA+ (Temporal Logic of Actions) is a formal specification language developed by Leslie Lamport. You describe your system's states and transitions as a mathematical model, and a tool called the model checker (TLC) exhaustively explores all possible execution traces to verify that your invariants, the properties you claim always hold, are never violated. It is not testing. Testing samples the execution space. Model checking explores it completely, up to the bounds you specify.

Here is a simplified sketch of the authorization state machine spec:

---- MODULE AuthStateMachine ----
EXTENDS Naturals, Sequences, FiniteSets

CONSTANTS TxIDs, BankResponses

VARIABLES
    tx_state,     \* TxID -> state
    bank_queue,   \* pending messages to downstream bank
    ledger        \* TxID -> credited | debited | none

AuthStates == {"pending", "authorized", "declined", "timeout", "reversed"}

TypeInvariant ==
    /\ tx_state \in [TxIDs -> AuthStates]
    /\ ledger   \in [TxIDs -> {"credited", "debited", "none"}]

\* Critical: a transaction marked declined must never have a debit in the ledger.
\* This is the invariant the race condition violated.
DeclinedNeverDebited ==
    \A tx \in TxIDs :
        tx_state[tx] = "declined" => ledger[tx] # "debited"

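\* A reversed transaction must have been previously authorized.
\* NOTE: tx_state holds only the current state, so this placeholder is trivially
\* true as written; the real constraint is encoded in the transition guards.
ReversalRequiresAuth ==
    \A tx \in TxIDs :
        tx_state[tx] = "reversed" =>
            \E prior \in {"authorized"} : TRUE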

====

The model checker found two invariant violations before we shipped the rewritten state machine. Both were edge cases in timeout handling that our integration tests had not exercised: one involving a very specific interleaving of network partition recovery and bank callback retry, another involving double-processing of a reversal under high queue depth. Neither would have been easy to find by inspection. TLC found them in 40 minutes of model checking time.

I want to be honest about what TLA+ gives you and what it does not. The model is not the code. Code that TLA+ says should be correct can still have bugs (buffer overflows, integer truncation, off-by-one errors in serialization) that are outside the scope of what the spec describes. What it does give you is strong guarantees about a specific class of bugs: races, deadlocks, lost messages, and invariant violations under arbitrary interleavings. That class of bugs is exactly what kills payment systems. The settlement race that cost us three days was precisely the kind of thing TLC would have found if we had written the spec first.
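The idea of checking an invariant under every interleaving can be shown with a toy brute-force explorer. This is not TLC and not the Hyperion-X state machine: the two-event model, the `step_naive` and `step_guarded` handlers, and the `TimedOut` parking state are simplifications invented for illustration. It enumerates every ordering of a timeout and a late bank approval and reports the first ordering that breaks the declined-never-debited invariant.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TxState { Pending, Authorized, Declined, TimedOut }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Ledger { None, Debited }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Event { Timeout, BankApproved }

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct World { pub state: TxState, pub ledger: Ledger }

/// Naive handler: a timeout means "declined", and an approval always
/// records the debit, whatever state the transaction is in.
pub fn step_naive(w: World, e: Event) -> World {
    match e {
        Event::Timeout if w.state == TxState::Pending => World { state: TxState::Declined, ..w },
        Event::BankApproved => {
            let state = if w.state == TxState::Pending { TxState::Authorized } else { w.state };
            World { state, ledger: Ledger::Debited }
        }
        _ => w,
    }
}

/// Guarded handler: a timeout parks the transaction in TimedOut, and a late
/// approval resolves it instead of contradicting a final "declined".
pub fn step_guarded(w: World, e: Event) -> World {
    match (w.state, e) {
        (TxState::Pending, Event::Timeout) => World { state: TxState::TimedOut, ..w },
        (TxState::Pending, Event::BankApproved) | (TxState::TimedOut, Event::BankApproved) => {
            World { state: TxState::Authorized, ledger: Ledger::Debited }
        }
        _ => w,
    }
}

/// The invariant from the spec: declined transactions never carry a debit.
pub fn declined_never_debited(w: World) -> bool {
    !(w.state == TxState::Declined && w.ledger == Ledger::Debited)
}

/// Every ordering of the given events.
pub fn permutations(events: &[Event]) -> Vec<Vec<Event>> {
    if events.len() <= 1 {
        return vec![events.to_vec()];
    }
    let mut out = Vec::new();
    for i in 0..events.len() {
        let mut rest = events.to_vec();
        let first = rest.remove(i);
        for mut tail in permutations(&rest) {
            tail.insert(0, first);
            out.push(tail);
        }
    }
    out
}

/// Return the first ordering under which the invariant is violated, if any.
pub fn find_violation(step: fn(World, Event) -> World, events: &[Event]) -> Option<Vec<Event>> {
    for order in permutations(events) {
        let mut w = World { state: TxState::Pending, ledger: Ledger::None };
        let mut ok = true;
        for &e in &order {
            w = step(w, e);
            ok &= declined_never_debited(w);
        }
        if !ok {
            return Some(order);
        }
    }
    None
}
```

A real model checker explores the reachable state graph rather than permutations of a fixed event list, so it scales to retries, partitions, and queue depth in a way this toy does not; the shape of the check is the same.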

We now write TLA+ specs for every new stateful component before writing any implementation code. The investment in learning curve (roughly two weeks for an engineer with no prior exposure) pays back immediately on the first component spec.

The M-Pesa Integration: Our Dumbest Early Decision

Daraja is a REST API. Our core is message queues: Kafka for internal routing, ISO 8583 framing on the bank-facing side. The translation layer between them was where we made the mistake.

M-Pesa works like this: you initiate an STK push via Daraja, the customer confirms on their handset, Safaricom processes the transaction, and then Safaricom sends a callback to a URL you registered at the time of the STK push. The callback contains the transaction result. Your system confirms receipt with a 200 OK. If you do not return 200, Safaricom retries, up to 4 times with exponential backoff.

In our first implementation, we were treating those callbacks as fire-and-forget on our side. Callback arrives, we process it, we credit the account, we return 200. The bug: under load, our callback processor was occasionally timing out before it finished writing the ledger entry, and returning a non-200 status as a result. Safaricom saw that as a failed delivery and retried. We processed the retry. We credited the account again.

We caught this in staging when our test suite flagged a balance discrepancy. But the root cause took a day to find because nothing in our logs obviously indicated "this account was credited twice." The idempotency key was not being checked before applying the credit; it was only being logged after. One line of logic in the wrong order.

The fix is not complicated in principle:

// Before applying any callback: check idempotency store first.
// The idempotency key is Safaricom's CheckoutRequestID, which is stable
// across retries for the same originating STK push.

async fn handle_mpesa_callback(
    callback: MpesaCallback,
    idempotency_store: &IdempotencyStore,
    ledger: &Ledger,
) -> Result<HttpResponse> {
    // Daraja's JSON names this field CheckoutRequestID; the callback struct
    // is assumed to map it to a snake_case field.
    let key = &callback.body.stk_callback.checkout_request_id;

    // Atomic check-and-set: reports AlreadyProcessed if the key already exists.
    match idempotency_store.claim(key).await? {
        IdempotencyResult::AlreadyProcessed(prior_result) => {
            // Safaricom is retrying something we already handled.
            // Return 200 so they stop retrying, but do NOT re-credit.
            tracing::warn!(%key, "duplicate mpesa callback: returning cached result");
            return Ok(HttpResponse::Ok().json(prior_result));
        }
        IdempotencyResult::Claimed => {
            // First time we've seen this key. Safe to proceed.
        }
    }

    // Now apply the credit under a transaction.
    let result = ledger.apply_credit(&callback).await?;
    idempotency_store.record(key, &result).await?;
    Ok(HttpResponse::Ok().json(result))
}

The idempotency store uses Redis with a TTL of 48 hours, long enough to outlast any realistic Safaricom retry window and short enough that it does not grow unboundedly. The claim operation is an atomic SETNX under the hood. This is not a novel pattern; it is standard distributed systems hygiene. We knew the theory. We failed to apply it correctly in the initial implementation because we were thinking about the happy path (the callback arrives once, we process it, done) rather than explicitly designing for the retry path.
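The claim-then-record semantics can be modelled in memory without Redis. The sketch below is a stand-in for illustration, not the production store: `Claim`, its `InProgress` variant, and the explicit `now` parameter are assumptions added here so the behaviour is deterministic and testable. `InProgress` covers the case where a retry arrives after the key was claimed but before a result was recorded.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
pub enum Claim {
    Claimed,                  // first sighting of this key: caller may proceed
    InProgress,               // claimed earlier, no result recorded yet
    AlreadyProcessed(String), // finished earlier: cached result returned
}

enum Entry {
    InProgress,
    Done(String),
}

/// In-memory model of a set-if-absent store with per-key expiry.
pub struct IdempotencyStore {
    ttl: Duration,
    entries: HashMap<String, (Entry, Instant)>,
}

impl IdempotencyStore {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Claim `key` if it is absent or expired; otherwise report its status.
    pub fn claim(&mut self, key: &str, now: Instant) -> Claim {
        let expired = matches!(self.entries.get(key), Some((_, expires)) if *expires <= now);
        if expired {
            self.entries.remove(key);
        }
        match self.entries.get(key) {
            Some((Entry::Done(result), _)) => Claim::AlreadyProcessed(result.clone()),
            Some((Entry::InProgress, _)) => Claim::InProgress,
            None => {
                self.entries
                    .insert(key.to_string(), (Entry::InProgress, now + self.ttl));
                Claim::Claimed
            }
        }
    }

    /// Store the final result so later retries receive it instead of re-running.
    pub fn record(&mut self, key: &str, result: &str, now: Instant) {
        self.entries
            .insert(key.to_string(), (Entry::Done(result.to_string()), now + self.ttl));
    }
}
```

How a handler should respond to `InProgress` (wait, return a non-200 so the sender retries later, or reconcile out of band) is a design decision this sketch leaves open.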

Lesson from the M-Pesa incident: Any external callback mechanism that has a retry policy must be treated as at-least-once delivery from day one. "We'll add idempotency later" is not a plan: under load, the retry window and your processing latency will overlap sooner than you expect. The idempotency check must happen before any state mutation, not after logging.

AML at Microsecond Scale: You Cannot Block the Transaction Thread

Anti-money laundering at 50,000 TPS is a real-time systems problem dressed up as a compliance problem. At that throughput, you have 20 microseconds per transaction if you are processing serially (you are not, but the arithmetic sets the ceiling). An ML model inference that takes 200ms is not a transaction-path component. It is a post-hoc analysis component. You have to design accordingly.
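The arithmetic behind that ceiling, and behind the decision to keep slow inference off the transaction path, can be written down directly. This is a back-of-the-envelope sketch using Little's law (work in flight equals arrival rate times latency); the function names are illustrative.

```rust
/// Time available per transaction, in nanoseconds, if processing were serial.
/// Returns None for a throughput of zero.
pub fn serial_budget_ns(tps: u64) -> Option<u64> {
    1_000_000_000u64.checked_div(tps)
}

/// Little's law: average number of operations in flight at a given arrival
/// rate (transactions per second) and per-operation latency (microseconds).
pub fn in_flight_at(tps: u64, latency_us: u64) -> u64 {
    tps * latency_us / 1_000_000
}
```

At 50,000 TPS the serial budget is 20,000 ns, the 20 microseconds quoted above. A 0.5 ms step keeps about 25 transactions in flight at once, while a 200 ms inference keeps 10,000 in flight, which is the gap between something that can sit on the transaction path and something that has to run beside it.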

Our AML architecture has two layers with different latency budgets and different failure modes:

Layer 1, synchronous rules engine (<0.5ms, blocking). Hard rules that fire immediately and halt the transaction. Sanctioned-entity match against OFAC/EU/UN consolidated lists, velocity checks (more than N transactions from this account in the last 60 seconds), hard amount thresholds, structuring patterns (9 transactions just under the reporting threshold in 24 hours). This layer runs in-process, compiled, no network calls. If it fires, the transaction is declined with an appropriate response code. The rules are conservative: false positives here are a customer service problem, but the alternative is letting a sanctioned-entity transaction go through while the ML model is thinking about it.

Layer 2, async ML scoring (50–300ms, non-blocking). A graph neural network trained on historical transaction patterns runs in parallel with the transaction authorization. It does not gate the authorization decision. It produces a suspicion score that is written to the transaction record. If the score exceeds a threshold, the transaction is flagged for human review, typically within 30 seconds of completion, well within the regulatory reporting window.
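As an illustration of the kind of rule that fits Layer 1, here is a minimal in-process sliding-window velocity check. It is a sketch under assumptions: the `VelocityRule` type, the use of whole-second timestamps, and the choice not to count rejected attempts toward the limit are all hypothetical, and a production engine would also need eviction of idle accounts.

```rust
use std::collections::{HashMap, VecDeque};

/// Allows at most `max_tx` transactions per account within `window_secs`.
pub struct VelocityRule {
    window_secs: u64,
    max_tx: usize,
    seen: HashMap<String, VecDeque<u64>>, // per-account timestamps, oldest first
}

impl VelocityRule {
    pub fn new(window_secs: u64, max_tx: usize) -> Self {
        Self { window_secs, max_tx, seen: HashMap::new() }
    }

    /// Returns true and records the transaction if the account is under the
    /// limit; returns false (recording nothing) if the limit is already hit.
    pub fn allow(&mut self, account: &str, now_secs: u64) -> bool {
        let timestamps = self.seen.entry(account.to_string()).or_default();
        // Drop timestamps that have aged out of the window.
        while let Some(&oldest) = timestamps.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                timestamps.pop_front();
            } else {
                break;
            }
        }
        if timestamps.len() >= self.max_tx {
            return false;
        }
        timestamps.push_back(now_secs);
        true
    }
}
```

The check does no I/O and touches one hash-map entry, which is the property that lets a rule like this run on the blocking path.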

The false positive rate on the ML layer is our biggest ongoing operational pain point. At 50,000 TPS, a 0.1% false positive rate is 50 transactions per second landing in the review queue, 180,000 per hour. Human review at that volume is not feasible. We use a tiered escalation model: ML flags go to a second automated model (a simpler, faster classifier that acts as a gatekeeper for the review queue), and only the transactions that survive that second filter land in front of a human analyst. The second model is calibrated very aggressively for false-negative minimization: we would rather send 10 borderline cases to a human than miss one genuine SAR-worthy transaction.
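The review-queue arithmetic is worth making explicit, because it shows how much the second-stage filter has to remove. The sketch uses integer parts-per-million rates to avoid floating-point rounding. The second-stage pass rate and the analyst throughput used in the note below are invented inputs for illustration, not Hyperion-X figures.

```rust
/// Flagged transactions per second at a given rate in parts per million.
/// A 0.1% false positive rate is 1_000 ppm.
pub fn flags_per_sec(tps: u64, rate_ppm: u64) -> u64 {
    tps * rate_ppm / 1_000_000
}

/// Scale a per-second count to a per-hour count.
pub fn per_hour(per_sec: u64) -> u64 {
    per_sec * 3_600
}

/// Cases per hour that survive a second-stage filter passing `pass_ppm`.
pub fn after_second_stage(cases_per_hour: u64, pass_ppm: u64) -> u64 {
    cases_per_hour * pass_ppm / 1_000_000
}

/// Analysts needed to keep up, rounding up. `cases_per_analyst_hour` must be non-zero.
pub fn analysts_needed(cases_per_hour: u64, cases_per_analyst_hour: u64) -> u64 {
    (cases_per_hour + cases_per_analyst_hour - 1) / cases_per_analyst_hour
}
```

At 50,000 TPS and 1,000 ppm this gives 50 flags per second and 180,000 per hour, matching the figures above. If a hypothetical second stage passed 1% of those, 1,800 cases per hour would still reach humans, and at a hypothetical 12 cases per analyst-hour that is 150 analysts, so the gatekeeper model's pass rate dominates the staffing cost.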

We are still tuning this. The precision-recall tradeoff in AML ML is genuinely hard because the ground truth labels, whether this transaction is actually money laundering, are sparse, delayed (prosecutions happen years after transactions), and systematically biased toward the cases that were caught. We do not know the false negative rate on our ML model because we cannot observe the fraudulent transactions we are missing. Nobody can. This is a fundamental epistemic limitation of the problem, and anyone who tells you their AML model's false negative rate is confidently estimated is not being straight with you.

What We Got Wrong: The Active-Active Multi-Region Premature Optimization

I want to be direct about this because the temptation to make this mistake is strong and the payment infrastructure literature does not warn about it loudly enough.

We designed Hyperion-X as an active-active multi-region system from the start. Two availability zones in Nairobi, one in Lagos, with synchronous replication and automatic failover. The architecture is correct for a system processing 50,000 TPS across multiple African markets. It is also deeply complex: distributed consensus on the transaction log, cross-region idempotency store synchronization, split-brain detection, geo-routing with automatic fallback.

We shipped the first production deployment of Hyperion-X nine months later than we needed to, and approximately four of those months were spent on multi-region complexity that we did not need yet. At initial production launch, we were processing 2,000 TPS peak. We did not need active-active multi-region until we crossed 20,000 TPS, which took another eight months after launch.

A single-region deployment with synchronous standby and a well-rehearsed failover runbook would have served the first year of production traffic adequately. RTO of 5 minutes is acceptable for most payment use cases; the card networks have that or worse. We would have shipped four months earlier, onboarded paying customers four months earlier, and reached the transaction volumes that actually justified multi-region four months sooner.

The argument for building it upfront is that payment infrastructure is hard to migrate after the fact, and you do not want to do a major architectural change while you are also under production load. That argument is true but it is not a justification for building for 10x your current scale on day one. Build for 3x. Instrument everything. Migrate incrementally when the data tells you it is time.

On premature scale engineering: The fintech context makes this trap especially seductive. "We are building payment infrastructure; it has to be five-nines reliable and infinitely scalable" is a mantra that leads to four months of engineering complexity you cannot monetize. Reliability requirements should be scoped to the actual SLA your customers have contracted for, not the theoretical maximum of what the technology can deliver. A 99.9% SLA does not require active-active multi-region. 99.99% probably does not either.
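The SLA claim can be checked with a downtime budget calculation. The sketch assumes a 365-day year and expresses availability in thousandths of a percent so the arithmetic stays in integers; the function names are illustrative.

```rust
const SECONDS_PER_YEAR: u64 = 365 * 24 * 3_600; // 31_536_000

/// Allowed downtime per year, in whole seconds (rounded down), for an
/// availability given in thousandths of a percent: 99.9% is 99_900.
pub fn downtime_budget_secs(availability_milli_pct: u64) -> u64 {
    let unavailable = 100_000u64.saturating_sub(availability_milli_pct);
    SECONDS_PER_YEAR * unavailable / 100_000
}

/// How many failovers of `rto_secs` each fit inside a downtime budget.
/// `rto_secs` must be non-zero.
pub fn failovers_within_budget(budget_secs: u64, rto_secs: u64) -> u64 {
    budget_secs / rto_secs
}
```

A 99.9% SLA allows 31,536 seconds (8.76 hours) of downtime a year, room for about 105 failovers at a 5-minute RTO. A 99.99% SLA allows roughly 3,153 seconds (about 52.6 minutes), or about 10 such failovers. Five nines leaves about 315 seconds, a single 5-minute failover, which is the point where a standby with a runbook stops being enough.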

Where We Are Now

Hyperion-X is in production processing live transaction volume for a set of financial institutions that we are not at liberty to name publicly yet. The 50,000 TPS figure is our demonstrated peak throughput in load testing; production peak is currently around 18,000 TPS, growing steadily as onboarding progresses. We expect to cross the load-tested ceiling within 18 months and are already planning the next capacity tier.

The post-quantum crypto layer is live. The TLA+-verified authorization state machine is in production and has been for 14 months without a settlement discrepancy. The M-Pesa idempotency bug has not recurred. The AML false positive rate is still higher than we want it and we are actively working on it.

The things I am most proud of are not the 50,000 TPS number; that is just engineering, and if we needed 100,000 TPS we would build for it. I am proud that we built a system that speaks the protocols African financial infrastructure actually uses, that we took the post-quantum threat seriously before it was fashionable to do so, and that one three-day debugging session permanently changed how we think about correctness. Those are the decisions that compound over time.

If you are building payment infrastructure for African markets and want to compare notes, we are reachable. The problems in this space are genuinely interesting and there are not enough people working on them seriously.
