How MeGuard Tracks 5,000 Guards in Real Time with a $12 GPS Device

The first version of our GPS protocol normalizer supported exactly 12 device types. We now support 200+. That gap represents every private security company (PSC) client who called saying their tracker "doesn't work" with our system. Collectively those calls cost us several months of engineering time, two client relationships that needed repair, and one contract review meeting we'd rather not repeat. We've made that call disappear. Here's how.

The Hardware Reality

African private security companies buy whatever GPS tracker is cheapest on a given day. This is rational procurement: when you need to equip 300 guards before next Monday, you go with whatever's in stock at the electronics market in Kirinyaga Road. In practice, that means we see Teltonika FMB920s on the same deployment as no-name Alibaba units that cost $8 landed and speak a dialect of the GT06 protocol that doesn't match any public documentation we've found.

The two dominant protocols across the East African security market are GT06 and JT808. Between them they cover roughly 70% of what we encounter. The GT06 specification we found online, the most widely circulated version mirrored across at least six sites, had three errors in the packet framing section. Not ambiguities: errors. We found them by capturing real traffic from known-working devices and comparing byte-by-byte against the documented structure.

A minimal GT06 login packet looks like this:

; GT06 Login Packet — annotated hex
;
78 78          ; Start marker (0x78 0x78 = single-byte length variant)
0D             ; Packet length: 13 bytes (Protocol Number → CRC inclusive)
01             ; Protocol Number: 0x01 = Login
08 60 27 84    ; Terminal ID, BCD encoded: leading pad nibble 0 + IMEI digits 1–7
19 94 11 40    ; IMEI digits 8–15 (full IMEI: 860278419941140)
00 01          ; Serial Number (increments per packet, wraps at 0xFFFF)
22 F5          ; CRC-16/IBM checksum (Protocol Number through Serial Number)
0D 0A          ; Stop bytes

; The bad-doc claim: CRC seed = 0xFFFF
; What devices actually use: CRC seed = 0x0000
; Devices affected: approximately 30% of Alibaba GT06 clones
; Symptom: every location packet silently dropped at checksum validation
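
The annotated bytes above can be unpacked in a few lines. This is a minimal Python sketch, not our production parser: `LoginPacket` and `parse_login` are hypothetical names, and it assumes the usual layout in which the 15-digit IMEI is BCD-packed into 8 bytes behind a leading zero pad nibble. Checksum validation is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class LoginPacket:
    imei: str
    serial: int
    crc: int  # checksum as received; not validated in this sketch


def parse_login(frame: bytes) -> LoginPacket:
    """Parse a short-form (0x78 0x78) GT06 login frame."""
    if len(frame) < 5:
        raise ValueError("frame too short")
    if frame[:2] != b"\x78\x78" or frame[-2:] != b"\x0d\x0a":
        raise ValueError("bad start or stop bytes")
    # The length byte counts everything between itself and the stop bytes.
    if frame[2] != len(frame) - 5:
        raise ValueError("length byte does not match frame size")
    if frame[3] != 0x01:
        raise ValueError("not a login packet")
    imei = frame[4:12].hex()[1:]  # 16 BCD nibbles, drop the leading pad
    serial = int.from_bytes(frame[12:14], "big")
    crc = int.from_bytes(frame[14:16], "big")
    return LoginPacket(imei, serial, crc)


frame = bytes.fromhex("78 78 0D 01 08 60 27 84 19 94 11 40 00 01 22 F5 0D 0A")
packet = parse_login(frame)
print(packet.imei, packet.serial)
```

The length check is the cheap first filter: a frame whose length byte disagrees with its actual size is rejected before any field is decoded.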

That CRC seed discrepancy meant every packet from a large class of devices was failing our checksum and being silently dropped. Guards were walking off-route. Supervisors were watching a frozen dot on the map and attributing it to connectivity. It wasn't connectivity.
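
To make the seed problem concrete, here is a small sketch of a bitwise CRC-16/IBM (reflected polynomial 0xA001) with the seed exposed as a parameter. `crc16_ibm` and `matching_seed` are illustrative names, not our production code. The only point it makes is that identical bytes produce different checksums under the two seeds, so a validator hard-coded to one seed rejects every packet from a device using the other.

```python
def crc16_ibm(data: bytes, seed: int = 0x0000) -> int:
    """Bitwise CRC-16/IBM, LSB-first, with a configurable initial value."""
    crc = seed & 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def matching_seed(body: bytes, received_crc: int, seeds=(0x0000, 0xFFFF)):
    """Return the first seed under which the checksum validates, else None."""
    for seed in seeds:
        if crc16_ibm(body, seed) == received_crc:
            return seed
    return None


sample = b"123456789"  # conventional CRC test vector, not a GT06 packet
print(hex(crc16_ibm(sample, 0x0000)))  # 0xbb3d
print(hex(crc16_ibm(sample, 0xFFFF)))  # 0x4b37
```

Logging the result of something like `matching_seed` instead of silently dropping a failed packet would have surfaced the discrepancy on the first capture.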

The Normalization Layer

The naive approach is to add a special case per device type. We did that for the first eight variants. By device type twelve we had a parser file with 600 lines of if-else chains, comments like // this unit swaps lat/lon when heading > 180, and two engineers who refused to touch it without a pair. It was not going to scale.

We replaced it with a protocol DSL, a declarative descriptor format that describes packet structure, field encodings, and known device quirks without touching parser code. Adding a new device type is writing a descriptor, usually 15–25 lines. The parser is a generic interpreter of those descriptors. A GT06 variant descriptor looks roughly like this:

# device-descriptor: gt06-alibaba-clone-v3
extends: gt06-base
crc_seed: 0x0000          # correct value; base uses 0xFFFF (bad-doc default)

location_packet:
  protocol_number: 0x22   # some clones use 0x22 instead of standard 0x12
  fields:
    latitude:
      type: uint32_be
      scale: 1.0e-6
      hemisphere_bit: course_status[10]   # bit 10 of course_status word = N/S
    longitude:
      type: uint32_be
      scale: 1.0e-6
      hemisphere_bit: course_status[11]   # bit 11 = E/W
    altitude:
      type: int16_be
      unit: cm                            # this vendor reports cm, not m
      convert_to: m
    speed:
      type: uint8
      unit: knots
      convert_to: kph
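
As a rough illustration of what "generic interpreter" means, the sketch below walks a field list shaped like the descriptor above and decodes a payload with Python's `struct` module. The names `STRUCT_CODES`, `UNIT_FACTORS`, and `decode_fields` are hypothetical, the fields are assumed to be laid out back to back in descriptor order, and hemisphere bits and other quirks are omitted for brevity.

```python
import struct

# Descriptor type names mapped to struct format codes (assumed vocabulary).
STRUCT_CODES = {"uint8": "B", "int16_be": ">h", "uint16_be": ">H", "uint32_be": ">I"}
# Unit conversions the descriptor may request.
UNIT_FACTORS = {("cm", "m"): 0.01, ("knots", "kph"): 1.852}


def decode_fields(payload: bytes, fields: dict) -> dict:
    """Decode consecutive fields from payload as described by `fields`."""
    out = {}
    offset = 0
    for name, spec in fields.items():
        fmt = STRUCT_CODES[spec["type"]]
        (raw,) = struct.unpack_from(fmt, payload, offset)
        offset += struct.calcsize(fmt)
        value = raw * spec.get("scale", 1)
        if "convert_to" in spec:
            value *= UNIT_FACTORS[(spec["unit"], spec["convert_to"])]
        out[name] = value
    return out


fields = {
    "latitude": {"type": "uint32_be", "scale": 1.0e-6},
    "longitude": {"type": "uint32_be", "scale": 1.0e-6},
    "altitude": {"type": "int16_be", "unit": "cm", "convert_to": "m"},
    "speed": {"type": "uint8", "unit": "knots", "convert_to": "kph"},
}
payload = struct.pack(">IIhB", 1286389, 36817223, 12345, 10)
decoded = decode_fields(payload, fields)
print(decoded)
```

Nothing in `decode_fields` knows about a particular device: a new variant changes only the `fields` data.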

Every parsed packet is normalized into a TelemetryFrame: device IMEI, UTC timestamp, WGS84 coordinates, speed, heading, GPS fix quality, battery voltage, signal strength, and an opaque extensions dict for vendor-specific fields we preserve but haven't modeled. We do not trust device-local clocks: too many devices have clocks that drift or reset on power cycle. Timestamps default to the server-side receive time, corrected against the device's reported GPS timestamp when fix quality is confirmed.
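
For readers who think in types, a TelemetryFrame along those lines might look like the dataclass below. This is a sketch only: the attribute names, units, and validation rules are assumptions made for illustration, not our actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class TelemetryFrame:
    imei: str
    timestamp_utc: datetime
    latitude: float   # WGS84 degrees, south negative
    longitude: float  # WGS84 degrees, west negative
    speed_kph: float
    heading_deg: float
    fix_quality: int
    battery_v: float
    signal_strength: int
    # Vendor-specific fields preserved as-is but not modeled.
    extensions: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        if self.timestamp_utc.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        if not (-90 <= self.latitude <= 90 and -180 <= self.longitude <= 180):
            raise ValueError("coordinates outside WGS84 range")


example = TelemetryFrame(
    imei="860278419941140",
    timestamp_utc=datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc),
    latitude=-1.286389,
    longitude=36.817223,
    speed_kph=4.2,
    heading_deg=90.0,
    fix_quality=1,
    battery_v=3.9,
    signal_strength=20,
)
```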

We went from roughly three days of engineering time per new device type to an afternoon, most of which is capturing a traffic sample to validate the descriptor. The descriptor goes into version control, gets a device family tag, and is reviewable by anyone who can read YAML-style config: no C or binary parsing expertise required.

Scale: What 5,000 Concurrent Devices Actually Looks Like

Five thousand devices at 30-second poll intervals is approximately 167 messages per second at steady state. That's not a large number on its own. The problem is the burst shape. Shift changes happen at 06:00, 14:00, and 22:00. A company deploying 800 guards on a single shift will have most devices send their login handshake and first location packet within a 4-minute window. Measured peak during shift change: around 400 messages per second. The peak-to-average ratio is roughly 2.4x, which matters a lot when you're provisioning consumers.
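
The arithmetic behind those figures, as a quick check. The constants are the numbers quoted above; the variable names are just for this example.

```python
DEVICES = 5_000
POLL_INTERVAL_S = 30
PEAK_MSG_PER_S = 400  # measured shift-change peak

steady_rate = DEVICES / POLL_INTERVAL_S
peak_to_average = PEAK_MSG_PER_S / steady_rate

print(f"steady state: {steady_rate:.0f} msg/s")     # steady state: 167 msg/s
print(f"peak-to-average: {peak_to_average:.1f}x")   # peak-to-average: 2.4x
```

Consumers sized for the steady-state figure alone would fall behind at every shift change.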

The ingestion architecture: TCP listeners per protocol family (GT06 and JT808 speak TCP; some older units use UDP) feed into Kafka. Each device gets a dedicated Kafka partition keyed on IMEI. Partition-per-device is wasteful at low device counts but it gives a hard guarantee: location events for a given guard are always processed in arrival order. No reordering edge cases, no sequence number deduplication complexity in the consumer. At 5,000 devices and 30-second intervals the partition count is manageable and the ordering guarantee is worth the overhead.

Consumers normalize raw packet bytes into canonical TelemetryFrame records and write to two sinks simultaneously: TimescaleDB for time-series track history and a PostGIS-enabled Postgres instance for live geofence evaluation. TimescaleDB's chunk partitioning by time means a query for "last 30 minutes of track for device X" touches one chunk on disk, not the full table. Compression on chunks older than seven days drops storage by roughly 90%: GPS coordinates with timestamps compress extremely well.

Geofence evaluation is a PostGIS ST_Within check against the set of active beat assignments for the device. We pre-simplify site boundary polygons at import time using ST_SimplifyPreserveTopology with a tolerance of approximately 5 metres, which eliminates vertices that don't affect geofence accuracy at guard-patrol resolution and brings worst-case evaluation time on a complex site boundary from around 200ms to around 40ms. A geofence miss alert goes from GPS ping received to supervisor SMS delivered in under 500ms end-to-end at the median. The p99 sits around 725ms, almost entirely attributable to SMS gateway variability: everything upstream of the gateway is under 130ms combined.
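
For intuition about what the containment check does, here is a planar ray-casting point-in-polygon test in plain Python. It is a stand-in for illustration only: PostGIS performs the real check inside the database with spatial indexing, and `point_in_polygon` and the sample boundary are hypothetical.

```python
def point_in_polygon(lon: float, lat: float, polygon: list) -> bool:
    """Ray casting: count how many edges a ray cast in the +lon direction crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# A made-up rectangular beat, as (lon, lat) vertices.
beat = [(36.80, -1.30), (36.82, -1.30), (36.82, -1.28), (36.80, -1.28)]
print(point_in_polygon(36.81, -1.29, beat))  # True
print(point_in_polygon(36.83, -1.29, beat))  # False
```

The loop visits every edge, which is why pre-simplifying a boundary with thousands of vertices pays off directly in evaluation time.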

The M-Pesa Double-Credit Bug

A guard called us. He had received his weekly pay three times: three identical M-Pesa B2C transfers from his employer's paybill, totalling roughly KES 42,000 he was not entitled to. He was, understandably, not complaining loudly. We were.

What broke: We had been treating M-Pesa B2C callbacks as fire-and-forget. Safaricom retries failed callbacks up to four times with exponential backoff. Our idempotency check was on the ConversationID from the Daraja response, but every retry of the originating request came back with a new ConversationID. The same payment was being initiated three times, each with a unique identifier, each passing our check cleanly.

Fix: derive the idempotency key from (guard_id, pay_period_start, pay_period_end, gross_amount_cents). Same tuple, same key, always. If that key already exists in the mpesa_transactions table, return the existing result without re-initiating. We've had zero double-pays since.

The root cause was a conceptual mismatch between what we thought idempotency meant and what Safaricom's retry behavior actually required. We were thinking about it at the callback layer: don't process the same callback twice. The problem was upstream: we were making the same payment multiple times and treating each attempt as a distinct transaction. Idempotency has to be enforced at the business-logic level before anything reaches the API.

The Daraja B2C API has an OriginatorConversationID field that you control. We now set it to a deterministic hash of the business-level tuple, which means Safaricom's own systems can detect and reject duplicate origination attempts before they ever generate a callback. Our database check is the first layer, Safaricom's deduplication is the second, and the callback handler is the third. Defense in depth costs nothing here.

import hashlib

def payroll_idempotency_key(
    guard_id: str,
    pay_period_start: str,   # ISO date
    pay_period_end: str,     # ISO date
    gross_amount_cents: int,
) -> str:
    """
    Deterministic key for B2C payment deduplication.
    Used as OriginatorConversationID in the Daraja request
    and as the dedup key in mpesa_transactions.
    Stable across retries. Changing any component produces a new key.
    """
    raw = f"{guard_id}|{pay_period_start}|{pay_period_end}|{gross_amount_cents}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]
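
One way the database layer of that check could be wired up is sketched below, using an in-memory SQLite table so it runs anywhere. The column names, the `initiate_once` function, and the `send_b2c` callable (a placeholder for the real Daraja request) are all assumptions for illustration. The key idea is that the unique constraint, not application logic, decides who gets to initiate.

```python
import hashlib
import sqlite3


def payroll_idempotency_key(guard_id, pay_period_start, pay_period_end, gross_amount_cents):
    raw = f"{guard_id}|{pay_period_start}|{pay_period_end}|{gross_amount_cents}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


def initiate_once(db, send_b2c, guard_id, start, end, cents):
    """Initiate a payment at most once per business tuple."""
    key = payroll_idempotency_key(guard_id, start, end, cents)
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO mpesa_transactions (idem_key, status) VALUES (?, 'PENDING')",
                (key,),
            )
    except sqlite3.IntegrityError:
        # Key already claimed: return the existing result, send nothing.
        row = db.execute(
            "SELECT status FROM mpesa_transactions WHERE idem_key = ?", (key,)
        ).fetchone()
        return row[0]
    send_b2c(originator_conversation_id=key, amount_cents=cents)
    with db:
        db.execute(
            "UPDATE mpesa_transactions SET status = 'SENT' WHERE idem_key = ?", (key,)
        )
    return "SENT"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mpesa_transactions (idem_key TEXT PRIMARY KEY, status TEXT NOT NULL)")

sent = []  # records what the fake sender was asked to do


def fake_send(**kwargs):
    sent.append(kwargs)


initiate_once(db, fake_send, "G-1001", "2024-01-01", "2024-01-07", 1_400_000)
initiate_once(db, fake_send, "G-1001", "2024-01-01", "2024-01-07", 1_400_000)  # retry
print(len(sent))  # 1
```

A production version would also need to handle a crash between the insert and the send; the sketch leaves such rows in PENDING for a reconciliation step.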

The guard returned the money. We did not ask him to; he called and asked how to send it back. That's the kind of thing that makes the embarrassing bugs easier to process.

The ARC Module: EN 50518 Is Not Checkbox Work

EN 50518 is the European standard for Alarm Receiving Centre operations. Several of MeGuard's larger PSC clients operate ARCs for multinational clients who require EN 50518 compliance by contract. The standard sounds like paperwork until you read what Part 3 actually specifies: a full audit trail for every alarm, defined escalation timeouts with specific maximum response windows, and mandatory operator acknowledgement workflows. It is, in effect, a state machine specification written as a compliance document.

We implemented it as an explicit state machine. Every alarm in the ARC module has a lifecycle with enforced transitions:

# Alarm lifecycle (EN 50518 compliant)
#
# RECEIVED      → ACKNOWLEDGED   operator must ack within T_ack (default: 60s for Grade 3+)
# ACKNOWLEDGED  → DISPATCHED     operator logs action taken
# DISPATCHED    → RESOLVED       confirmation from site or response unit
# RESOLVED      → CLOSED         CLOSED is terminal
# Any open state (RECEIVED, ACKNOWLEDGED, DISPATCHED) → ESCALATED
#                                automatic on timeout; duty supervisor paged immediately
# ESCALATED     → ACKNOWLEDGED   supervisor picks up and works the alarm
#                                (the table below also permits DISPATCHED or CLOSED directly)
#
# Illegal transitions are rejected at the application layer.
# Every transition is written to an append-only audit log:
#   (alarm_id, operator_id, from_state, to_state, timestamp_utc, notes)

ALLOWED_TRANSITIONS = {
    'RECEIVED':      ['ACKNOWLEDGED', 'ESCALATED'],
    'ACKNOWLEDGED':  ['DISPATCHED',   'ESCALATED'],
    'DISPATCHED':    ['RESOLVED',     'ESCALATED'],
    'ESCALATED':     ['ACKNOWLEDGED', 'DISPATCHED', 'CLOSED'],
    'RESOLVED':      ['CLOSED'],
    'CLOSED':        [],
}

The escalation timer runs as a background job. If an alarm sits in RECEIVED beyond the configured T_ack window without an acknowledgement, it automatically transitions to ESCALATED and pages the duty supervisor directly, not via the same operator queue that missed it the first time. The operator dashboard shows a live countdown for every unacknowledged alarm. This is not a UI nicety; it is a compliance requirement and, in a security context, simply correct behavior. An unacknowledged alarm is a failure mode, not a normal state.
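
A compact sketch of how transition enforcement, the audit log, and the timeout check fit together. The transition table is repeated so the example runs on its own; `transition`, `escalate_if_overdue`, the dict-based alarm, and the in-memory `audit_log` list are simplifications invented for this illustration, not the ARC module's actual code.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_TRANSITIONS = {
    "RECEIVED": ["ACKNOWLEDGED", "ESCALATED"],
    "ACKNOWLEDGED": ["DISPATCHED", "ESCALATED"],
    "DISPATCHED": ["RESOLVED", "ESCALATED"],
    "ESCALATED": ["ACKNOWLEDGED", "DISPATCHED", "CLOSED"],
    "RESOLVED": ["CLOSED"],
    "CLOSED": [],
}
T_ACK = timedelta(seconds=60)
audit_log = []  # append-only here; durable storage in any real system


class IllegalTransition(Exception):
    pass


def transition(alarm, to_state, operator_id, now, notes=""):
    """Apply a state change if allowed, recording it in the audit log."""
    from_state = alarm["state"]
    if to_state not in ALLOWED_TRANSITIONS[from_state]:
        raise IllegalTransition(f"{from_state} -> {to_state}")
    audit_log.append((alarm["id"], operator_id, from_state, to_state, now, notes))
    alarm["state"] = to_state


def escalate_if_overdue(alarm, now):
    """What a background job would run for each unacknowledged alarm."""
    if alarm["state"] == "RECEIVED" and now - alarm["received_at"] > T_ACK:
        transition(alarm, "ESCALATED", "system", now, "T_ack exceeded")
        return True
    return False


t0 = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
alarm = {"id": 1, "state": "RECEIVED", "received_at": t0}
escalate_if_overdue(alarm, t0 + timedelta(seconds=30))  # still within T_ack
escalate_if_overdue(alarm, t0 + timedelta(seconds=61))  # escalates
print(alarm["state"], len(audit_log))
```

Routing the automatic escalation through the same `transition` function means the timeout leaves exactly the same audit trail as a human action.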

After any incident, whether a guard injury, a site breach, or a disputed response time, the client can pull the complete alarm record and see exactly who acknowledged it, when, what they logged, and what action was taken. The trail survives operator turnover, shift changes, and insurance audits.

East African Infrastructure Reality: Design for 3G

Most patrol areas outside Nairobi CBD are on 3G. Rural sites, including industrial estates on the fringes of secondary cities, agricultural operations, and remote mining claims, have intermittent coverage or none at all. "Connectivity was unavailable" is not an acceptable explanation to a client when a guard was supposed to complete a checkpoint tour at 3am and there is no record of it happening.

We addressed this by specifying 7-day local track history storage as a minimum requirement for any device we certify for MeGuard deployment. When a device reconnects after an offline period it initiates a bulk sync of the stored backlog before resuming live reporting. At 30-second poll intervals, 7 days is 20,160 records. At full GT06 packet size (roughly 78 bytes per location record) that's under 1.6 MB, well within the flash storage of any device in our fleet.
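
The storage estimate is easy to reproduce. The constants come straight from the paragraph above; the variable names are only for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 7
POLL_INTERVAL_S = 30
BYTES_PER_RECORD = 78  # approximate full GT06 location record

records = RETENTION_DAYS * SECONDS_PER_DAY // POLL_INTERVAL_S
backlog_bytes = records * BYTES_PER_RECORD

print(records)                              # 20160
print(f"{backlog_bytes / 1_000_000:.2f} MB")  # 1.57 MB
```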

The pipeline handles backlog frames differently from live frames. A frame is flagged with is_backlog: true when the device's reported timestamp is more than 90 seconds behind server time; such a packet may carry a timestamp that is hours in the past. We write it to TimescaleDB using that reported timestamp, not the ingestion time, so the track history is accurate. We do not trigger geofence evaluation or beat enter/exit events for historical frames: those events are in the past, and replaying them into the dispatch workflow would generate false alerts. Clients can replay historical tracks and see exactly where a guard was during an offline window.
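
The routing rule reduces to a single comparison. In this sketch `is_backlog` mirrors the flag described above, while `route` and its action names are hypothetical labels for the two sinks, used only to show that backlog frames skip geofence evaluation.

```python
from datetime import datetime, timedelta, timezone

BACKLOG_THRESHOLD = timedelta(seconds=90)


def is_backlog(device_ts: datetime, server_rx: datetime) -> bool:
    """True when the reported timestamp lags server time by more than 90 s."""
    return server_rx - device_ts > BACKLOG_THRESHOLD


def route(device_ts: datetime, server_rx: datetime) -> list:
    """Return the pipeline actions for one frame."""
    if is_backlog(device_ts, server_rx):
        return ["store_track"]  # history only, no alerts replayed
    return ["store_track", "evaluate_geofence"]


now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(route(now - timedelta(seconds=10), now))  # live frame
print(route(now - timedelta(hours=2), now))     # backlog frame
```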

The 500ms Alert Pipeline

End-to-end geofence alert latency (p50 / p99): GPS device transmits ping → TCP listener receives packet (~10ms) → Kafka ingest (~15ms) → normalization consumer (~20ms) → PostGIS ST_Within geofence check (~40ms, GiST indexed) → alert record written to Postgres (~10ms) → SMS gateway API call (~30ms) → Safaricom delivery (~350ms p50 / ~600ms p99). Total: ~475ms p50, ~725ms p99. Everything upstream of SMS delivery, gateway API call included, runs in under 130ms combined. The gateway is the long pole and it always will be.
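
Summing the per-stage estimates is a useful sanity check on a budget like this. The stage figures are the ones listed above; the dictionary keys are labels chosen for this example.

```python
STAGES_MS = {
    "tcp_listener": 10,
    "kafka_ingest": 15,
    "normalization": 20,
    "geofence_check": 40,
    "alert_write": 10,
    "sms_api_call": 30,
}
DELIVERY_MS = {"p50": 350, "p99": 600}  # carrier delivery to the handset

upstream_ms = sum(STAGES_MS.values())
totals_ms = {pct: upstream_ms + delivery for pct, delivery in DELIVERY_MS.items()}

print(f"upstream of delivery: {upstream_ms} ms")
for pct, total in totals_ms.items():
    share = DELIVERY_MS[pct] / total
    print(f"{pct}: {total} ms total, delivery is {share:.0%} of it")
```

Delivery accounts for roughly three quarters of the total at the median and more at the tail, which is why tuning the stages we control has limited effect on what the supervisor experiences.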

The geofence query runs against a PostGIS read replica with a GiST spatial index on the beat geometry column and a covering index on guard_assignments(device_imei, shift_start, shift_end). The boundary is stored as geography (not geometry) so the containment check runs on the WGS84 spheroid without a projection step, relevant for sites near Mombasa Island with maritime boundary polygons that interact oddly with planar projections. One PostGIS detail to be aware of: ST_Within itself accepts only geometry arguments, so on a geography column the containment predicate is ST_Covers or ST_CoveredBy.

Alert fan-out goes to three channels simultaneously: supervisor in-app notification (FCM, ~50ms delivery), supervisor SMS (Africa's Talking gateway), and an entry in the ARC module's alarm queue if the deployment has ARC integration enabled. The in-app notification is the fast path; SMS is the reliable fallback for supervisors who are in the field and may not have the app foregrounded.
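
Concurrent fan-out of this shape can be sketched with `asyncio.gather`. The three sender coroutines below are stubs standing in for the FCM, SMS gateway, and ARC queue clients, and all function names are hypothetical. The detail worth noting is `return_exceptions=True`, so a failure in one channel cannot prevent delivery on the others.

```python
import asyncio


async def send_push(alert: dict) -> str:
    await asyncio.sleep(0)  # stand-in for the push notification call
    return "push:ok"


async def send_sms(alert: dict) -> str:
    await asyncio.sleep(0)  # stand-in for the SMS gateway call
    return "sms:ok"


async def enqueue_arc(alert: dict) -> str:
    await asyncio.sleep(0)  # stand-in for the ARC alarm queue write
    return "arc:ok"


async def fan_out(alert: dict, arc_enabled: bool) -> list:
    """Dispatch an alert on every applicable channel at the same time."""
    tasks = [send_push(alert), send_sms(alert)]
    if arc_enabled:
        tasks.append(enqueue_arc(alert))
    # Results come back in task order; exceptions are returned, not raised.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(fan_out({"alarm_id": 1}, arc_enabled=True))
print(results)
```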

What We Learned

The protocol diversity problem doesn't get easier with time: it gets wider. New manufacturers appear constantly, each with their own GT06 dialect, their own interpretation of what a valid checksum looks like. Building the DSL-based normalizer was the right call at device type twelve. We should have built it at device type three. The cost of the if-else approach compounds fast and the refactor window closes.

Designing for 3G from the start changes every assumption you have about connectivity. Offline-first is not a feature you add later. A system that loses data when the network drops is a liability in this market, not a product.

And the M-Pesa lesson: idempotency is a property of your business logic, not your API handlers. Derive your keys from business-meaningful tuples that are stable across retries. Set them at the point of origination. Let both your system and the payment provider use the same key. Verify this before go-live, because the person who finds the bug might be a guard who received three salaries and still called you to ask how to return the money.