# 22 — Disaster Recovery & Business Continuity

## Targets

| Tier | RPO (max data loss) | RTO (max downtime) |
|---|---|---|
| Tier 1 (API, Postgres, payments) | ≤ 5 min | ≤ 1 hour |
| Tier 2 (Dashboard, queues, search) | ≤ 1 hour | ≤ 4 hours |
| Tier 3 (Marketing site, static content) | ≤ 24 hours | ≤ 24 hours |
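
Encoded as constants, these targets can feed alerting and the drill harness below. A minimal sketch; the module name, service names, and tier keys are illustrative, not an existing module:

```python
# dr_targets.py -- the tier table above as machine-readable constants (illustrative).
from datetime import timedelta

# RPO/RTO per tier, straight from the table.
TARGETS = {
    "tier1": {"rpo": timedelta(minutes=5), "rto": timedelta(hours=1)},
    "tier2": {"rpo": timedelta(hours=1),   "rto": timedelta(hours=4)},
    "tier3": {"rpo": timedelta(hours=24),  "rto": timedelta(hours=24)},
}

# Service-to-tier mapping (example names, not a full inventory).
SERVICE_TIER = {
    "api": "tier1", "postgres": "tier1", "payments": "tier1",
    "dashboard": "tier2", "queues": "tier2", "search": "tier2",
    "marketing-site": "tier3",
}

def rpo_for(service: str) -> timedelta:
    """RPO target for a service, e.g. rpo_for("postgres") -> 5 minutes."""
    return TARGETS[SERVICE_TIER[service]]["rpo"]
```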

## Backup strategy

- Postgres: daily managed snapshots retained for 35 days; continuous WAL archiving for 5-minute PITR (freshness check sketched after this list).
- Redis: AOF + snapshot; treated as cache, not the source of truth.
- S3 / R2: versioning, cross-region replication for media.
- Audit logs replicated to a write-once bucket for tamper resistance.
- All backups encrypted at rest; keys rotated on a schedule.
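
A cheap way to catch a silently broken WAL archive is a cron check that the newest archived segment is younger than the Tier 1 RPO. A minimal sketch, assuming the archive lives in an S3 bucket; the bucket name and prefix are placeholders:

```python
# wal_freshness_check.py -- cron check: newest archived WAL segment must be
# recent enough to honor the 5-minute RPO. Bucket/prefix are assumptions.
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "navi-postgres-wal-archive"  # placeholder, not the real bucket
PREFIX = "wal/"
RPO = timedelta(minutes=5)

def newest_wal_age() -> timedelta:
    """Age of the most recently archived WAL segment."""
    s3 = boto3.client("s3")
    newest = datetime.min.replace(tzinfo=timezone.utc)
    # In practice, narrow the prefix to recent segments to keep this fast.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            newest = max(newest, obj["LastModified"])
    return datetime.now(timezone.utc) - newest

if __name__ == "__main__":
    age = newest_wal_age()
    if age > RPO:
        raise SystemExit(f"WAL archive is {age} old -- RPO {RPO} at risk")
    print(f"OK: newest WAL segment archived {age} ago")
```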

## Drills

- Quarterly **restore drill**: restore the latest production snapshot into `staging`; verify writes resume; measure actual RPO/RTO; record gaps (timing harness sketched after this list).
- Annual **full DR drill**: simulate region failure; promote replica; reroute traffic; validate end-to-end.
- Drill results logged in `docs/operations/dr-drills/`.
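
The restore drill lends itself to a small harness that times the restore and appends measured numbers to the drill log. A sketch under stated assumptions: the two shell scripts are hypothetical stand-ins for the real restore and smoke-test steps:

```python
# restore_drill.py -- time the quarterly restore drill and log measured
# RPO/RTO next to the targets. Script paths are illustrative placeholders.
import json
import subprocess
import time
from datetime import datetime, timezone

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

def drill() -> dict:
    started = time.monotonic()
    # 1. Restore the latest snapshot into staging (hypothetical script).
    run(["./scripts/restore-latest-snapshot.sh", "--target", "staging"])
    # 2. Verify writes resume with a smoke write (hypothetical script).
    run(["./scripts/smoke-write.sh", "--env", "staging"])
    return {
        "date": datetime.now(timezone.utc).isoformat(),
        "measured_rto_seconds": round(time.monotonic() - started),
        # Measured RPO = age of the restored data's last transaction;
        # filled in by hand from the snapshot/WAL timestamps.
        "measured_rpo_seconds": None,
        "gaps": [],
    }

if __name__ == "__main__":
    result = drill()
    # Append to the drill log that the Drills section points at.
    with open("docs/operations/dr-drills/results.jsonl", "a") as f:
        f.write(json.dumps(result) + "\n")
    print(result)
```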

## Failure scenarios & runbooks

### S0 — Region outage (cloud provider)
- Lead on-call announces the incident on the status page within 15 min.
- For a partial outage, promote the read replica in an unaffected AZ; if a cross-region replica exists (P3, per the assumptions below), promote it and switch DNS (failover sketched below).
- All writes stay paused until promotion completes; read paths degrade gracefully.
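
Assuming an AWS setup (RDS replica plus a Route 53 CNAME), the promotion-plus-DNS step might look like the sketch below; the replica identifier, region, hosted zone, and record name are all placeholders:

```python
# promote_and_reroute.py -- sketch of the S0 failover, assuming AWS.
import boto3

REPLICA_ID = "navi-db-replica-euc1"   # placeholder
HOSTED_ZONE_ID = "Z0000000000000"     # placeholder
RECORD_NAME = "db.internal.navi.ae."  # placeholder

def promote_replica() -> str:
    """Promote the cross-region read replica; return its endpoint."""
    rds = boto3.client("rds", region_name="eu-central-1")  # assumed region
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)
    # Wait until the promoted instance is available before flipping DNS.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)
    desc = rds.describe_db_instances(DBInstanceIdentifier=REPLICA_ID)
    return desc["DBInstances"][0]["Endpoint"]["Address"]

def switch_dns(endpoint: str) -> None:
    """Point the internal DB CNAME at the promoted instance."""
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME, "Type": "CNAME", "TTL": 60,
                "ResourceRecords": [{"Value": endpoint}],
            },
        }]},
    )

if __name__ == "__main__":
    switch_dns(promote_replica())
```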

### S0 — Database corruption
- Stop writes: flip the API to read-only mode via feature flag (sketched below).
- Restore via PITR to a fresh instance; reconcile divergent data against the audit log.
- Communicate possible booking delays to partners.
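
A sketch of the first two steps, assuming AWS RDS PITR; the instance identifiers are placeholders and `set_read_only` is a stub for whatever feature-flag client is in use:

```python
# corruption_runbook.py -- flip the API read-only, then restore to just
# before the corruption. Identifiers and the flag client are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

def set_read_only(enabled: bool) -> None:
    # Stub: in reality this toggles the read-only flag in the flag service.
    print(f"api.read_only -> {enabled}")

def restore_before(corruption_time: datetime) -> None:
    """Restore a fresh instance to one minute before the corruption."""
    boto3.client("rds").restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="navi-db-primary",   # placeholder
        TargetDBInstanceIdentifier="navi-db-restored",  # placeholder
        RestoreTime=corruption_time - timedelta(minutes=1),
        UseLatestRestorableTime=False,
    )

if __name__ == "__main__":
    set_read_only(True)
    restore_before(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
```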

### S0 — Payment provider outage
- Auto-fail over to a secondary provider if configured (Phase 4); otherwise notify users at checkout with a banner (failover wrapper sketched below).
- Pause auto-cancellation jobs until provider recovers.
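
A minimal sketch of the Phase-4 failover wrapper; the provider classes and exception are hypothetical stand-ins for the real integrations:

```python
# payment_failover.py -- try the primary provider, fall back exactly once.
class ProviderDown(Exception):
    pass

class PrimaryProvider:
    def charge(self, amount_fils: int, token: str) -> str:
        raise ProviderDown("primary unreachable")  # simulate the outage

class SecondaryProvider:
    def charge(self, amount_fils: int, token: str) -> str:
        return "charge_ok_secondary"

def charge_with_failover(amount_fils: int, token: str) -> str:
    """Try the primary; on provider failure, fall back to the secondary once."""
    try:
        return PrimaryProvider().charge(amount_fils, token)
    except ProviderDown:
        # Only provider outages trigger the fallback; if both are down,
        # the exception propagates and the checkout banner is shown.
        return SecondaryProvider().charge(amount_fils, token)
```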

### S1 — Webhook backlog
- Scale up workers; inspect the DLQ; reprocess dead-lettered messages (redrive sketched below).
- Status page updated.
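
A minimal redrive sketch, assuming SQS; the queue URLs are placeholders (where available, SQS's native DLQ redrive is preferable to a hand-rolled loop):

```python
# dlq_redrive.py -- move dead-lettered webhooks back to the main queue.
import boto3

SQS = boto3.client("sqs")
DLQ_URL = "https://sqs.example.amazonaws.com/123456789012/webhooks-dlq"  # placeholder
MAIN_URL = "https://sqs.example.amazonaws.com/123456789012/webhooks"     # placeholder

def redrive(batch_size: int = 10) -> int:
    """Move up to batch_size messages from the DLQ to the main queue."""
    moved = 0
    resp = SQS.receive_message(QueueUrl=DLQ_URL,
                               MaxNumberOfMessages=batch_size,
                               WaitTimeSeconds=2)
    for msg in resp.get("Messages", []):
        # Re-enqueue first, then delete from the DLQ, so nothing is lost
        # if the process dies between the two calls (at-least-once).
        SQS.send_message(QueueUrl=MAIN_URL, MessageBody=msg["Body"])
        SQS.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
        moved += 1
    return moved

if __name__ == "__main__":
    while (n := redrive()):
        print(f"redrove {n} messages")
```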

### S1 — Mass refund event (partner cancels en masse)
- FinanceManager runs the bulk-refund tool; every refund is audit-logged; affected travelers receive email notifications (core loop sketched below).
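
A sketch of the tool's core loop; `refund`, `audit`, and `notify` are placeholders for the real payment, audit-log, and email integrations:

```python
# bulk_refund.py -- one audit entry and one notification per refund.
import logging

log = logging.getLogger("bulk_refund")

def refund(booking_id: str) -> None: ...   # placeholder: payment-provider call
def audit(actor: str, action: str, target: str) -> None: ...  # placeholder: write-once log
def notify(booking_id: str) -> None: ...   # placeholder: traveler email

def bulk_refund(actor: str, booking_ids: list[str]) -> list[str]:
    """Refund every booking; audit each one; return the IDs that failed."""
    failed = []
    for bid in booking_ids:
        try:
            refund(bid)
            audit(actor=actor, action="bulk_refund", target=bid)
            notify(bid)
        except Exception:
            # Keep going: one failed refund must not block the rest.
            log.exception("refund failed for %s", bid)
            failed.append(bid)
    return failed
```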

### S2 — Mobile app crash spike post-deploy
- Roll back via OTA update.
- File a postmortem within 5 business days.

## Communication

- The single source of truth is `status.navi.ae`.
- The Twitter mirror only echoes the status page.
- Affected partners receive direct emails for partner-impacting events.
- Pre-approved holding statements exist for each incident state: "investigating", "identified", "monitoring", "resolved" (templates sketched below).
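
Keeping the statements as templates makes the first post copy-paste rather than composition. The wording below is illustrative, not the approved copy:

```python
# holding_statements.py -- the four pre-approved states as fill-in templates.
HOLDING = {
    "investigating": "We are investigating reports of {summary}. Updates to follow.",
    "identified": "We have identified the cause of {summary} and are working on a fix.",
    "monitoring": "A fix for {summary} is deployed; we are monitoring recovery.",
    "resolved": "The incident affecting {summary} is resolved.",
}

def first_post(state: str, summary: str) -> str:
    """E.g. first_post("investigating", "elevated API errors")."""
    return HOLDING[state].format(summary=summary)
```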

## Escalation matrix

| Severity | Page | Resolve by | Updates |
|---|---|---|---|
| S0 | All on-call + CTO + CPO | ≤ 1 h | every 15 min |
| S1 | Primary on-call + CTO | ≤ 4 h | every 30 min |
| S2 | Primary on-call | ≤ 24 h | every 2 h |
| S3 | Ticket triage | next business day | end-of-day |

## Postmortem

- Blameless template at `docs/templates/incident-postmortem-template.md`.
- Required for every S0 and S1.
- Action items tracked to closure; reviewed in monthly engineering all-hands.

## Business continuity

- A printed, laminated runbook lives at the office. It lists who to call, where status lives, and how to write the first post.
- Vendor contact tree is reviewed every 6 months.
- We hold a quarterly "people drill": what happens if the on-call is unavailable, the CTO is offline, and the country has a long weekend.

## Documented assumptions

- We accept a 5-min data-loss window for the lowest-cost setup; reducing it requires synchronous replication and is a P4 decision.
- Cross-region replication for Postgres ships in P3 once costs are predictable.
- Drills are mandatory; missing two consecutive drills is an S0 process incident.
