# 21 — Infrastructure & Deployment

## Environments

| Env | Audience | Data | Auto-deploy |
|---|---|---|---|
| `local` | Developers | Synthetic / seed | n/a |
| `dev` | Engineers | Synthetic / seed | every commit on `main` |
| `staging` | QA, partner UAT | Anonymized snapshots | every commit on `main` |
| `production` | Travelers, partners | Real | manual promotion + canary |

`dev` and `staging` maintain parity with production: they use the same providers, with mocked secrets where real credentials are not needed.

## Cloud (target)

UAE-region infrastructure preferred for residency and latency. Default to one cloud provider; abstract anything that would lock us in.

```
                ┌─ Cloudflare CDN ─────────────────────────┐
                │                                          │
                ▼                                          ▼
            navi.ae                              dashboard.navi.ae
              (web)                                  (web)
                │                                          │
                ▼                                          ▼
        ┌────────────── api.navi.ae (NestJS pods) ──────────────┐
        │                                                        │
        ▼                                                        ▼
   PostgreSQL (HA, backups)                            Redis (HA)
   S3 / R2 (media)                                    BullMQ workers
                                                       (queues, retries)
```

App tiers run on managed Kubernetes (EKS / GKE) or a serverless-container target (Cloud Run / Fly.io) — chosen by the SRE based on operational maturity.

## Domains

- `navi.ae` — marketing site
- `dashboard.navi.ae` — admin/partner
- `api.navi.ae` — API v1 (`/v1`)
- `cdn.navi.ae` — media
- `status.navi.ae` — status page

The mobile app reads `EXPO_PUBLIC_API_URL` to select which environment's API it talks to.
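
As a rough illustration of how the mobile client might turn that variable into a base URL, the sketch below validates the value and normalises it to end in `/v1`. The function name `resolveApiBaseUrl` and the normalisation rules are assumptions made for this example, not something this document specifies.

```typescript
// Sketch only: `resolveApiBaseUrl` is a hypothetical helper, not an existing module.
// It takes the raw value of EXPO_PUBLIC_API_URL and returns "<base>/v1".
function resolveApiBaseUrl(raw: string | undefined): string {
  if (!raw || !/^https?:\/\/\S+$/.test(raw)) {
    throw new Error("EXPO_PUBLIC_API_URL must be set to an http(s) URL");
  }
  // Drop trailing slashes so "https://host/" and "https://host" behave the same.
  const trimmed = raw.replace(/\/+$/, "");
  // Avoid doubling the version prefix if the variable already includes it.
  return trimmed.endsWith("/v1") ? trimmed : `${trimmed}/v1`;
}
```

In an Expo app this would typically be called as `resolveApiBaseUrl(process.env.EXPO_PUBLIC_API_URL)`, since Expo inlines `EXPO_PUBLIC_*` variables at build time only when they are referenced by that literal expression.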

## CDN & edge

- All static assets served from Cloudflare/CloudFront.
- Cache policies: marketing site aggressive, dashboard none, API selective per route.
- WAF in front of API and dashboard with managed rules + bot protection.
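
The cache policy above could be expressed as a small lookup that maps a host and path to a `Cache-Control` value. This is a sketch under stated assumptions: the TTL numbers and the `/v1/destinations` route are placeholders invented for illustration, and the real per-route API rules are not defined in this document.

```typescript
// Hypothetical allow-list of cacheable API routes and their edge TTLs (seconds).
// The route below is a placeholder, not a documented endpoint.
const CACHEABLE_API_ROUTES: ReadonlyMap<string, number> = new Map([
  ["/v1/destinations", 300],
]);

// Returns a Cache-Control header value for a request, following the policy:
// marketing site aggressive, dashboard none, API selective per route.
function cacheControlFor(host: string, path: string): string {
  if (host === "navi.ae") {
    // Aggressive: long edge TTL, shorter browser TTL (values are assumptions).
    return "public, max-age=3600, s-maxage=86400";
  }
  if (host === "api.navi.ae") {
    const ttl = CACHEABLE_API_ROUTES.get(path);
    return ttl === undefined ? "no-store" : `public, s-maxage=${ttl}`;
  }
  // Dashboard and anything unrecognised: never cache.
  return "no-store";
}
```

Defaulting unknown hosts and routes to `no-store` keeps the failure mode safe: a route is only cached once someone has opted it in explicitly.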

## Containerization

- Each app is a Docker image: `navi-api`, `navi-dashboard`, `navi-website`. Mobile is an EAS build.
- Images built in CI and pushed to a private registry tagged with git SHA + semver.
- SBOM generated per image; vulnerability scan on push.
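
One way to read "tagged with git SHA + semver" is that each image is pushed under two tags, one per identifier. The sketch below assumes that reading; the helper name `imageTags` and the registry host `registry.example.com` are placeholders, not the actual registry.

```typescript
// Sketch: compute the tags CI would push for one image build.
// Assumption: two separate tags (SHA and semver) rather than one combined tag.
function imageTags(
  image: string,
  gitSha: string,
  version: string,
  registry = "registry.example.com", // placeholder registry host
): string[] {
  if (!/^[0-9a-f]{7,40}$/.test(gitSha)) {
    throw new Error(`not a git SHA: ${gitSha}`);
  }
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`not a MAJOR.MINOR.PATCH version: ${version}`);
  }
  const repo = `${registry}/${image}`;
  // The SHA tag pins an exact build; the semver tag is the human-facing release.
  return [`${repo}:${gitSha}`, `${repo}:${version}`];
}
```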

## Infrastructure as Code

- Terraform for cloud resources (VPC, RDS, Redis, S3 buckets, IAM, secrets).
- Helm charts for Kubernetes deployments; alternatively, Pulumi as a single tool covering both cloud resources and deployments.
- All infra changes through PR review, plan published in PR.

## Database

- Managed Postgres 16 with read replica.
- Daily snapshots (35-day retention), plus point-in-time recovery (PITR) with a target recovery point no more than 5 minutes old.
- Connection pooling via PgBouncer / RDS Proxy.
- Migrations: Prisma `migrate deploy` runs in a release job, **never** during regular pod boot. Migrations must be safe to apply online, meaning the schema stays compatible with the application version that is already running; destructive changes are gated behind feature flags.

## Caching

- Redis for session bookkeeping, idempotency keys, rate limit buckets, light cache.
- TTLs explicit; no unbounded caches.
- Cache keys are versioned with the deploy SHA wherever entries written by a previous release must not be read by the new one.
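
A minimal sketch of the two rules above (explicit TTLs, optional deploy-SHA versioning) is shown below. The helper `planCacheEntry` and the key layout `namespace:sha:id` are assumptions for illustration; the actual key scheme is not defined in this document.

```typescript
interface CacheEntryPlan {
  key: string;
  ttlSeconds: number;
}

// Builds a cache key and forces the caller to state a TTL.
// When `deploySha` is given, the short SHA is embedded so a new deploy
// never reads entries written by the previous one.
function planCacheEntry(
  namespace: string,
  id: string,
  ttlSeconds: number,
  deploySha?: string,
): CacheEntryPlan {
  if (!Number.isInteger(ttlSeconds) || ttlSeconds <= 0) {
    throw new Error("ttlSeconds must be a positive integer (no unbounded caches)");
  }
  const version = deploySha ? `:${deploySha.slice(0, 7)}` : "";
  return { key: `${namespace}${version}:${id}`, ttlSeconds };
}
```

Old versioned entries are never deleted explicitly in this scheme; they simply expire through their TTL, which is one reason the TTL is mandatory.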

## Queues

- BullMQ on Redis for async jobs: email, push, audit dispatch, webhook retries, report generation.
- Each queue has a DLQ; the DLQ is observed.
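
The retry-then-dead-letter behaviour can be sketched as a pure decision function, independent of the queue library. BullMQ wiring (queues, workers, failure handlers) is deliberately left out; the names `routeFailedJob` and `backoffDelayMs`, the doubling backoff, and the 1-second base delay are assumptions for illustration.

```typescript
interface FailedJob {
  attemptsMade: number; // attempts so far, including the one that just failed
  maxAttempts: number;  // configured attempt budget for the job
}

type FailureRoute =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// Exponential backoff: the delay doubles with each failed attempt.
function backoffDelayMs(attemptsMade: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}

// Decides what happens to a job after a failure: retry with backoff
// while attempts remain, otherwise hand it to the dead-letter queue.
function routeFailedJob(job: FailedJob, baseDelayMs = 1000): FailureRoute {
  if (job.attemptsMade >= job.maxAttempts) {
    return { action: "dead-letter" };
  }
  return { action: "retry", delayMs: backoffDelayMs(job.attemptsMade, baseDelayMs) };
}
```

Keeping the decision separate from the transport makes it easy to unit-test and to reuse across the email, push, webhook, and report queues.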

## Release process

1. PR → CI green → merge to `main`.
2. `main` deploys to `dev` and `staging` automatically.
3. After QA sign-off on `staging`, the build is manually promoted to production.
4. Production is a canary deploy: 5% of traffic for 30 min → 25% → 50% → 100%.
5. Automated rollback if SLO burn or error rate exceeds thresholds.
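
The canary progression and rollback rule in steps 4 and 5 can be sketched as a single decision function. The traffic steps come from the list above; the metric names, the threshold shape, and the function `decideCanary` are assumptions for illustration, and dwell times between steps are not modelled.

```typescript
// Traffic steps from the release process: 5% -> 25% -> 50% -> 100%.
const CANARY_STEPS: readonly number[] = [5, 25, 50, 100];

interface CanaryMetrics {
  errorRate: number;   // fraction of failed requests, e.g. 0.01 = 1%
  sloBurnRate: number; // error-budget burn rate multiple
}

interface CanaryThresholds {
  maxErrorRate: number;
  maxSloBurnRate: number;
}

type CanaryDecision =
  | { action: "rollback" }
  | { action: "promote"; trafficPercent: number }
  | { action: "done" };

// Evaluated at the end of each step: roll back on any threshold breach,
// otherwise move to the next traffic step (or finish at 100%).
function decideCanary(
  currentPercent: number,
  metrics: CanaryMetrics,
  thresholds: CanaryThresholds,
): CanaryDecision {
  if (
    metrics.errorRate > thresholds.maxErrorRate ||
    metrics.sloBurnRate > thresholds.maxSloBurnRate
  ) {
    return { action: "rollback" };
  }
  const idx = CANARY_STEPS.indexOf(currentPercent);
  if (idx === -1) {
    throw new Error(`unknown canary step: ${currentPercent}%`);
  }
  if (idx === CANARY_STEPS.length - 1) {
    return { action: "done" };
  }
  return { action: "promote", trafficPercent: CANARY_STEPS[idx + 1] };
}
```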

Feature flags (e.g. LaunchDarkly, GrowthBook, or our own `FeatureFlag` model) gate risky changes.

## Backups

- Postgres: managed snapshots + WAL archiving for PITR.
- S3 / R2 versioning enabled; lifecycle rules for cost.
- Restore drills run monthly in `staging` and record the RPO and RTO actually achieved, not just the targets.
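
To make the drill measurement concrete, the sketch below derives achieved RPO and RTO from three timestamps recorded during a drill. The `RestoreDrill` shape and the `measureDrill` helper are assumptions for illustration, not an existing tool.

```typescript
interface RestoreDrill {
  lastRecoverablePoint: Date; // newest data present after the restore
  simulatedFailureAt: Date;   // moment the drill declares the outage
  serviceRestoredAt: Date;    // moment the restored database serves traffic
}

// Achieved RPO: how much data (in time) was lost.
// Achieved RTO: how long the service was unavailable.
function measureDrill(d: RestoreDrill): { rpoSeconds: number; rtoSeconds: number } {
  const failure = d.simulatedFailureAt.getTime();
  const rpoMs = failure - d.lastRecoverablePoint.getTime();
  const rtoMs = d.serviceRestoredAt.getTime() - failure;
  if (rpoMs < 0 || rtoMs < 0) {
    throw new Error("drill timestamps are out of order");
  }
  return { rpoSeconds: rpoMs / 1000, rtoSeconds: rtoMs / 1000 };
}
```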

## Secrets

- AWS Secrets Manager / GCP Secret Manager / Doppler.
- Pods receive secrets via projected volumes / sidecar; never via env files in images.
- Rotation policy: JWT secrets quarterly; database passwords semi-annually; vendor keys at vendor cadence.
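
The fixed rotation intervals can be checked mechanically. The sketch below treats quarterly as 3 calendar months and semi-annually as 6, which is an assumption; vendor keys are left out because their cadence is set by each vendor. The names `rotationDueDate` and `isRotationDue` are hypothetical.

```typescript
type SecretKind = "jwt" | "database";

// Rotation interval in calendar months, per the policy above.
const ROTATION_MONTHS: Record<SecretKind, number> = { jwt: 3, database: 6 };

// Date on which a secret last rotated at `lastRotated` becomes due again.
// Note: Date month arithmetic rolls over for days that do not exist in the
// target month (for example the 31st), so due dates can shift by a day or two.
function rotationDueDate(kind: SecretKind, lastRotated: Date): Date {
  const due = new Date(lastRotated.getTime());
  due.setUTCMonth(due.getUTCMonth() + ROTATION_MONTHS[kind]);
  return due;
}

function isRotationDue(kind: SecretKind, lastRotated: Date, now: Date): boolean {
  return now.getTime() >= rotationDueDate(kind, lastRotated).getTime();
}
```

A scheduled job could run this check daily against secret metadata and open a ticket or alert for anything that is due.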

## Observability stack

See `20-observability-and-slos.md`.

## Cost & autoscaling

- Horizontal pod autoscaling on CPU + custom request/sec metric.
- Aggregated cost budgets per env, alerting at 70/85/95% of monthly cap.
- See `25-finops.md`.
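
The 70/85/95% budget alerting can be sketched as a function that returns the highest threshold a given spend has crossed. The name `budgetAlertLevel` and the choice to return only the highest level are assumptions for illustration.

```typescript
// Alert thresholds as percentages of the monthly cap, in ascending order.
const ALERT_THRESHOLDS: readonly number[] = [70, 85, 95];

// Returns the highest threshold crossed by `spend`, or null if none.
// The comparison is done as spend * 100 >= threshold * cap to avoid
// floating-point surprises from dividing first.
function budgetAlertLevel(spend: number, monthlyCap: number): number | null {
  if (monthlyCap <= 0) {
    throw new Error("monthlyCap must be positive");
  }
  let level: number | null = null;
  for (const threshold of ALERT_THRESHOLDS) {
    if (spend * 100 >= threshold * monthlyCap) {
      level = threshold;
    }
  }
  return level;
}
```

Running this per environment against month-to-date spend gives one alert level per env, matching the per-env budgets described above.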

## Documented assumptions

- One primary cloud provider in P1; multi-cloud is a P4 conversation only if we have a specific reason.
- Cloudflare WAF + DNS is the entry point; we always retain the option to fail over to another DNS provider.
- Mobile builds use EAS for OTA updates and store submission.
