# 20 — Observability & SLOs

We cannot operate what we cannot see.

## Three pillars

| Pillar | Tool class | What we capture |
|---|---|---|
| Logs | Structured JSON via Pino → log aggregator | One line per request; no PII; `traceId`, `userId` (where authorized), route, status, latency |
| Metrics | OpenTelemetry → Prometheus / Datadog | Request rate, error rate, latency histograms, business counters |
| Traces | OpenTelemetry | Cross-service spans with `traceId` linking client → API → DB → provider |

Errors go to Sentry (or equivalent) on all four apps; release tags identify deploys.
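
As a concrete sketch of the metrics pillar: the OpenTelemetry API calls below are real, but the meter name, metric names, and attributes are illustrative assumptions, not our actual instrumentation.

```ts
import { metrics } from "@opentelemetry/api";

// Meter and metric names here are illustrative assumptions.
const meter = metrics.getMeter("navi-api");

// Latency histogram feeding the p95 SLIs defined later in this section.
const requestDuration = meter.createHistogram("http.server.duration", {
  unit: "ms",
  description: "End-to-end request latency",
});

// A business counter of the kind mentioned in the table.
const bookingsCreated = meter.createCounter("bookings.created");

export function recordRequest(route: string, status: number, latencyMs: number) {
  requestDuration.record(latencyMs, { route, status: String(status) });
  if (route === "/v1/bookings" && status === 201) bookingsCreated.add(1);
}
```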

## Log policy

- Levels: `trace`, `debug`, `info`, `warn`, `error`, `fatal`.
- All logs: `traceId`, `service`, `version`, `env`.
- No PII in logs; the redaction list lives in the logger config (sketched after this list).
- Audit logs (in DB) are not the same as application logs.
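
A minimal sketch of that logger config; the redaction paths, service name, and field values are illustrative assumptions, not the actual list.

```ts
import pino from "pino";

// Illustrative config: redaction paths and base fields are assumptions.
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  // Redaction list: anything matching these paths is censored before emit.
  redact: {
    paths: ["req.headers.authorization", "user.email", "user.phone"],
    censor: "[REDACTED]",
  },
  // Fields stamped onto every line, per the policy above.
  base: {
    service: "navi-api",
    version: process.env.APP_VERSION,
    env: process.env.NODE_ENV,
  },
});

// One line per request, carrying the trace id.
logger.info(
  { traceId: "abc-123", route: "/v1/listings", status: 200, latencyMs: 42 },
  "request completed"
);
```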

## Health checks

- `GET /v1/health` — liveness.
- `GET /v1/ready` — readiness (DB + Redis reachable).
- Mobile and web read these endpoints to render the service-status UI when the platform is degraded.
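
A sketch of both endpoints, assuming an Express-style app and hypothetical `ping()` methods on the DB and Redis clients.

```ts
import express from "express";

// Hypothetical clients; the real DB/Redis handles come from the app's wiring.
interface Pingable {
  ping(): Promise<void>;
}

export function registerHealthRoutes(app: express.Express, db: Pingable, redis: Pingable) {
  // Liveness: the process is up and able to serve; no dependency checks.
  app.get("/v1/health", (_req, res) => res.status(200).json({ status: "ok" }));

  // Readiness: fail fast if a hard dependency is unreachable so the
  // load balancer stops routing traffic here.
  app.get("/v1/ready", async (_req, res) => {
    try {
      await Promise.all([db.ping(), redis.ping()]);
      res.status(200).json({ status: "ready" });
    } catch {
      res.status(503).json({ status: "unavailable" });
    }
  });
}
```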

## Service Level Objectives (P1)

| SLI | SLO | Window |
|---|---|---|
| API availability (5xx rate < 1%) | 99.9% | 30 days |
| API p95 latency (read paths) | ≤ 400 ms | 30 days |
| API p95 latency (write paths) | ≤ 800 ms | 30 days |
| Mobile crash-free sessions | ≥ 99.6% | 30 days |
| Booking checkout success rate | ≥ 95% | 30 days |
| Webhook delivery success | ≥ 99% | 30 days |
| Refund processing within SLA | ≥ 95% | 30 days |

Each SLO has an **error budget**: when more than 50% of a budget is consumed within its window, non-essential changes are paused.
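
As a worked example of the budget arithmetic (the numbers follow from the 99.9% availability SLO above; the function itself is illustrative):

```ts
// Error budget for an availability SLO, e.g. 99.9% over 30 days.
const sloTarget = 0.999;
const windowMinutes = 30 * 24 * 60; // 43,200 min
const budgetMinutes = (1 - sloTarget) * windowMinutes; // ≈ 43.2 min of "bad" time

// Consumption: fraction of the budget already spent in the window.
function budgetConsumed(badMinutes: number): number {
  return badMinutes / budgetMinutes;
}

// Example: ~22 min of downtime consumes ~51% of the budget,
// which trips the "pause non-essential changes" rule above.
console.log(budgetConsumed(22).toFixed(2)); // "0.51"
```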

## Alerting

- **Page** (wakes on-call): fast error-budget burn (>2× for 1 h), p95 > 2× SLO for 10 min, payment webhook failures > 5 in 5 min, DB connection saturation > 90%.
- **Ticket** (file an issue): slow error-budget burn (>1.5× for 6 h), unusual 4xx spikes, a newly published vulnerability in a dependency.
- **Notify** (Slack channel only): info-level deploys, scheduled maintenance.

Alert rule changes go through code review like everything else.
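
For reference, the burn-rate figures in the thresholds above compare the observed error rate against the pace that would exactly exhaust the budget by the end of the window; a sketch with illustrative numbers:

```ts
// Burn rate = observed error rate / error rate that exactly meets the SLO.
// 1.0 means the budget is being spent at precisely the allowed pace.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  return observedErrorRate / (1 - sloTarget);
}

// Against the 99.9% availability SLO: a sustained 0.3% 5xx rate burns
// the budget at 3× the allowed pace — past the >2×-for-1 h paging bar.
console.log(burnRate(0.003, 0.999).toFixed(1)); // "3.0"
```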

## Dashboards

- **Customer-facing** (status.navi.ae): green/yellow/red per surface; sourced from synthetic checks.
- **Engineering on-call**: golden signals per service (rate, errors, duration, saturation).
- **Business**: GMV, bookings, activations, partner KPIs (separately, not on the engineering dash).

## Synthetic checks

- Every 60 s: signup flow (email + OTP), login, GET listings, GET emergency.
- Every 5 min: end-to-end booking with mock-payment provider.
- Every 15 min: webhook simulation.

Failures fire pages on the on-call rotation.
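
A minimal probe sketch for the read-path checks in the list above; the base URL, endpoint paths, and synthetic-marker header are all assumptions.

```ts
// Base URL, paths, and the synthetic-marker header are assumptions.
const BASE = "https://api.navi.ae";

async function probe(name: string, path: string): Promise<void> {
  const started = Date.now();
  const res = await fetch(`${BASE}${path}`, { headers: { "x-synthetic": "1" } });
  const latencyMs = Date.now() - started;
  if (!res.ok) {
    // In production a failure here fires the pager; the check runner owns that.
    throw new Error(`${name} failed: HTTP ${res.status} after ${latencyMs} ms`);
  }
}

// The 60 s read-path checks from the list above.
async function runReadChecks() {
  await Promise.all([probe("listings", "/v1/listings"), probe("emergency", "/v1/emergency")]);
}

runReadChecks().catch((err) => {
  console.error(err);
  process.exit(1);
});
```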

## Tracing

- Every API request gets a `traceId` (header `x-request-id`); the response carries it back.
- Mobile and web propagate it on all calls.
- Provider calls (payment, OCR, AI) carry the parent span.
- Sample rate: 100% of errors, 10% of successes (tunable).
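
A sketch of the id plumbing, assuming Express-style middleware; the provider URL is hypothetical.

```ts
import { randomUUID } from "node:crypto";
import type { NextFunction, Request, Response } from "express";

export function traceId(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's id when present so client → API → DB share one trace.
  const id = req.header("x-request-id") ?? randomUUID();
  res.locals.traceId = id;
  res.setHeader("x-request-id", id); // the response carries it back
  next();
}

// Outbound provider call carrying the parent trace id.
async function callProvider(traceId: string) {
  return fetch("https://provider.example/charge", { // hypothetical URL
    method: "POST",
    headers: { "x-request-id": traceId },
  });
}
```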

## On-call

- 24/7 rotation across SREs/senior engineers; primary + secondary per shift.
- A page must be acknowledged within 5 minutes, mitigation begun within 30, and a status-page update posted within 15.
- Postmortems for every page-worthy incident; blameless template at `docs/templates/incident-postmortem-template.md`.

## Documented assumptions

- We pick one observability stack (e.g. Datadog or Grafana + Loki + Tempo + Prometheus) by the end of Phase 1; the API code uses OpenTelemetry abstractions so the backend can be swapped without rewrites.
- Status page SaaS (statuspage.io / instatus / better-stack) is acceptable; rolling our own is not.
