# 23 — Data Platform

## Goals

- Answer product questions reliably (activation, conversion, retention, partner performance).
- Power the trip-planner AI without leaking PII.
- Stay compliant with PDPL/GDPR-equivalent obligations.

## Layers

```
Apps → Event collector → Object storage (raw) → Warehouse (curated) → BI / models
                                  ↓
                            Audit logs (DB)
```

- **Event collector:** receives server-side events from the API. Client events arrive via the analytics provider's SDK and are forwarded into the same warehouse.
- **Object storage:** S3 / R2 holds raw, immutable JSONL — the source of truth for replay.
- **Warehouse:** BigQuery / Snowflake / Databricks. Curated tables built via dbt. Schema-on-read for raw, schema-on-write for curated.
- **BI:** Metabase / Looker for internal use. Partner-facing analytics use the API only.
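To illustrate the raw layer, here is a minimal sketch of writing and replaying JSONL (one JSON object per line). The `RawEvent` envelope and its field names are assumptions for illustration; this document does not define the envelope.

```typescript
// Hypothetical envelope for one raw event; field names are assumptions.
interface RawEvent {
  event_id: string;
  name: string;
  occurred_at: string; // ISO 8601 timestamp
  properties: Record<string, unknown>;
}

// Serialize events as JSONL: one JSON object per line, newline-terminated.
function toJsonl(events: RawEvent[]): string {
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}

// Parse a JSONL payload back into events, skipping blank lines.
function fromJsonl(text: string): RawEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RawEvent);
}
```

Because the raw files are immutable, a replay is a matter of re-reading them through something like `fromJsonl` and rebuilding the curated tables.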

## Event taxonomy (initial)

| Event | When | Properties |
|---|---|---|
| `app_open` | Cold or warm start | `os`, `app_version`, `locale` |
| `signup_started` | Signup form opened | `source` |
| `signup_completed` | OTP verified | `time_to_signup_ms` |
| `listing_view` | Listing detail opened | `listing_id`, `kind`, `city_id` |
| `saved_added` / `saved_removed` | Heart toggled | `ref_type`, `ref_id` |
| `booking_started` | Checkout begin | `listing_id` |
| `booking_confirmed` | Booking row created | `booking_id`, `kind`, `gmv_minor`, `currency` |
| `payment_succeeded` / `_failed` | Provider event | `payment_id`, `provider_key`, `amount_minor` |
| `trip_generation_started` / `_completed` | AI run | `provider_key`, `latency_ms` |
| `support_ticket_opened` | Ticket created | `subject_class` |

Event schemas are tracked in code (`packages/types/src/events.ts`, to be added in Phase 2) and validated server-side before write.
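As a rough sketch of what server-side validation could look like, the snippet below checks an event against a required-property map. The property names come from the taxonomy table; the property types, the `EVENT_SCHEMAS` map, and `validateEvent` are assumptions, since `events.ts` does not exist yet.

```typescript
// Hypothetical sketch; the real events.ts is not written yet (Phase 2).
type PropertyType = "string" | "number";
type EventSchema = Record<string, PropertyType>;

// Property names follow the taxonomy table; the types are assumptions.
const EVENT_SCHEMAS = new Map<string, EventSchema>([
  ["signup_started", { source: "string" }],
  ["listing_view", { listing_id: "string", kind: "string", city_id: "string" }],
  [
    "booking_confirmed",
    { booking_id: "string", kind: "string", gmv_minor: "number", currency: "string" },
  ],
]);

interface ValidationResult {
  ok: boolean;
  errors: string[];
}

function validateEvent(
  name: string,
  properties: Record<string, unknown>,
): ValidationResult {
  const schema = EVENT_SCHEMAS.get(name);
  if (schema === undefined) {
    return { ok: false, errors: [`unknown event: ${name}`] };
  }
  const errors: string[] = [];
  // Every declared property must be present with the expected type.
  for (const [key, expected] of Object.entries(schema)) {
    const value = properties[key];
    if (value === undefined) {
      errors.push(`missing property: ${key}`);
    } else if (typeof value !== expected) {
      errors.push(`property ${key} should be ${expected}, got ${typeof value}`);
    }
  }
  // Undeclared properties are rejected so stray fields cannot slip through.
  for (const key of Object.keys(properties)) {
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      errors.push(`unexpected property: ${key}`);
    }
  }
  return { ok: errors.length === 0, errors };
}
```

Rejecting undeclared properties is a design choice in this sketch, not a stated requirement; it would make it harder for a stray `email` field to reach the warehouse unnoticed.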

## PII in events

- No raw email or phone in events.
- `user_id` (CUID) is fine within the warehouse; do not export to third parties.
- Salted hashes of identifiers are acceptable for cross-source joins.
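One way to read "salted hash" is a keyed hash with a secret salt shared by all sources that need to join. The sketch below uses HMAC-SHA256 from Node's `crypto` module; the `pseudonymize` name, the normalization step, and the choice of HMAC are assumptions rather than decisions recorded in this document.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical helper: derives a stable join key from an identifier.
// The salt must be kept secret and must be the same for every source that joins.
function pseudonymize(identifier: string, salt: string): string {
  // Normalize so "User@Example.com " and "user@example.com" map to the same key
  // (an assumption; adjust per identifier type).
  const normalized = identifier.trim().toLowerCase();
  return createHmac("sha256", salt).update(normalized).digest("hex");
}
```

The same identifier and salt always produce the same 64-character hex key, so two sources can join on it without either one storing the raw email or phone.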

## Data ownership

- **Engineering** owns event delivery, schemas, and storage SLAs.
- **Product** owns what we measure and how we name it.
- **Finance** owns money tables (payments, payouts, refunds).
- **Compliance** owns retention, deletion, and access controls.

## Retention

- Raw events: 24 months in cold storage, 90 days hot.
- Curated tables: rolling 36 months.
- Audit logs: minimum 7 years (write-once bucket).
- Personal data deletion requests propagate to raw + curated within 30 days.
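The raw-event and deletion rules above can be sketched as date arithmetic. This assumes the 90 hot days are the first part of the 24-month window rather than additional to it, and the function names are illustrative only.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Calendar-month addition in UTC. Note that Date rolls over on month-end
// overflow (Jan 31 + 1 month lands in early March).
function addMonths(date: Date, months: number): Date {
  const result = new Date(date.getTime());
  result.setUTCMonth(result.getUTCMonth() + months);
  return result;
}

type RawTier = "hot" | "cold" | "expired";

// Assumption: hot for the first 90 days, then cold until 24 months after the event.
function rawEventTier(eventTime: Date, now: Date): RawTier {
  if (now.getTime() >= addMonths(eventTime, 24).getTime()) return "expired";
  if (now.getTime() - eventTime.getTime() > 90 * DAY_MS) return "cold";
  return "hot";
}

// Latest date by which a deletion request must have reached raw + curated data.
function deletionDeadline(requestedAt: Date): Date {
  return new Date(requestedAt.getTime() + 30 * DAY_MS);
}
```

A scheduled job could use `rawEventTier` to decide which objects to move or purge, and `deletionDeadline` to flag deletion requests that are close to breaching the 30-day window.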

## Access control

- Warehouse access is role-based; PII tables require an additional approval.
- Production services read from the warehouse using signed credentials; no raw secrets in notebooks.
- Query logs reviewed quarterly.

## Quality

- dbt tests on every curated table (uniqueness, not-null, row counts).
- Reconciliation: the warehouse `bookings_confirmed` count matches the count of DB rows with `Booking.status='CONFIRMED'` to within 1%.
- Anomaly alerts on key counters.
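The reconciliation rule can be expressed as a relative-difference check. This sketch assumes the DB count is the reference value, which the rule itself does not specify; `reconcile` is an illustrative name.

```typescript
interface ReconciliationResult {
  relativeDiff: number;
  withinTolerance: boolean;
}

// Compares the warehouse count against the DB count.
// Assumption: the difference is measured relative to the DB count.
function reconcile(
  warehouseCount: number,
  dbCount: number,
  tolerance: number = 0.01,
): ReconciliationResult {
  const diff = Math.abs(warehouseCount - dbCount);
  let relativeDiff: number;
  if (dbCount === 0) {
    // Avoid dividing by zero: equal-and-empty is a match, anything else is not.
    relativeDiff = diff === 0 ? 0 : Infinity;
  } else {
    relativeDiff = diff / dbCount;
  }
  return { relativeDiff, withinTolerance: relativeDiff <= tolerance };
}
```

For example, 995 warehouse rows against 1,000 DB rows is a 0.5% gap and passes, while 980 against 1,000 is a 2% gap and fails.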

## Trip-planner AI data

- Inputs to `AiProvider.generateTrip` are minimized: city ids, dates, party type, interests, budget, pace.
- Outputs (steps, prices) are stored in `Trip` / `TripStep` rows, not in third-party logs.
- Model training data, if used, is fully de-identified.
- The output shape is fixed by the provider contract rather than by any one model, so we can replace the provider without changing schemas.
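A sketch of how the provider contract and input minimization might look. Only `AiProvider.generateTrip` and the list of input fields come from this document; every type, field type, and helper below (`TripRequest`, `TripDraft`, `minimizeTripInput`, `StubProvider`) is a hypothetical illustration.

```typescript
// Hypothetical request shape: only the minimized fields listed above.
interface TripRequest {
  cityIds: string[];
  startDate: string; // ISO 8601 date
  endDate: string;
  partyType: string;
  interests: string[];
  budget: string;
  pace: string;
}

// Hypothetical output shape, later persisted as Trip / TripStep rows.
interface TripStepDraft {
  title: string;
  priceMinor: number;
  currency: string;
}

interface TripDraft {
  steps: TripStepDraft[];
}

// Any provider that satisfies this contract can be swapped in.
interface AiProvider {
  readonly providerKey: string;
  generateTrip(request: TripRequest): Promise<TripDraft>;
}

// Copies only the allowed fields, so extra data on the caller's object
// (user ids, emails) never reaches the provider.
function minimizeTripInput(input: TripRequest): TripRequest {
  return {
    cityIds: [...input.cityIds],
    startDate: input.startDate,
    endDate: input.endDate,
    partyType: input.partyType,
    interests: [...input.interests],
    budget: input.budget,
    pace: input.pace,
  };
}

// Placeholder provider for tests; returns one free step per city.
class StubProvider implements AiProvider {
  readonly providerKey = "stub";

  async generateTrip(request: TripRequest): Promise<TripDraft> {
    return {
      steps: request.cityIds.map((cityId) => ({
        title: `Visit ${cityId}`,
        priceMinor: 0,
        currency: "USD",
      })),
    };
  }
}
```

Callers would pass whatever session object they hold through `minimizeTripInput` before calling `generateTrip`, so the explicit field copy acts as the allowlist.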

## Documented assumptions

- We start with one warehouse and one BI tool; we don't multi-source until P4.
- Real-time analytics is not a launch requirement; daily refresh is fine for P1.
- The "metric definition" lives in dbt models, not in dashboards. A dashboard cannot redefine GMV.
