Gateway Runbook

Monitoring

In production, prioritize observability before you tweak configs. For most incidents, you want quick answers to:

Is the Gateway process running and reachable (HTTP + WS)?
Is auth working (token, password, or Tailscale headers)?
Are provider calls failing (rate limits, auth, model-id mismatch)?
Are tool runs blocked (sandbox/tool policy/exec approvals)?

Quick checks (operator workflow)

Start with Doctor for a structured probe.
Use Health to verify the Gateway is reachable and responding.
If you suspect a “works locally but not remotely” situation, verify your bind/auth model in Remote access and Security.
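For the "works locally but not remotely" case, a plain TCP connect test against each candidate address can narrow things down before you dig into bind or auth settings. The following is a minimal sketch, not part of the Gateway tooling: the function names, the example addresses, and the port are hypothetical placeholders, and a successful TCP connect only shows that something is listening, not that the HTTP or WS layer is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_interfaces(candidates: dict[str, str], port: int) -> dict[str, bool]:
    """Probe the same port on several addresses, keyed by a human-readable label."""
    return {label: is_port_open(host, port) for label, host in candidates.items()}


if __name__ == "__main__":
    # Hypothetical values: substitute the addresses and port from your own config.
    GATEWAY_PORT = 8080
    results = probe_interfaces(
        {"loopback": "127.0.0.1", "lan": "192.168.1.10"},
        GATEWAY_PORT,
    )
    for label, reachable in results.items():
        print(f"{label}: {'open' if reachable else 'closed or unreachable'}")
```

If the loopback probe succeeds while the remote address fails, the bind address or a firewall is the more likely culprit than auth.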

Logs

When diagnosing failures, logs usually answer “what happened” faster than config diffs:

Logging basics + where to look: Logging
Common “it looks stuck” symptoms: Troubleshooting

What to monitor (minimum)

Reachability: the HTTP port is open on the expected interface (loopback vs tailnet vs LAN).
Auth failures: spikes in 401/403 responses typically mean a bad token or password, or missing Tailscale headers.
Provider errors: rate limits and auth failures should be visible in logs; confirm the active model/provider config.
Background tasks: if you rely on heartbeat/cron/webhooks, ensure those triggers are firing.
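One way to turn the auth-failure item above into a check is to compare the share of 401/403 responses in a recent window against a longer baseline. This is a rough sketch under stated assumptions: it takes HTTP status codes that you have already extracted from your logs (no particular log format is assumed), and the function names and thresholds (`min_rate`, `factor`) are illustrative choices, not values from the Gateway documentation.

```python
AUTH_FAILURE_CODES = {401, 403}


def auth_failure_rate(statuses: list[int]) -> float:
    """Fraction of responses that are 401 or 403; 0.0 for an empty window."""
    if not statuses:
        return 0.0
    failures = sum(1 for status in statuses if status in AUTH_FAILURE_CODES)
    return failures / len(statuses)


def is_auth_spike(
    recent: list[int],
    baseline: list[int],
    min_rate: float = 0.2,   # assumed floor: ignore spikes below 20% of requests
    factor: float = 3.0,     # assumed multiplier over the baseline rate
) -> bool:
    """Flag a spike when the recent rate is high in absolute and relative terms."""
    recent_rate = auth_failure_rate(recent)
    baseline_rate = auth_failure_rate(baseline)
    return recent_rate >= min_rate and recent_rate >= factor * baseline_rate


if __name__ == "__main__":
    baseline = [200] * 19 + [401]          # 5% auth failures
    recent = [200, 401, 403, 401, 200]     # 60% auth failures
    print(is_auth_spike(recent, baseline))
```

Requiring both an absolute floor and a relative jump keeps a single stray 401 on a quiet system from raising an alert.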

Related:

Heartbeat
Cron jobs
