Monitoring
In production, prioritize observability before you tweak configs. For most incidents, you want quick answers to:
- Is the Gateway process running and reachable (HTTP + WS)?
- Is auth working (token/password/Tailscale headers)?
- Are provider calls failing (rate limits, auth, model-id mismatch)?
- Are tool runs blocked (sandbox/tool policy/exec approvals)?
Quick checks (operator workflow)
- Start with Doctor for a structured probe.
- Use Health to verify the Gateway is reachable and responding.
- If you suspect a “works locally but not remotely” situation, verify your bind/auth model in Remote access and Security.
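A "works locally but not remotely" symptom often comes down to which interface the port is bound on. The sketch below is a minimal, generic TCP probe that compares the same port across several addresses. It is not part of the Gateway tooling and does not replace Doctor or Health. The function names, addresses, and port are hypothetical placeholders.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def probe_interfaces(hosts: list[str], port: int) -> dict[str, bool]:
    """Probe the same port on several addresses (e.g. loopback vs LAN)."""
    return {host: is_port_open(host, port) for host in hosts}
```

For example, `probe_interfaces(["127.0.0.1", "<tailnet-or-LAN-address>"], <gateway-port>)` returning `True` for loopback and `False` for the other address suggests the process is bound to loopback only. A successful TCP connect only shows that something is listening. It says nothing about auth or WebSocket health.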
Logs
When diagnosing failures, logs usually answer “what happened” faster than config diffs:
- Logging basics + where to look: Logging
- Common “it looks stuck” symptoms: Troubleshooting
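To get a rough picture of which failure class dominates, log lines can be bucketed by keyword. The sketch below assumes plain-text log lines. The patterns are illustrative guesses, not the Gateway's actual log format, so adjust them to whatever your logs contain.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical patterns; the real log wording may differ.
PATTERNS = {
    "auth": re.compile(r"\b(401|403)\b|unauthorized|forbidden", re.IGNORECASE),
    "rate_limit": re.compile(r"\b429\b|rate.?limit", re.IGNORECASE),
    "model": re.compile(r"unknown model|model not found", re.IGNORECASE),
    "tool_policy": re.compile(r"sandbox|approval|denied by policy", re.IGNORECASE),
}


def classify(line: str) -> Optional[str]:
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each category; unmatched lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

Calling `summarize(open(path))` on a log file (with `path` being wherever your logs live) yields per-category counts. Categories are checked in dictionary order and the first match wins, so a line that mentions both a 401 and a rate limit is counted as `auth`.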
What to monitor (minimum)
- Reachability: the HTTP port is open on the expected interface (loopback vs tailnet vs LAN).
- Auth failures: spikes in 401/403 responses typically mean a bad token/password or missing Tailscale headers.
- Provider errors: rate limits and auth failures should be visible in logs; confirm the active model/provider config.
- Background tasks: if you rely on heartbeat/cron/webhooks, ensure those triggers are firing.
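For the background-task item, one simple way to decide whether a trigger is still firing is to compare its last-seen time with its expected interval. The sketch below assumes you already record a last-fired timestamp yourself, for example from logs or your own instrumentation. The function name and the `grace` multiplier are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_fired: datetime, interval: timedelta,
             now: Optional[datetime] = None, grace: float = 1.5) -> bool:
    """Return True if more than interval * grace has passed since last_fired."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_fired) > interval * grace
```

With a 10-minute heartbeat and the default `grace` of 1.5, a trigger counts as stale once more than 15 minutes have passed without a firing. The grace factor leaves room for normal scheduling jitter, so a slightly late run does not raise an alert. Pass timezone-aware datetimes for `last_fired`, since the default `now` is in UTC.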
Related: