Monitoring

In production, prioritize observability before you tweak configs. For most incidents, you want quick answers to:

  • Is the Gateway process running and reachable (HTTP + WS)?
  • Is auth working (token/password/Tailscale headers)?
  • Are provider calls failing (rate limits, auth, model-id mismatch)?
  • Are tool runs blocked (sandbox/tool policy/exec approvals)?

Quick checks (operator workflow)

  • Start with Doctor for a structured probe.
  • Use Health to verify the Gateway is reachable and responding.
  • If you suspect a “works locally but not remotely” situation, verify your bind/auth model in Remote access and Security.
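
For the "works locally but not remotely" case, a plain TCP probe against each address you expect clients to use can narrow things down before you open any config. The following is a minimal sketch, not part of the Gateway tooling: the port and the addresses are placeholders, and the function names are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False


def probe_interfaces(port: int, hosts: dict[str, str]) -> dict[str, bool]:
    """Probe the same port on several addresses, keyed by a readable label."""
    return {label: tcp_reachable(addr, port) for label, addr in hosts.items()}


if __name__ == "__main__":
    GATEWAY_PORT = 8080  # placeholder: use the port your Gateway listens on
    results = probe_interfaces(
        GATEWAY_PORT,
        {
            "loopback": "127.0.0.1",
            "tailnet": "100.64.0.1",  # placeholder tailnet address
            "lan": "192.168.1.10",    # placeholder LAN address
        },
    )
    for label, ok in results.items():
        print(f"{label}: {'reachable' if ok else 'unreachable'}")
```

If loopback is reachable but the tailnet or LAN address is not, the bind setting is the first thing to check. A successful TCP connect only shows that something is listening; it says nothing about auth or about HTTP/WS behaviour, so Doctor and Health remain the primary checks.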

Logs

When diagnosing failures, logs usually answer “what happened” faster than config diffs.
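
One way to get that answer quickly is to bucket log lines by the failure classes listed at the top of this page. The sketch below assumes free-text log lines; the patterns are illustrative guesses, not the Gateway's actual log format, so adjust them to the wording your logs use.

```python
import re
from collections import Counter

# Hypothetical patterns, checked in order; the first match wins.
CATEGORIES = {
    "auth": re.compile(r"\b(401|403|unauthorized|forbidden)\b", re.I),
    "rate_limit": re.compile(r"\b429\b|rate.?limit", re.I),
    "model": re.compile(r"model.{0,20}(not found|unknown|invalid)", re.I),
    "tool_blocked": re.compile(
        r"(sandbox|policy|approval).{0,40}(denied|blocked|required)", re.I
    ),
}


def classify(line: str) -> str:
    """Return the first matching category name, or "other"."""
    for name, pattern in CATEGORIES.items():
        if pattern.search(line):
            return name
    return "other"


def summarize(lines) -> Counter:
    """Count log lines per category."""
    return Counter(classify(line) for line in lines)
```

Running `summarize` over the last few minutes of logs shows which class dominates, which tells you whether to look at auth, the provider config, or tool policy first.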

What to monitor (minimum)

  • Reachability: the HTTP port is open on the expected interface (loopback vs tailnet vs LAN).
  • Auth failures: spikes in 401/403 typically mean a bad token/password or missing Tailscale headers.
  • Provider errors: rate limits and auth failures should be visible in logs; confirm the active model/provider config.
  • Background tasks: if you rely on heartbeat/cron/webhooks, ensure those triggers are firing.
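
The auth-failure check above can be approximated with a per-minute count of 401/403 responses. This is a minimal sketch under stated assumptions: it takes `(epoch_seconds, http_status)` pairs that you have already extracted from your own logs or metrics, and the threshold is an arbitrary starting point, not a recommended value.

```python
from collections import Counter

AUTH_FAILURE_STATUSES = {401, 403}


def auth_failures_per_minute(events) -> Counter:
    """Count 401/403 responses per minute bucket.

    events: iterable of (epoch_seconds, http_status) pairs.
    """
    buckets = Counter()
    for ts, status in events:
        if status in AUTH_FAILURE_STATUSES:
            buckets[int(ts) // 60] += 1
    return buckets


def spiking_minutes(events, threshold: int = 5) -> list[int]:
    """Return the start time (epoch seconds) of each minute at or above threshold."""
    counts = auth_failures_per_minute(events)
    return sorted(minute * 60 for minute, n in counts.items() if n >= threshold)
```

A single 401 is usually a client with a stale credential; a sustained run of flagged minutes points at a rotated token, a changed password, or a proxy that stopped forwarding the expected headers.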
