Monitoring and alerting
What to monitor and how to debug issues quickly.
Use monitoring to detect integration issues early and shorten support cycles when incidents happen.
What to log
At minimum, log these fields for proxied API requests:
- endpoint path and method,
- HTTP status,
- correlation ID,
- sanitized user/session context,
- auth mode classification (without secrets).
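As a minimal sketch of the field list above, a log entry for one proxied request could be assembled as follows. The function name, the field names, and the `SAFE_CONTEXT_KEYS` allow-list are assumptions made for illustration, not part of any documented API.

```python
import json
import logging

logger = logging.getLogger("proxy.requests")

# Hypothetical allow-list: only these context keys are considered safe to log.
SAFE_CONTEXT_KEYS = frozenset({"user_id", "role"})


def build_log_record(method, path, status, correlation_id, user_context, auth_mode):
    """Return a dict holding the minimum fields for one proxied request."""
    return {
        "method": method,
        "path": path,
        "status": status,
        "correlation_id": correlation_id,
        # Drop anything not on the allow-list (tokens, cookies, and so on).
        "user": {k: v for k, v in user_context.items() if k in SAFE_CONTEXT_KEYS},
        # A label such as "session" or "api_key", never the credential itself.
        "auth_mode": auth_mode,
    }


def log_request(**fields):
    """Emit the record as one JSON line so it stays machine-searchable."""
    logger.info(json.dumps(build_log_record(**fields), sort_keys=True))
```

Filtering with an allow-list rather than a deny-list means a newly added sensitive field is excluded by default.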
Correlation IDs are the key link between client-side reports and server-side traces. When possible, surface the correlation ID in user-visible error responses so support can trace the exact failing request.
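One way to surface the correlation ID is to include it in the error body returned to the client. The sketch below assumes a simple JSON envelope; the envelope shape and both function names are hypothetical and should be adapted to whatever error format the integration already uses.

```python
import uuid


def new_correlation_id():
    """Generate an opaque ID to attach to a request and all of its log lines."""
    return uuid.uuid4().hex


def error_envelope(status, message, correlation_id):
    """Build a user-visible error body that carries the correlation ID."""
    return {
        "error": {
            "status": status,
            "message": message,
            # Support can search server-side logs for this exact value.
            "correlation_id": correlation_id,
        }
    }
```

A user who reports the ID shown in the error message gives support a direct lookup key for the failing request.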
Suggested alerting baselines
Start with simple alerts that match common failure modes:
- 403 spikes: can indicate nonce/CSRF or session integration issues.
- 429 spikes: can indicate retry loops, abusive traffic, or limits that are too strict.
- Request-size rejections: monitor payload-limit failures to catch client regressions or abuse.
- 5xx rate increases: flag upstream instability or degraded dependencies.
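The failure modes above can be tracked by bucketing response codes before alerting on them. This is a sketch under one stated assumption: request-size rejections are taken to be returned as HTTP 413, which the text above does not confirm. The category names are illustrative.

```python
from collections import Counter


def classify_status(status):
    """Map an HTTP status to an alert category, or None if it is not tracked."""
    if status == 403:
        return "forbidden"
    if status == 429:
        return "rate_limited"
    if status == 413:  # Assumption: payload-limit failures surface as 413.
        return "payload_too_large"
    if 500 <= status <= 599:
        return "server_error"
    return None


def count_failure_modes(statuses):
    """Count tracked failure categories across a batch of response codes."""
    counts = Counter()
    for status in statuses:
        category = classify_status(status)
        if category is not None:
            counts[category] += 1
    return counts
```

Counting per category per time window gives the series that spike alerts can be built on.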
A practical starter policy:
- page immediately when health checks fail,
- page for sustained 5xx error-rate breaches,
- investigate when 403 or 429 rates suddenly jump to a multiple of their baseline.
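The starter policy can be expressed as a small decision function. Every threshold below (a 5% error rate, three consecutive windows, a 3x spike multiple) is a placeholder assumption, and `WindowStats` is a hypothetical structure; tune both to your own traffic.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts for one fixed-length observation window."""
    health_ok: bool
    total_requests: int
    server_errors: int  # 5xx responses
    forbidden: int      # 403 responses
    rate_limited: int   # 429 responses


def error_rate(window):
    """Fraction of requests in the window that ended in a 5xx."""
    if window.total_requests == 0:
        return 0.0
    return window.server_errors / window.total_requests


def evaluate(windows, baseline_403, baseline_429,
             error_rate_limit=0.05, sustained_windows=3, spike_multiple=3.0):
    """Return "page", "investigate", or "ok" for windows ordered oldest first."""
    latest = windows[-1]
    if not latest.health_ok:
        return "page"
    recent = windows[-sustained_windows:]
    # "Sustained" here means every one of the last N windows breached the limit.
    if len(recent) == sustained_windows and all(
        error_rate(w) > error_rate_limit for w in recent
    ):
        return "page"
    if (latest.forbidden > spike_multiple * baseline_403
            or latest.rate_limited > spike_multiple * baseline_429):
        return "investigate"
    return "ok"
```

Requiring several consecutive breaching windows keeps a single noisy minute from paging anyone, while health-check failures still page at once.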
Health checks and self-test usage
Use lightweight checks continuously and deeper self-tests during incident triage.
Recommended operations pattern:
- Watch /health for availability.
- Track secure proxy response code trends by endpoint.
- Run self-test checks when symptoms appear (or pre-release) to verify boundary configuration.
- Compare current self-test output against known-good runs from staging/production.
This helps separate environmental drift from application regressions.
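Comparing a self-test run against a known-good run can be as simple as a key-by-key diff. The sketch below assumes the self-test output can be loaded as a flat mapping of check name to result; the actual output format is not specified here, and the check names in the usage example are made up.

```python
MISSING = "<missing>"


def diff_self_test(current, known_good):
    """Return {check: (known_good_value, current_value)} for every mismatch."""
    diffs = {}
    for key in sorted(set(current) | set(known_good)):
        before = known_good.get(key, MISSING)
        after = current.get(key, MISSING)
        if before != after:
            diffs[key] = (before, after)
    return diffs


# Hypothetical check names, for illustration only.
known_good = {"nonce_check": "pass", "session_cookie": "pass", "proxy_version": "1.4"}
current = {"nonce_check": "fail", "session_cookie": "pass"}
changes = diff_self_test(current, known_good)
```

An empty result with symptoms still present points toward an application regression; a non-empty result points toward configuration or environment drift.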
Evidence capture for support
When escalating to support, gather a minimal but complete packet:
- failing endpoint and HTTP method,
- timestamp and environment,
- returned status and error envelope,
- correlation ID,
- whether the request came from browser UI or server-side workflow.
For repeating incidents, include a short sample of grouped failures (for example, top affected endpoints and counts) rather than isolated screenshots. This speeds root-cause analysis and reduces back-and-forth.
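A grouped sample like the one described above can be produced from request logs with a few lines. This sketch assumes each log record is a dict with `method`, `path`, and `status` keys; those field names are illustrative.

```python
from collections import Counter


def summarize_failures(records, top_n=5):
    """Group failing requests by (method, path, status), most frequent first."""
    counts = Counter((r["method"], r["path"], r["status"]) for r in records)
    return [
        {"method": method, "path": path, "status": status, "count": count}
        for (method, path, status), count in counts.most_common(top_n)
    ]


records = [
    {"method": "POST", "path": "/api/items", "status": 403},
    {"method": "POST", "path": "/api/items", "status": 403},
    {"method": "GET", "path": "/api/users", "status": 429},
]
summary = summarize_failures(records)
```

Attaching this summary, plus one correlation ID per group, usually tells support more than a set of individual screenshots.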