Health Check

Ready-made health endpoints that tell your load balancer and Kubernetes exactly when to send traffic to a pod — and when to stop.

What is it?

Orchestrators and load balancers constantly ask every instance of your app two different questions: "are you alive?" (should I restart you?) and "are you ready?" (should I send you traffic?). Think of an aircraft: "is the engine running" and "is it cleared for passengers" are separate checks with separate consequences: confusing them means either restarting a plane that just needed a minute, or boarding passengers onto one that can't fly.

A health check endpoint answers those questions over HTTP, with the response code doing the talking: 200 means "all good", 503 means "act on it". In Baldur this is the Health Check feature: a set of pre-built endpoints that answer with full awareness of what the self-healing layer knows: database connectivity, connection-pool state, and the health of the resilience subsystems themselves.

Why it matters

Most teams hand-roll a /health view, and it usually fails in one of two directions:

It checks too little. It returns 200 unconditionally, so the load balancer keeps routing traffic to a pod whose database connection is gone.
It checks too much. It runs expensive checks on every probe, so the health endpoint itself becomes a load source when a balancer polls it several times a second.

Baldur's Health Check removes both failure modes:

Correct routing decisions. Only genuine unhealthiness (a severed database connection) returns 503 and takes the pod out of rotation. A pod that is degraded-but-serving stays in.
Cheap under polling. The full health verdict is served from a precomputed cache, and an ultra-light ping endpoint answers without touching the database at all.
Debuggable incidents. The response is structured: a status verdict per component, the count of circuit breakers in play, and whether the self-healing automation is switched on, not just a bare "down".

How it works in Baldur

Baldur exposes five endpoints, each answering a different question (shown under the conventional /api/baldur/ prefix; the same checks are also available on Baldur's built-in admin console for deployments without a web framework):

Endpoint	Question it answers	Response behavior
`health/`	"What is the full picture?"	Overall status plus per-component detail. `200` for healthy/degraded, `503` for unhealthy
`health/live/`	"Is the process alive?"	Always `200` while the app runs — even during shutdown drain
`health/ready/`	"Can it serve traffic?"	`200` only when every configured database connection is usable; otherwise `503`
`health/pool/`	"How are the connection pools?"	`200` when healthy; `503` when degraded or erroring
`health/ping/`	"Fastest possible yes"	Always `200`, no database access — built for high-frequency load-balancer checks

The overall verdict on health/ moves through three observable statuses:

stateDiagram-v2
    [*] --> HEALTHY
    HEALTHY --> UNHEALTHY: database becomes unreachable
    UNHEALTHY --> HEALTHY: database connectivity restored
    UNHEALTHY --> DEGRADED: database restored while a subsystem still reports trouble
    HEALTHY --> DEGRADED: a subsystem reports trouble (database still fine)
    DEGRADED --> HEALTHY: the subsystem recovers
    DEGRADED --> UNHEALTHY: database becomes unreachable

What you observe	When it happens
`"status": "healthy"`, HTTP `200`	The database is reachable and the self-healing subsystems report no trouble
`"status": "degraded"`, HTTP `200` — the pod stays in rotation	A self-healing subsystem reports trouble while the database is still fine. Degraded means "keep serving, but look into it"
`"status": "unhealthy"`, HTTP `503` — the load balancer depools the pod	The database connection is unusable. This is the only verdict that takes the pod out of traffic
Readiness flips to `503`	Any configured database connection is down — or the pod is draining during a graceful shutdown, so new traffic stops
Liveness and ping keep answering `200` during shutdown	Draining is a normal lifecycle phase, not a failure — keeping liveness green prevents the orchestrator from killing the pod mid-drain

Four points worth understanding:

Degraded never depools. Only unhealthy maps to 503 on the main endpoint (two rare internal-failure statuses, error and unavailable, also map to 503). A degraded pod that can still serve correctly is deliberately kept in rotation — depooling healthy capacity because a background subsystem hiccupped would make an incident worse, not better.
The payload is a diagnosis, not a verdict. Beyond the status, health/ reports each component's state, how many circuit breakers Baldur is currently tracking, whether the self-healing automation is switched on, and a timestamp. The pool endpoint additionally carries the error message when its check fails.
Responses are cached on purpose. The full verdict comes from a precomputed cache so that aggressive probe polling stays cheap. Append ?nocache=true to force a fresh computation — the response then marks the cache as bypassed.
PRO self-monitoring enriches the verdict. When Baldur's PRO-tier self-monitoring (Meta-Watchdog) is active, its findings appear in the health payload and a struggling subsystem it detects is what moves the overall status to degraded.

Configuration

Health Check works out of the box: the admin-console checks start together with Baldur itself, the /api/baldur/ endpoints mount through your web framework's normal URL routing (see Getting Started), and there are no health-check variables in the operator-tunable allowlist, so there is nothing you need to set. The one runtime control is per-request: ?nocache=true on health/ to bypass the cache. The complete operator-tunable list lives in the environment variables reference.

Health Check

What is it?

Why it matters

How it works in Baldur

Configuration

See also