baldur.services — Control API & Metrics

The control-API request/response surface for runtime operations, and the metrics helpers. The metrics symbols resolve lazily and require the [prometheus] extra at runtime.
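
Because the metrics symbols resolve lazily, a missing extra may only surface when a symbol is first touched. The sketch below shows one way a caller might guard against that. It assumes the [prometheus] extra installs the `prometheus_client` package and that the helpers are importable from `baldur.services`; neither is confirmed by this page, and `load_metrics_helpers` is a hypothetical name.

```python
import importlib.util


def prometheus_available() -> bool:
    """Return True if the prometheus_client package can be located."""
    # Assumption: the [prometheus] extra provides `prometheus_client`.
    return importlib.util.find_spec("prometheus_client") is not None


def load_metrics_helpers():
    """Return (record_sla_breach, collect_all_metrics), or None if unavailable."""
    if not prometheus_available():
        return None
    try:
        # Assumption: both helpers are exported from baldur.services.
        from baldur.services import collect_all_metrics, record_sla_breach
    except ImportError:
        # Covers a missing baldur install or a lazy symbol that cannot resolve.
        return None
    return record_sla_breach, collect_all_metrics


helpers = load_metrics_helpers()
if helpers is None:
    print("metrics helpers unavailable; continuing without metrics")
```

Returning None rather than raising keeps metrics optional for callers that only need the control API.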

Control API

ControlAPIService

ControlAPIService()

Baldur Control API Service.

Provides a unified, auditable, reversible, and governed control surface to manage reliability behaviors across testing, chaos experimentation, and real production operations.

Usage

service = ControlAPIService()

# Execute control action
response = service.execute(
    ControlRequest(
        service_name="payment",
        action="allow",
        reason="PG recovered",
        environment="ops",
    )
)

# Get current status
status = service.get_status(environment="ops")

# Get audit logs
logs = service.get_audit_logs(service_name="payment")

Initialize the Control API Service.

execute

execute(request: ControlRequest) -> ControlResponse

Execute a control API action.

Parameters:

Name | Type | Description | Default
request | ControlRequest | Control request | required

Returns:

Type | Description
ControlResponse | ControlResponse with outcome

get_status

get_status(environment: str = 'ops') -> dict

Get the current status of all services.

Parameters:

Name | Type | Description | Default
environment | str | Current environment context | 'ops'

Returns:

Type | Description
dict | Status dictionary with all service states

get_service_status

get_service_status(service_name: str) -> dict

Get the status of a specific service.

Parameters:

Name | Type | Description | Default
service_name | str | Service to check | required

Returns:

Type | Description
dict | Service state dictionary

is_failure_injection_active

is_failure_injection_active(service_name: str) -> bool

Check if failure injection is active for a service.

Parameters:

Name | Type | Description | Default
service_name | str | Service to check | required

Returns:

Type | Description
bool | True if failures should be injected

get_failure_injection_config

get_failure_injection_config(
    service_name: str,
) -> dict | None

Get failure injection configuration for a service.

Parameters:

Name | Type | Description | Default
service_name | str | Service to check | required

Returns:

Type | Description
dict | None | Configuration dict or None
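
The two failure-injection methods are naturally used together: check whether injection is active, then read the configuration. The following is a minimal sketch of such a guard, not Baldur's own integration. `call_with_injection`, `InjectedFailure`, `FakeControlService`, and the `{"mode": "error"}` config shape are all hypothetical; the real configuration keys are not documented on this page.

```python
from typing import Callable, Optional, Protocol, TypeVar

T = TypeVar("T")


class FailureInjectionSource(Protocol):
    """The two ControlAPIService methods this sketch relies on."""

    def is_failure_injection_active(self, service_name: str) -> bool: ...

    def get_failure_injection_config(self, service_name: str) -> Optional[dict]: ...


class InjectedFailure(RuntimeError):
    """Hypothetical exception raised instead of making the real call."""

    def __init__(self, service_name: str, config: dict) -> None:
        super().__init__(f"injected failure for {service_name!r}")
        self.service_name = service_name
        self.config = config


def call_with_injection(
    source: FailureInjectionSource, service_name: str, call: Callable[[], T]
) -> T:
    """Run `call` unless failure injection is active for `service_name`."""
    if source.is_failure_injection_active(service_name):
        # The config may be None, so fall back to an empty dict.
        config = source.get_failure_injection_config(service_name) or {}
        raise InjectedFailure(service_name, config)
    return call()


class FakeControlService:
    """Stand-in for ControlAPIService so the sketch runs without baldur."""

    def __init__(self, injected: dict) -> None:
        self._injected = injected

    def is_failure_injection_active(self, service_name: str) -> bool:
        return service_name in self._injected

    def get_failure_injection_config(self, service_name: str) -> Optional[dict]:
        return self._injected.get(service_name)


# Hypothetical config shape; only "payment" has injection enabled here.
fake = FakeControlService({"payment": {"mode": "error"}})
result = call_with_injection(fake, "orders", lambda: "charged")
```

In real use, an instance of ControlAPIService would take the place of `FakeControlService`, since it exposes the same two methods.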

get_metrics

get_metrics() -> dict

Collect comprehensive baldur metrics for trend analysis.

Returns operational metrics for dashboards, AI agents, and monitoring. Unlike status (point-in-time snapshot), metrics provide trend data.

Consumers:

- Admin UI: Dashboard visualization
- AI Agent: Automated decision making
- Prometheus/Grafana: Metrics scraping
- External Monitoring: Alerting integration

Returns:

Type | Description
dict | Dictionary with comprehensive metrics data

ControlRequest dataclass

ControlRequest(
    service_name: str,
    action: str,
    reason: str,
    environment: str,
    ttl_minutes: int | None = None,
    request_id: str = (lambda: str(uuid.uuid4()))(),
    metadata: dict = dict(),
    actor: str = "system",
    actor_role: str = "automation",
)

Internal representation of a control API request.
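
The rendered defaults `(lambda: str(uuid.uuid4()))()` and `dict()` look like per-instance default factories as shown by the documentation generator. The stand-in below illustrates that reading. It assumes `field(default_factory=...)` semantics, which this page does not confirm, and `ControlRequestSketch` is a hypothetical class, not the real ControlRequest.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ControlRequestSketch:
    """Stand-in mirroring the documented ControlRequest fields."""

    service_name: str
    action: str
    reason: str
    environment: str
    ttl_minutes: Optional[int] = None
    # Assumption: each instance gets a fresh UUID and a fresh dict.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)
    actor: str = "system"
    actor_role: str = "automation"


first = ControlRequestSketch("payment", "allow", "PG recovered", "ops")
# ttl_minutes=30 is an illustrative value for a time-limited action.
second = ControlRequestSketch(
    "payment", "allow", "PG recovered", "ops", ttl_minutes=30
)

# Under the default_factory assumption, the two requests share nothing.
print(first.request_id != second.request_id)
print(first.metadata is not second.metadata)
```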

ControlResponse dataclass

ControlResponse(
    status: str,
    action_applied: str,
    system_state: str = "",
    effective_until: str | None = None,
    reason_classification: str = "",
    evidence: dict = dict(),
    correlation_id: str = (lambda: str(uuid.uuid4()))(),
    error_code: str = "",
    error_message: str = "",
    risk_level: str = "",
)

Bases: SerializableMixin

Internal representation of a control API response.

Metrics

record_sla_breach

record_sla_breach(domain: str) -> None

Record an SLA breach event.

Parameters:

Name | Type | Description | Default
domain | str | Business domain where breach occurred | required
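
As an illustration of where a call to record_sla_breach might sit, the sketch below times a block of work and reports a breach when it runs over a limit. This page does not define what counts as a breach, so the time-limit rule, the `sla_guard` name, and the limit values are assumptions. The recorder is passed in as a callable so the example runs without baldur installed.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def sla_guard(
    domain: str, limit_seconds: float, record_breach: Callable[[str], None]
) -> Iterator[None]:
    """Call record_breach(domain) if the wrapped block exceeds limit_seconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even if the block raises, so slow failures are counted too.
        if time.monotonic() - start > limit_seconds:
            record_breach(domain)


# Stand-in recorder; in real use, pass record_sla_breach instead.
breaches: list = []

with sla_guard("payment", limit_seconds=0.0, record_breach=breaches.append):
    time.sleep(0.05)  # any measurable work exceeds a zero-second limit

with sla_guard("orders", limit_seconds=60.0, record_breach=breaches.append):
    pass  # finishes well within the limit, so nothing is recorded
```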

collect_all_metrics

collect_all_metrics() -> dict

Collect all baldur metrics.

This should be called by a periodic Celery task.

Returns:

Type | Description
dict | Dictionary with all current metric values
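
The periodic Celery task mentioned above could be wired up roughly as follows. This is a sketch only: the task name `myapp.tasks.collect_baldur_metrics`, the 60-second interval, and the `baldur.services` import path are assumptions. Celery itself is not imported, so the fragment runs standalone.

```python
from typing import Callable, Optional

# Hypothetical task name; adjust to your project layout.
TASK_NAME = "myapp.tasks.collect_baldur_metrics"

# Celery beat schedule entry, to be assigned to app.conf.beat_schedule.
# The 60-second interval is an assumed value, not a documented one.
beat_schedule = {
    "collect-baldur-metrics": {
        "task": TASK_NAME,
        "schedule": 60.0,  # seconds
    },
}


def collect_baldur_metrics(collect: Optional[Callable[[], dict]] = None) -> dict:
    """Task body; in a real app, register it with @app.task(name=TASK_NAME)."""
    if collect is None:
        # Assumption: collect_all_metrics is exported from baldur.services.
        from baldur.services import collect_all_metrics as collect
    return collect()


# Exercise the task body with a stand-in collector.
snapshot = collect_baldur_metrics(lambda: {"dlq_pending_count": 0})
```

Keeping the collector injectable makes the task body easy to test without a broker or the [prometheus] extra.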

ALERTING_RULES module-attribute

ALERTING_RULES: dict = {
    "DLQPendingHigh": {
        "expr": "dlq_pending_count > 10",
        "for": "5m",
        "severity": "warning",
        "team": "ops",
        "summary": "DLQ pending count is high",
        "description": "More than 10 items pending in DLQ for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/dlq-pending-high",
    },
    "DLQPendingCritical": {
        "expr": "dlq_pending_count > 50",
        "for": "5m",
        "severity": "critical",
        "team": "ops",
        "summary": "DLQ pending count is critical",
        "description": "More than 50 items pending in DLQ for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/dlq-pending-critical",
    },
    "DLQGrowthRateHigh": {
        "expr": "rate(dlq_created_total[5m]) > 5",
        "for": "5m",
        "severity": "warning",
        "team": "ops",
        "summary": "DLQ growth rate is high",
        "description": "More than 5 new DLQ items per minute for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/dlq-growth-high",
    },
    "RetrySuccessRateLow": {
        "expr": "retry_success_rate < 70",
        "for": "15m",
        "severity": "warning",
        "team": "dev",
        "summary": "Retry success rate is low",
        "description": "Retry success rate below 70% for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/retry-success-low",
    },
    "CircuitBreakerOpen": {
        "expr": "circuit_breaker_state == 1",
        "for": "1m",
        "severity": "critical",
        "team": "ops",
        "summary": "Circuit breaker is open",
        "description": "Circuit breaker for {{ $labels.service }} is in OPEN state",
        "runbook_url": "https://docs.internal/runbooks/circuit-breaker-open",
    },
    "CircuitBreakerOpenLong": {
        "expr": "circuit_breaker_state == 1",
        "for": "10m",
        "severity": "critical",
        "team": "ops",
        "summary": "Circuit breaker open for extended period",
        "description": "Circuit breaker for {{ $labels.service }} has been open for more than 10 minutes",
        "runbook_url": "https://docs.internal/runbooks/circuit-breaker-extended",
    },
    "SLABreachDetected": {
        "expr": "increase(sla_breach_total[1h]) > 0",
        "for": "0m",
        "severity": "warning",
        "team": "ops",
        "summary": "SLA breach detected",
        "description": "SLA breach detected for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/sla-breach",
    },
    "RecoveryTimeSlow": {
        "expr": "histogram_quantile(0.95, rate(recovery_time_seconds_bucket[1h])) > 1800",
        "for": "15m",
        "severity": "warning",
        "team": "ops",
        "summary": "Recovery time P95 is slow",
        "description": "95th percentile recovery time exceeds 30 minutes",
        "runbook_url": "https://docs.internal/runbooks/recovery-slow",
    },
    "HumanReviewQueueLong": {
        "expr": "histogram_quantile(0.95, rate(human_review_queue_time_seconds_bucket[1h])) > 3600",
        "for": "30m",
        "severity": "warning",
        "team": "ops",
        "summary": "Human review queue time is high",
        "description": "Items waiting more than 1 hour for human review",
        "runbook_url": "https://docs.internal/runbooks/review-queue-long",
    },
    "ReplayFailureRateHigh": {
        "expr": "sum(rate(replay_outcomes_total{outcome='failure'}[1h])) / sum(rate(replay_outcomes_total[1h])) > 0.5",
        "for": "15m",
        "severity": "warning",
        "team": "dev",
        "summary": "Replay failure rate is high",
        "description": "More than 50% of replay attempts are failing",
        "runbook_url": "https://docs.internal/runbooks/replay-failure-high",
    },
    "ErrorBudgetCritical": {
        "expr": 'error_budget_remaining_percent{tier="critical"} < 10 or error_budget_remaining_percent{tier="standard"} < 20 or error_budget_remaining_percent{tier="non_essential"} < 30',
        "for": "5m",
        "severity": "critical",
        "team": "ops",
        "summary": "Error budget critical - tier-aware deployment freeze",
        "description": "Error budget remaining is {{ $value }}% (tier={{ $labels.tier }}, region={{ $labels.region }}). Deployment freeze is recommended.",
        "runbook_url": "https://docs.internal/runbooks/error-budget-critical",
    },
    "ErrorBudgetWarning": {
        "expr": 'error_budget_remaining_percent{tier="critical"} < 30 or error_budget_remaining_percent{tier="standard"} < 50 or error_budget_remaining_percent{tier="non_essential"} < 60',
        "for": "10m",
        "severity": "warning",
        "team": "ops",
        "summary": "Error budget warning - tier-aware",
        "description": "Error budget remaining is {{ $value }}% (tier={{ $labels.tier }}, region={{ $labels.region }}). Consider reducing deployments.",
        "runbook_url": "https://docs.internal/runbooks/error-budget-warning",
    },
    "ErrorBudgetFastBurn": {
        "expr": "error_budget_burn_rate_1h > 14.4",
        "for": "5m",
        "severity": "critical",
        "team": "ops",
        "summary": "Fast error budget burn detected",
        "description": "1-hour burn rate is {{ $value }}x. Consuming 2%+ budget per hour.",
        "runbook_url": "https://docs.internal/runbooks/error-budget-fast-burn",
    },
    "ErrorBudgetSlowBurn": {
        "expr": "error_budget_burn_rate_6h > 3",
        "for": "30m",
        "severity": "warning",
        "team": "ops",
        "summary": "Slow error budget burn detected",
        "description": "6-hour burn rate is {{ $value }}x. Sustained elevated error rate.",
        "runbook_url": "https://docs.internal/runbooks/error-budget-slow-burn",
    },
    "DeploymentFreezeActive": {
        "expr": "deployment_freeze_status >= 3",
        "for": "0m",
        "severity": "info",
        "team": "ops",
        "summary": "Deployment freeze is active",
        "description": "Deployment freeze is recommended or in effect.",
        "runbook_url": "https://docs.internal/runbooks/deployment-freeze",
    },
    "FailSafeTriggered": {
        "expr": "increase(baldur_failsafe_triggered_total[5m]) > 0",
        "for": "0m",
        "severity": "critical",
        "team": "ops",
        "summary": "🚨 Baldur Fail-Safe mode activated",
        "description": "Baldur system component '{{ $labels.component }}' has failed and Fail-Safe mode is active. Deployments are proceeding but system needs immediate attention.",
        "runbook_url": "https://docs.internal/runbooks/baldur-failsafe",
    },
    "FailSafeModeActive": {
        "expr": "baldur_failsafe_mode_active == 1",
        "for": "2m",
        "severity": "critical",
        "team": "ops",
        "summary": "🚨 Baldur in degraded mode",
        "description": "Baldur '{{ $labels.component }}' is operating in Fail-Safe mode. Error Budget recommendations are not available. Investigate and restore normal operation immediately.",
        "runbook_url": "https://docs.internal/runbooks/baldur-failsafe",
    },
    "BaldurServiceDead": {
        "expr": "time() - baldur_heartbeat_timestamp_seconds > 120",
        "for": "0m",
        "severity": "critical",
        "team": "ops",
        "summary": "🔴 Baldur service is DEAD",
        "description": "No heartbeat received from Baldur '{{ $labels.component }}' for more than 2 minutes. The service may have crashed or is unresponsive. This is a critical infrastructure failure.",
        "runbook_url": "https://docs.internal/runbooks/baldur-dead",
    },
    "BaldurHeartbeatMissing": {
        "expr": "absent(baldur_heartbeat_timestamp_seconds) == 1",
        "for": "5m",
        "severity": "critical",
        "team": "ops",
        "summary": "🔴 Baldur heartbeat metric missing",
        "description": "The Baldur heartbeat metric is completely absent. The service may never have started or is not properly initialized.",
        "runbook_url": "https://docs.internal/runbooks/baldur-missing",
    },
    "OverrideEscalation": {
        "expr": "increase(baldur_override_escalation_total[1h]) > 0",
        "for": "0m",
        "severity": "warning",
        "team": "ops",
        "summary": "⚠️ Deployment override escalation",
        "description": "A deployment override of type '{{ $labels.override_type }}' was approved despite insufficient error budget. This action requires governance review.",
        "runbook_url": "https://docs.internal/runbooks/override-escalation",
    },
    "OverrideEscalationHigh": {
        "expr": "increase(baldur_override_escalation_total[24h]) > 5",
        "for": "0m",
        "severity": "critical",
        "team": "ops",
        "summary": "🚨 Excessive deployment overrides",
        "description": "More than 5 deployment overrides in the last 24 hours. This may indicate process issues or sustained reliability problems.",
        "runbook_url": "https://docs.internal/runbooks/override-escalation-high",
    },
    "XTestCrossRegionDeniedRateHigh": {
        "expr": "rate(baldur_xtest_cross_region_denied_total[1m]) > 10",
        "for": "1m",
        "severity": "warning",
        "team": "security",
        "summary": "⚠️ High rate of cross-region X-Test denials",
        "description": "Cross-region X-Test denial rate exceeds 10/min. Current region: {{ $labels.current_region }}, Target region: {{ $labels.target_region }}. This may indicate misconfigured clients or attempted cross-region access.",
        "runbook_url": "https://docs.internal/runbooks/xtest-cross-region-denied",
    },
    "XTestCrossRegionDeniedFromSameSource": {
        "expr": "sum by (current_region, target_region) (increase(baldur_xtest_cross_region_denied_total[5m])) > 5",
        "for": "0m",
        "severity": "warning",
        "team": "security",
        "summary": "🔒 Repeated cross-region X-Test denials detected",
        "description": "More than 5 cross-region denials in 5 minutes from the same source. This may indicate a security issue or misconfigured automation. Investigate the source of these requests immediately.",
        "runbook_url": "https://docs.internal/runbooks/xtest-cross-region-repeated",
    },
    "TierStarvationNonEssential": {
        "expr": '(rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) / (rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) + rate(baldur_rate_controller_processed_total{tier="non_essential"}[5m])) > 0.99) and ((rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) + rate(baldur_rate_controller_processed_total{tier="non_essential"}[5m])) > 10)',
        "for": "10m",
        "severity": "warning",
        "team": "ops",
        "summary": "non_essential tier rejecting 99%+ — starvation suspected",
        "description": "Over the last 5 minutes, 99%+ of non_essential requests were rejected out of >10 total. If processed_by_tier_total is near 0, this is full starvation. Check the backpressure level and watermark settings.",
        "runbook_url": "https://docs.internal/runbooks/tier-starvation",
    },
}
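
Each entry carries the fields a Prometheus alerting rule needs. As a sketch of how the mapping could be turned into a rule group, the function below treats `severity` and `team` as labels and `summary`, `description`, and `runbook_url` as annotations. That split follows common Prometheus practice and is an assumption here. `to_prometheus_rules` and the group name are hypothetical, and this is not Baldur's own exporter.

```python
import json


def to_prometheus_rules(rules: dict, group_name: str = "baldur") -> dict:
    """Convert an ALERTING_RULES-style mapping into a Prometheus rule group."""
    converted = []
    for name, rule in rules.items():
        converted.append(
            {
                "alert": name,
                "expr": rule["expr"],
                "for": rule["for"],
                # Assumed split: routing fields become labels.
                "labels": {"severity": rule["severity"], "team": rule["team"]},
                # Assumed split: human-readable fields become annotations.
                "annotations": {
                    "summary": rule["summary"],
                    "description": rule["description"],
                    "runbook_url": rule["runbook_url"],
                },
            }
        )
    return {"groups": [{"name": group_name, "rules": converted}]}


# One entry copied from ALERTING_RULES so the sketch runs without baldur.
sample_rules = {
    "DLQPendingHigh": {
        "expr": "dlq_pending_count > 10",
        "for": "5m",
        "severity": "warning",
        "team": "ops",
        "summary": "DLQ pending count is high",
        "description": "More than 10 items pending in DLQ for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/dlq-pending-high",
    },
}

rule_file = to_prometheus_rules(sample_rules)
print(json.dumps(rule_file, indent=2))
```

The resulting dictionary can be written out with the YAML library of your choice to produce a rule file.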