baldur_pro.services.emergency_mode — Emergency Mode
System-wide emergency levels and the manager that activates them:
EmergencyLevel, get_emergency_manager, and is_emergency_active.
🔒 PRO Feature — requires a baldur-pro license
These symbols ship in the baldur-pro distribution. PRO modules import
normally — there is no ImportError. PRO features activate only when
baldur.init() runs with a valid BALDUR_LICENSE_KEY; without it the system
runs with OSS defaults and register_pro_services() logs
entitlement.pro_registration_skipped.
emergency_mode
Emergency Mode Service — Single-instance emergency lifecycle.
Manages emergency mode activation/deactivation, traffic tier rules (Load Shedding), and gradual recovery for a single service instance.
For multi-region namespace-scoped emergency management (regional isolation,
cascade detection, partition reconciliation), see regional_emergency/.
Architecture
models.emergency.EmergencyLevel (shared domain type)
│
├── emergency_mode/ ← this package (single-instance lifecycle)
└── regional_emergency/ (multi-region extension, depends on this)

Features:
- EmergencyLevel: emergency levels (NORMAL, LEVEL_1, LEVEL_2, LEVEL_3)
- GracefulDegradationManager: emergency mode activation/deactivation
- RecoveryGate: metric-based recovery stabilization and gradual recovery
Usage
from baldur_pro.services.emergency_mode import (
    get_emergency_manager,
    EmergencyLevel,
    is_emergency_active,
)
Check emergency status
if is_emergency_active():
    level = get_emergency_manager().get_current_level()
    ...
Activate emergency mode
get_emergency_manager().activate_manual(
    level=EmergencyLevel.LEVEL_2,
    reason="High error rate detected",
    activated_by="admin",
    duration_minutes=30,
)
Deactivate emergency mode
get_emergency_manager().deactivate(deactivated_by="admin")
Status: Public
RecoveryGateConfig
dataclass
RecoveryGateConfig(
stabilization_period_seconds: int = 300,
require_metrics_stable: bool = True,
cpu_threshold_percent: float = 80.0,
error_rate_threshold: float = 0.05,
gradual_recovery: bool = True,
level_step_delay_seconds: int = 60,
health_check_interval_seconds: int = 30,
auto_rollback_on_failure: bool = True,
)
Bases: SerializableMixin
Stabilization-window configuration for safe emergency exit.
Defines how long the system must remain stable before the recovery gate releases emergency mode, plus the metric thresholds that count as "stable". Defaults are read from EmergencyModeSettings when available.
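The stabilization window described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `StabilizationWindow` and its `observe` method are hypothetical names, and the assumption that any unstable sample restarts the window is ours rather than something the source states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StabilizationWindow:
    """Hypothetical helper: tracks how long the system has been continuously stable."""

    period_seconds: int = 300  # mirrors stabilization_period_seconds
    stable_since: float | None = None

    def observe(self, is_stable: bool, now: float) -> bool:
        """Record one health check; return True once the window has fully elapsed."""
        if not is_stable:
            # Assumption: any unstable sample restarts the window.
            self.stable_since = None
            return False
        if self.stable_since is None:
            self.stable_since = now
        return now - self.stable_since >= self.period_seconds


window = StabilizationWindow(period_seconds=300)
# (timestamp in seconds, whether metrics were within thresholds)
samples = [(0, True), (30, True), (60, False), (90, True), (390, True)]
results = [window.observe(ok, float(t)) for t, ok in samples]
print(results)  # [False, False, False, False, True]
```

The unstable sample at t=60 resets the clock, so the gate only opens at t=390, a full 300 seconds after stability resumed at t=90.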
stabilization_period_seconds
class-attribute
instance-attribute
stabilization_period_seconds: int = 300
Stabilization wait window (seconds).
require_metrics_stable
class-attribute
instance-attribute
require_metrics_stable: bool = True
Whether metric-based stability checks are required.
cpu_threshold_percent
class-attribute
instance-attribute
cpu_threshold_percent: float = 80.0
CPU usage must be at or below this to count as stable.
error_rate_threshold
class-attribute
instance-attribute
error_rate_threshold: float = 0.05
Error rate must be at or below this (5%) to count as stable.
gradual_recovery
class-attribute
instance-attribute
gradual_recovery: bool = True
Whether to step down emergency level gradually.
level_step_delay_seconds
class-attribute
instance-attribute
level_step_delay_seconds: int = 60
Delay between level-down steps (seconds).
health_check_interval_seconds
class-attribute
instance-attribute
health_check_interval_seconds: int = 30
Metric re-check cadence during recovery (seconds).
auto_rollback_on_failure
class-attribute
instance-attribute
auto_rollback_on_failure: bool = True
Whether to roll back automatically when recovery fails.
from_settings
classmethod
from_settings() -> RecoveryGateConfig
Create from EmergencyModeSettings, with hardcoded fallback.
EmergencyLevel
Bases: str, Enum
Emergency level definitions.
Each level determines per-tier traffic multipliers. Ordering: NORMAL < LEVEL_1 < LEVEL_2 < LEVEL_3 (severity-based).
severity
property
severity: int
Numeric severity for ordering comparisons and backward compatibility.
from_severity
classmethod
from_severity(severity: int) -> EmergencyLevel
Create EmergencyLevel from numeric severity (0-3).
Supports legacy integer-based code that used IntEnum values.
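Because the enum derives from `str`, plain string comparison would order members lexically, which is why a numeric severity is useful. The sketch below is a stand-in written for illustration only: `SketchLevel` is a hypothetical class, its member values are assumed, and deriving severity from definition order is one possible approach, not necessarily what `EmergencyLevel` does.

```python
from enum import Enum


class SketchLevel(str, Enum):
    """Stand-in for EmergencyLevel; the string values here are assumed."""

    NORMAL = "normal"
    LEVEL_1 = "level_1"
    LEVEL_2 = "level_2"
    LEVEL_3 = "level_3"

    @property
    def severity(self) -> int:
        # Definition order doubles as severity: NORMAL=0 ... LEVEL_3=3.
        return list(type(self)).index(self)

    @classmethod
    def from_severity(cls, severity: int) -> "SketchLevel":
        members = list(cls)
        if not 0 <= severity < len(members):
            raise ValueError(f"severity must be 0-{len(members) - 1}, got {severity}")
        return members[severity]

    # str already defines the comparison operators, so all four are
    # overridden explicitly to compare by severity instead of by text.
    def __lt__(self, other):
        if isinstance(other, SketchLevel):
            return self.severity < other.severity
        return NotImplemented

    def __le__(self, other):
        if isinstance(other, SketchLevel):
            return self.severity <= other.severity
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, SketchLevel):
            return self.severity > other.severity
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, SketchLevel):
            return self.severity >= other.severity
        return NotImplemented


print(SketchLevel.from_severity(2))                # SketchLevel.LEVEL_2
print(SketchLevel.NORMAL >= SketchLevel.LEVEL_2)   # False (severity-based)
print("normal" >= "level_2")                       # True (lexical pitfall)
```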
EmergencyModeError
EmergencyModeError(message: str = '', *, code: str = '')
EmergencyStateError
EmergencyStateError(
message: str, *, operation: str = "", detail: str = ""
)
Bases: EmergencyModeError
Invalid state or input for emergency mode operation.
Covers: missing parameters, inactive emergency mode, duplicate recovery, invalid target level.
RecoveryNotAllowedError
RecoveryNotAllowedError(
message: str, *, check_reason: str = ""
)
GracefulDegradationManager
Bases: EventEmitterMixin
Emergency-mode manager - stepwise entry into / exit from emergency mode.
Implemented as a thread-safe singleton.
Usage
manager = GracefulDegradationManager()
Manually activate emergency mode
manager.activate_manual(
    level=EmergencyLevel.LEVEL_2,
    reason="High error rate",
    activated_by="admin",
    duration_minutes=30,
)
Check the current state
state = manager.get_state()
Deactivate emergency mode
manager.deactivate(deactivated_by="admin")
close
close() -> None
Unsubscribe EventBus handlers.
Idempotent: safe to call multiple times.
get_state
get_state() -> EmergencyState
Get the current state.
get_current_level
get_current_level() -> EmergencyLevel
Get the current emergency-mode level.
is_active
is_active() -> bool
Whether emergency mode is active.
get_previous_states
get_previous_states() -> list[dict[str, Any]]
Get the list of previous-state snapshots.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | Snapshot list (newest first) |
rollback_to_previous
rollback_to_previous(
index: int = 0,
) -> EmergencyState | None
Roll back to a previous state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Snapshot index to roll back to (0 = most recent, 1 = the one before, ...) | 0 |

Returns:
| Type | Description |
|---|---|
| EmergencyState \| None | The rolled-back state, or None on failure |
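The newest-first indexing used by `get_previous_states` and `rollback_to_previous` can be pictured with a bounded deque. This is a sketch under assumptions: `SnapshotHistory`, its methods, and the `max_snapshots` cap are hypothetical, and whether the real manager discards newer snapshots after a rollback is not stated in the source.

```python
from __future__ import annotations

from collections import deque
from typing import Any


class SnapshotHistory:
    """Hypothetical newest-first snapshot store (index 0 = most recent)."""

    def __init__(self, max_snapshots: int = 10) -> None:
        # With maxlen set, appendleft drops the oldest entry from the right.
        self._snapshots: deque[dict[str, Any]] = deque(maxlen=max_snapshots)

    def push(self, state: dict[str, Any]) -> None:
        self._snapshots.appendleft(dict(state))

    def snapshots(self) -> list[dict[str, Any]]:
        return [dict(s) for s in self._snapshots]

    def rollback(self, index: int = 0) -> dict[str, Any] | None:
        # Mirrors the documented contract: None when the index is not available.
        if not 0 <= index < len(self._snapshots):
            return None
        return dict(self._snapshots[index])


history = SnapshotHistory(max_snapshots=2)
history.push({"level": "level_1"})
history.push({"level": "level_2"})
history.push({"level": "level_3"})  # evicts the level_1 snapshot

print(history.rollback(0))  # {'level': 'level_3'}
print(history.rollback(1))  # {'level': 'level_2'}
print(history.rollback(2))  # None
```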
get_tier_multiplier
get_tier_multiplier(tier_id: str) -> float
Return the tier multiplier for the current emergency-mode level.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tier_id | str | Tier ID (critical, standard, non_essential) | required |

Returns:
| Type | Description |
|---|---|
| float | Multiplier (0.0 ~ 1.0) |
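One way to picture a per-level, per-tier multiplier is a nested lookup table. The numbers below are invented purely for illustration and are not the library's actual tier rules; `HYPOTHETICAL_MULTIPLIERS` and `lookup_multiplier` are hypothetical names, and the string level keys are assumed.

```python
# All values are made up for this sketch; real tier rules may differ.
HYPOTHETICAL_MULTIPLIERS: dict[str, dict[str, float]] = {
    "normal": {"critical": 1.0, "standard": 1.0, "non_essential": 1.0},
    "level_1": {"critical": 1.0, "standard": 0.8, "non_essential": 0.5},
    "level_2": {"critical": 1.0, "standard": 0.5, "non_essential": 0.1},
    "level_3": {"critical": 0.8, "standard": 0.2, "non_essential": 0.0},
}


def lookup_multiplier(level: str, tier_id: str) -> float:
    """Return the fraction of traffic admitted for a tier at a given level."""
    value = HYPOTHETICAL_MULTIPLIERS[level][tier_id]
    # Clamp defensively so callers can rely on the documented 0.0 to 1.0 range.
    return min(1.0, max(0.0, value))


print(lookup_multiplier("level_2", "standard"))       # 0.5
print(lookup_multiplier("level_3", "non_essential"))  # 0.0
```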
activate_manual
activate_manual(
level: EmergencyLevel,
reason: str,
activated_by: str,
duration_minutes: int | None = None,
is_chaos_experiment: bool = False,
experiment_id: str | None = None,
override_kill_switch: bool = False,
) -> EmergencyState
Manually activate emergency mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| level | EmergencyLevel | Emergency-mode level | required |
| reason | str | Activation reason (required) | required |
| activated_by | str | User who activated it | required |
| duration_minutes | int \| None | Auto-expiry time (minutes); None requires manual deactivation | None |
| is_chaos_experiment | bool | Whether activation is from a chaos experiment | False |
| experiment_id | str \| None | Related chaos experiment ID | None |
| override_kill_switch | bool | True to allow activation when kill switch is ON | False |

Returns:
| Type | Description |
|---|---|
| EmergencyState | The new state |
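The relationship between `duration_minutes` and an expiry timestamp can be sketched as below. This assumes that the `str | None` timestamp fields on `EmergencyState` hold ISO 8601 strings, which the source does not confirm; `compute_expires_at` is a hypothetical helper, not part of the package.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def compute_expires_at(activated_at: datetime, duration_minutes: int | None) -> str | None:
    """Return an ISO 8601 expiry, or None when manual deactivation is required."""
    if duration_minutes is None:
        return None
    return (activated_at + timedelta(minutes=duration_minutes)).isoformat()


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(compute_expires_at(start, 30))    # 2024-01-01T12:30:00+00:00
print(compute_expires_at(start, None))  # None
```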
activate_auto
activate_auto(
level: EmergencyLevel,
reason: str,
duration_minutes: int | None = None,
) -> EmergencyState
Automatically activate emergency mode (system-triggered).
Automated activations respect the kill switch unconditionally.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| level | EmergencyLevel | Emergency-mode level | required |
| reason | str | Auto-detection reason | required |
| duration_minutes | int \| None | Auto-expiry time (default 30 minutes) | None |

Returns:
| Type | Description |
|---|---|
| EmergencyState | The new state |
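The difference in kill-switch handling between manual and automated activation can be summarized as a small decision function. This is a reading of the parameter descriptions above, not the library's code: `activation_permitted` is a hypothetical name, and it assumes that an engaged kill switch blocks activation unless a manual caller explicitly overrides it.

```python
def activation_permitted(
    kill_switch_on: bool,
    *,
    is_manual: bool,
    override_kill_switch: bool = False,
) -> bool:
    """Decide whether an activation request may proceed (illustrative only)."""
    if not kill_switch_on:
        return True
    # Only manual activations can override; automated ones always respect the switch.
    return is_manual and override_kill_switch


print(activation_permitted(True, is_manual=True, override_kill_switch=True))   # True
print(activation_permitted(True, is_manual=True))                              # False
print(activation_permitted(True, is_manual=False, override_kill_switch=True))  # False
```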
deactivate
deactivate(
deactivated_by: str,
reason: str = "",
force: bool = False,
) -> EmergencyState
Deactivate emergency mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| deactivated_by | str | User who deactivated it | required |
| reason | str | Deactivation reason (optional) | '' |
| force | bool | Force deactivation, ignoring recovery conditions | False |

Returns:
| Type | Description |
|---|---|
| EmergencyState | The new state |
start_gradual_recovery
start_gradual_recovery(
initiated_by: str,
target_level: EmergencyLevel = EmergencyLevel.NORMAL,
) -> EmergencyState
Start gradual recovery.
Eases stepwise from the current level down to the target level. Proceeds after a metrics check at each step.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| initiated_by | str | User who started the recovery | required |
| target_level | EmergencyLevel | Target level (default: NORMAL) | NORMAL |

Returns:
| Type | Description |
|---|---|
| EmergencyState | The current state |
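The stepwise easing with a metrics check at each step might look like the loop below. It is a simplified sketch: `plan_recovery` and the string level names are hypothetical, the delay between steps (`level_step_delay_seconds`) is omitted, and stopping at the first failed check is an assumption about behavior the source does not spell out.

```python
from typing import Callable

# Index in this list is the assumed severity (0 = normal).
LEVELS = ["normal", "level_1", "level_2", "level_3"]


def plan_recovery(current: str, target: str, metrics_ok: Callable[[], bool]) -> list[str]:
    """Step down one level at a time while metrics_ok() holds; return levels reached."""
    reached: list[str] = []
    idx = LEVELS.index(current)
    target_idx = LEVELS.index(target)
    while idx > target_idx:
        if not metrics_ok():
            break  # Assumption: recovery halts at the first failed check.
        idx -= 1
        reached.append(LEVELS[idx])
    return reached


print(plan_recovery("level_3", "normal", lambda: True))
# ['level_2', 'level_1', 'normal']

checks = iter([True, False])
print(plan_recovery("level_3", "normal", lambda: next(checks)))
# ['level_2']
```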
stop_gradual_recovery
stop_gradual_recovery(
stopped_by: str, reason: str = ""
) -> EmergencyState
Stop gradual recovery.
get_history
get_history(limit: int = 50) -> list[dict[str, Any]]
Get the change history.
set_recovery_gate_config
set_recovery_gate_config(
config: RecoveryGateConfig, changed_by: str = "system"
)
Update the recovery gate configuration.
get_recovery_gate_config
get_recovery_gate_config() -> RecoveryGateConfig
Get the current recovery gate configuration.
reset
reset()
Reset state (for tests).
EmergencyState
dataclass
EmergencyState(
level: EmergencyLevel = EmergencyLevel.NORMAL,
is_active: bool = False,
activated_at: str | None = None,
activated_by: str | None = None,
activation_reason: str | None = None,
expires_at: str | None = None,
deactivated_at: str | None = None,
deactivated_by: str | None = None,
is_auto_triggered: bool = False,
is_recovering: bool = False,
recovery_started_at: str | None = None,
target_level: EmergencyLevel | None = None,
metadata: dict[str, Any] | None = None,
)
Bases: SerializableMixin
Emergency-mode state.
metadata
class-attribute
instance-attribute
metadata: dict[str, Any] | None = None
Additional metadata.
When activated by a chaos experiment
{
    "is_chaos_experiment": True,
    "experiment_id": "exp-xxx",
    "classification": "chaos_induced_test"
}
RecoveryGate
RecoveryGate(
config: RecoveryGateConfig | None = None,
metrics_checker: (
Callable[[], dict[str, float]] | None
) = None,
)
Recovery gate - manages safe deactivation of emergency mode.
Features:
- Metric-based stability checks
- Gradual recovery (stepwise easing per level)
- Automatic rollback on recovery failure
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | RecoveryGateConfig \| None | Recovery gate configuration | None |
| metrics_checker | Callable[[], dict[str, float]] \| None | Callback returning the current system metrics. Example return value: {"cpu_percent": 75.0, "error_rate": 0.02} | None |
check_recovery_allowed
check_recovery_allowed() -> tuple[bool, str]
Check whether recovery is allowed.
Returns:
| Type | Description |
|---|---|
| tuple[bool, str] | (is_allowed, reason): whether recovery is allowed, and the reason |
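A metrics-based check returning `(is_allowed, reason)` could be sketched as follows, using the `cpu_percent` and `error_rate` keys shown for `metrics_checker` and the default thresholds from `RecoveryGateConfig`. The function name `evaluate_metrics` and the reason strings are hypothetical; the real gate may also apply the stabilization window and other conditions.

```python
from __future__ import annotations


def evaluate_metrics(
    metrics: dict[str, float],
    *,
    cpu_threshold_percent: float = 80.0,
    error_rate_threshold: float = 0.05,
) -> tuple[bool, str]:
    """Return (is_allowed, reason); values at or below a threshold count as stable."""
    cpu = metrics["cpu_percent"]
    error_rate = metrics["error_rate"]
    if cpu > cpu_threshold_percent:
        return False, f"cpu_percent {cpu} exceeds {cpu_threshold_percent}"
    if error_rate > error_rate_threshold:
        return False, f"error_rate {error_rate} exceeds {error_rate_threshold}"
    return True, "metrics stable"


print(evaluate_metrics({"cpu_percent": 75.0, "error_rate": 0.02}))
# (True, 'metrics stable')
print(evaluate_metrics({"cpu_percent": 91.0, "error_rate": 0.02}))
# (False, 'cpu_percent 91.0 exceeds 80.0')
```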
get_next_recovery_level
get_next_recovery_level(
current_level: EmergencyLevel,
) -> EmergencyLevel | None
Return the next level in gradual recovery.
LEVEL_3 -> LEVEL_2 -> LEVEL_1 -> NORMAL
Returns:
| Type | Description |
|---|---|
| EmergencyLevel \| None | The next level, or None (when already NORMAL) |
get_emergency_manager
get_emergency_manager() -> GracefulDegradationManager
Get the emergency-mode manager singleton.
reset_emergency_manager
reset_emergency_manager() -> None
Reset the emergency manager (clears both module-level and class-level singletons, joins recovery thread).
is_emergency_active
is_emergency_active() -> bool
Check whether emergency mode is active (convenience function).
Usage
if is_emergency_active():
    # Handle emergency mode
    ...
get_emergency_level
get_emergency_level() -> EmergencyLevel
Get the current emergency-mode level (convenience function).
Usage
level = get_emergency_level()
if level >= EmergencyLevel.LEVEL_2:
    # Handle a severe emergency
    ...
get_tier_multiplier
get_tier_multiplier(tier_id: str) -> float
Get the tier multiplier for the current emergency mode (convenience function).
Usage
multiplier = get_tier_multiplier("standard")
if random.random() > multiplier:
    # Load shedding
    return Response(status=503)
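The probabilistic shedding in the usage snippet admits roughly a `multiplier` fraction of requests. The sketch below isolates that decision so it can be checked without a web framework; `should_shed` and `shed_fraction` are hypothetical helpers, and the random draw is passed in explicitly to keep the decision deterministic under test.

```python
import random


def should_shed(multiplier: float, draw: float) -> bool:
    """Shed the request when the uniform draw in [0, 1) exceeds the multiplier."""
    return draw > multiplier


def shed_fraction(multiplier: float, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate the share of requests shed at a given multiplier."""
    rng = random.Random(seed)
    shed = sum(should_shed(multiplier, rng.random()) for _ in range(trials))
    return shed / trials


print(should_shed(1.0, 0.999))  # False: a multiplier of 1.0 admits everything
print(should_shed(0.0, 0.5))    # True: a multiplier of 0.0 sheds nearly everything
print(shed_fraction(0.7))       # close to 0.3
```

A multiplier of 0.7 therefore sheds about 30% of traffic for that tier, while 1.0 sheds none.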