Production Safety¶
Cognitia provides four complementary safety mechanisms for production deployments: cost budgets, guardrails, input filters, and retry/fallback policies. All are opt-in via RuntimeConfig — disabled by default for zero overhead.
Cost Budget Tracking¶
Track accumulated LLM costs and enforce spending limits.
```python
from cognitia.runtime.cost import CostBudget
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    cost_budget=CostBudget(
        max_cost_usd=5.0,            # USD spending cap
        max_total_tokens=1_000_000,  # token cap (optional)
        action_on_exceed="error",    # "error" (stop) or "warn" (continue)
    ),
)
```
How It Works¶
- ThinRuntime creates a `CostTracker` at startup and records usage after each LLM call
- Costs are computed using the bundled `pricing.json` (updated with major model releases)
- Unknown models fall back to `_default` pricing (no crashes on new models)
- When the budget is exceeded: emits a `RuntimeEvent` with `kind="budget_exceeded"` (error mode) or continues with a warning
- The final event includes `total_cost_usd` when budget tracking is active
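The per-call arithmetic behind this accumulation can be sketched in plain Python. This is a simplified illustration, not Cognitia's actual `CostTracker`; the pricing figures mirror the bundled table:

```python
# Simplified sketch of per-call cost accumulation (illustrative,
# not Cognitia's actual CostTracker implementation).
PRICING = {
    "gpt-4o": (2.50, 10.00),   # (input, output) USD per 1M tokens
    "_default": (3.00, 15.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to the "_default" entry
    per_in, per_out = PRICING.get(model, PRICING["_default"])
    return input_tokens / 1_000_000 * per_in + output_tokens / 1_000_000 * per_out

cost = call_cost("gpt-4o", input_tokens=1000, output_tokens=500)  # ≈ $0.0075
```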
Bundled Pricing¶
| Model | Input $/1M tokens | Output $/1M tokens |
|---|---|---|
| claude-sonnet-4-20250514 | 3.00 | 15.00 |
| gpt-4o | 2.50 | 10.00 |
| gpt-4o-mini | 0.15 | 0.60 |
| gemini-2.0-flash | 0.10 | 0.40 |
| _default | 3.00 | 15.00 |
CostBudget Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| max_cost_usd | float \| None | None | Maximum total cost in USD. None disables the cost limit. |
| max_total_tokens | int \| None | None | Maximum total tokens (input + output). None disables the token limit. |
| action_on_exceed | "error" \| "warn" | "error" | "error" emits a budget_exceeded event and stops. "warn" continues with warning status. |
Programmatic Access¶
```python
from cognitia.runtime.cost import CostTracker, CostBudget, load_pricing

budget = CostBudget(max_cost_usd=5.0)
tracker = CostTracker(budget=budget, pricing=load_pricing())
tracker.record("gpt-4o", input_tokens=1000, output_tokens=500)

print(tracker.total_cost_usd)   # accumulated cost
print(tracker.total_tokens)     # accumulated tokens
print(tracker.check_budget())   # "ok" | "warning" | "exceeded"

tracker.reset()                 # zero all counters
```
Custom Pricing¶
Override bundled pricing by passing a custom dict[str, ModelPricing] to CostTracker:
```python
from cognitia.runtime.cost import CostTracker, CostBudget, ModelPricing

custom_pricing = {
    "my-fine-tuned-model": ModelPricing(input_per_1m=5.0, output_per_1m=20.0),
    "_default": ModelPricing(input_per_1m=3.0, output_per_1m=15.0),
}

tracker = CostTracker(
    budget=CostBudget(max_cost_usd=10.0),
    pricing=custom_pricing,
)
```
ModelPricing is a frozen dataclass with two fields: input_per_1m and output_per_1m (USD per 1 million tokens). When CostTracker.record() encounters an unknown model, it falls back to the _default key. If no _default is present and the model is unknown, the call is silently ignored (no cost recorded).
The load_pricing() function loads the bundled pricing.json via importlib.resources, making it reliable inside installed packages.
Guardrails¶
Pre- and post-LLM content checks. Input guardrails run before the LLM call; output guardrails run after. A failed guardrail emits an error event with kind="guardrail_tripwire".
```python
from cognitia.guardrails import (
    ContentLengthGuardrail,
    RegexGuardrail,
    CallerAllowlistGuardrail,
)
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    input_guardrails=[
        ContentLengthGuardrail(max_length=8000),
        RegexGuardrail(patterns=[r"ignore previous instructions"]),
    ],
    output_guardrails=[
        RegexGuardrail(
            patterns=[r"SECRET_\d+"],
            reason="Sensitive data leaked in response",
        ),
    ],
)
```
Built-in Guardrails¶
| Guardrail | Description |
|---|---|
| ContentLengthGuardrail | Rejects text longer than max_length characters (default: 100,000) |
| RegexGuardrail | Rejects text matching any of the given regex patterns |
| CallerAllowlistGuardrail | Rejects requests whose session_id is not in the allowlist |
Custom Guardrails¶
Implement the Guardrail protocol:
```python
from cognitia.guardrails import GuardrailContext, GuardrailResult

class ToxicityGuardrail:
    async def check(self, ctx: GuardrailContext, text: str) -> GuardrailResult:
        # is_toxic() stands in for your own classifier; it is not part of Cognitia
        if is_toxic(text):
            return GuardrailResult(passed=False, reason="Toxic content detected")
        return GuardrailResult(passed=True)
```
Execution Model¶
- All guardrails run in parallel via `asyncio.gather`, so N guardrails don't add linear latency
- The first failure stops execution and emits an error event
- `tripwire=True` in `GuardrailResult` marks a hard, non-recoverable failure
Input Filters¶
Transform messages and system prompt before each LLM call. Filters are applied sequentially in list order.
```python
from cognitia.input_filters import MaxTokensFilter, SystemPromptInjector
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    input_filters=[
        SystemPromptInjector(
            extra_text="Always reply in English.",
            position="prepend",  # or "append"
        ),
        MaxTokensFilter(max_tokens=64_000),
    ],
)
```
Built-in Filters¶
| Filter | Description |
|---|---|
| MaxTokensFilter | Trims older messages to fit within the max_tokens budget. Always preserves the system prompt and the last message. Token estimation: len(text) / chars_per_token (default 4.0). |
| SystemPromptInjector | Prepends or appends text to the system prompt. |
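The trimming strategy behind MaxTokensFilter can be approximated like this. It is a sketch of the described behavior, not the library's code, and the always-preserved system prompt is omitted for brevity:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Cheap heuristic: roughly 4 characters per token by default
    return int(len(text) / chars_per_token)

def trim_messages(messages: list[str], max_tokens: int) -> list[str]:
    kept = list(messages)
    # Drop the oldest messages first; never drop the final message,
    # even if it alone exceeds the budget.
    while len(kept) > 1 and sum(estimate_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)
    return kept

msgs = ["a" * 40, "b" * 40, "c" * 40]  # ~10 estimated tokens each
trimmed = trim_messages(msgs, max_tokens=20)  # oldest message dropped
```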
InputFilter Protocol¶
All filters implement the InputFilter protocol from cognitia.input_filters:
```python
from cognitia.input_filters import InputFilter
from cognitia.runtime.types import Message

class RedactFilter:
    async def filter(
        self, messages: list[Message], system_prompt: str
    ) -> tuple[list[Message], str]:
        # redact_pii() stands in for your own redaction logic
        cleaned = [redact_pii(m) for m in messages]
        return cleaned, system_prompt
```
Filters are applied sequentially in list order. Each filter receives the output of the previous one, forming a pipeline. The final (messages, system_prompt) tuple is passed to the LLM call.
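The pipeline is just a left fold over the filter list. A minimal sketch with two hypothetical plain-function filters (real Cognitia filters are async objects implementing InputFilter):

```python
# Hypothetical filters as plain functions, for illustration only.
def inject_language_rule(messages, system_prompt):
    return messages, system_prompt + " Always reply in English."

def drop_empty(messages, system_prompt):
    return [m for m in messages if m.strip()], system_prompt

filters = [inject_language_rule, drop_empty]

messages, system_prompt = ["hello", "  "], "You are helpful."
for f in filters:  # each filter sees the previous filter's output
    messages, system_prompt = f(messages, system_prompt)
```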
Retry / Fallback Policy¶
Automatic retry with exponential backoff when LLM calls fail.
```python
from cognitia.retry import ExponentialBackoff
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    retry_policy=ExponentialBackoff(
        max_retries=3,    # up to 3 retries (4 total attempts)
        base_delay=1.0,   # seconds
        max_delay=60.0,   # cap
        jitter=True,      # random factor 0.5-1.5x
    ),
)
```
Delay Formula¶
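Given the fields above, the delay for zero-based attempt *n* is presumably `min(base_delay * 2^n, max_delay)`, multiplied by a random 0.5-1.5x factor when jitter is enabled. A sketch consistent with those fields (the library's exact jitter implementation may differ):

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter: bool = False) -> float:
    # delay = min(base_delay * 2^attempt, max_delay), optionally jittered
    delay = min(base_delay * 2 ** attempt, max_delay)
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # random factor 0.5-1.5x
    return delay

delays = [backoff_delay(n) for n in range(4)]  # [1.0, 2.0, 4.0, 8.0]
```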
Model Fallback Chain¶
Switch to a backup model when the primary fails:
```python
from cognitia.retry import ModelFallbackChain

chain = ModelFallbackChain(models=["gpt-4o", "claude-sonnet-4-20250514", "gemini-2.0-flash"])
next_model = chain.next_model("gpt-4o")  # "claude-sonnet-4-20250514"
```
Provider Fallback¶
Switch to an entirely different provider when the primary is down:
```python
from cognitia.retry import ProviderFallback

fb = ProviderFallback(fallback_model="openai:gpt-4o")
# Use fb.fallback_model as the target when the primary provider returns errors
```
ProviderFallback is a frozen dataclass with a single field (fallback_model: str). It is intended to be used alongside ModelFallbackChain for two-level resilience: first try alternative models within the same provider, then fail over to a different provider entirely.
RetryPolicy Protocol¶
All retry strategies implement the RetryPolicy protocol:
```python
from cognitia.retry import RetryPolicy

class MyRetryPolicy:
    def should_retry(self, error: Exception, attempt: int) -> tuple[bool, float]:
        """Return (should_retry, delay_seconds). attempt is zero-based."""
        if attempt < 2 and "rate_limit" in str(error):
            return True, 5.0
        return False, 0.0
```
The attempt parameter is zero-based (0 = first retry candidate). When should_retry returns False, the delay value is ignored.
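To see how a runtime might consume this protocol, here is an illustrative driver loop (not Cognitia's internals; `call` is any zero-argument callable):

```python
import time

def run_with_retries(call, policy):
    # `call` is a zero-argument callable; `policy` implements should_retry()
    attempt = 0
    while True:
        try:
            return call()
        except Exception as err:
            should, delay = policy.should_retry(err, attempt)
            if not should:
                raise  # policy gave up: propagate the original error
            time.sleep(delay)
            attempt += 1
```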
Data Flow¶
The complete request pipeline with all safety mechanisms:
```text
User Input
    │
    ▼
Input Filters (sequential: SystemPromptInjector → MaxTokensFilter → RagInputFilter)
    │
    ▼
Input Guardrails (parallel, asyncio.gather)
    │ fail → error event, kind="guardrail_tripwire"
    ▼ pass
LLM Call
    │ error → RetryPolicy.should_retry → retry loop or error event
    ▼ success
Output Guardrails (parallel)
    │ fail → error event, kind="guardrail_tripwire"
    ▼ pass
CostTracker.record → check_budget
    │ exceeded → budget_exceeded event (if action_on_exceed="error")
    ▼ ok
Final RuntimeEvent
```