Skip to content

Production Safety

Cognitia provides four complementary safety mechanisms for production deployments: cost budgets, guardrails, input filters, and retry/fallback policies. All are opt-in via RuntimeConfig — disabled by default for zero overhead.

Cost Budget Tracking

Track accumulated LLM costs and enforce spending limits.

from cognitia.runtime.cost import CostBudget
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    cost_budget=CostBudget(
        max_cost_usd=5.0,           # USD spending cap
        max_total_tokens=1_000_000,  # token cap (optional)
        action_on_exceed="error",    # "error" (stop) or "warn" (continue)
    ),
)

How It Works

  • ThinRuntime creates a CostTracker at startup and records usage after each LLM call
  • Costs are computed using bundled pricing.json (updated with major model releases)
  • Unknown models fall back to _default pricing — no crashes on new models
  • When exceeded: emits RuntimeEvent with kind="budget_exceeded" (error mode) or continues with warning
  • Final event includes total_cost_usd when budget tracking is active

Bundled Pricing

Model Input $/1M tokens Output $/1M tokens
claude-sonnet-4-20250514 3.00 15.00
gpt-4o 2.50 10.00
gpt-4o-mini 0.15 0.60
gemini-2.0-flash 0.10 0.40
_default 3.00 15.00

CostBudget Fields

Field Type Default Description
max_cost_usd float \| None None Maximum total cost in USD. None disables cost limit.
max_total_tokens int \| None None Maximum total tokens (input + output). None disables token limit.
action_on_exceed "error" \| "warn" "error" "error" emits a budget_exceeded event and stops. "warn" continues with warning status.

Programmatic Access

from cognitia.runtime.cost import CostTracker, CostBudget, load_pricing

tracker = CostTracker(budget=budget, pricing=load_pricing())
tracker.record("gpt-4o", input_tokens=1000, output_tokens=500)

print(tracker.total_cost_usd)    # accumulated cost
print(tracker.total_tokens)      # accumulated tokens
print(tracker.check_budget())    # "ok" | "warning" | "exceeded"
tracker.reset()                  # zero all counters

Custom Pricing

Override bundled pricing by passing a custom dict[str, ModelPricing] to CostTracker:

from cognitia.runtime.cost import CostTracker, CostBudget, ModelPricing

custom_pricing = {
    "my-fine-tuned-model": ModelPricing(input_per_1m=5.0, output_per_1m=20.0),
    "_default": ModelPricing(input_per_1m=3.0, output_per_1m=15.0),
}

tracker = CostTracker(
    budget=CostBudget(max_cost_usd=10.0),
    pricing=custom_pricing,
)

ModelPricing is a frozen dataclass with two fields: input_per_1m and output_per_1m (USD per 1 million tokens). When CostTracker.record() encounters an unknown model, it falls back to the _default key. If no _default is present and the model is unknown, the call is silently ignored (no cost recorded).

The load_pricing() function loads the bundled pricing.json via importlib.resources, making it reliable inside installed packages.


Guardrails

Pre- and post-LLM content checks. Input guardrails run before the LLM call; output guardrails run after. A failed guardrail emits an error event with kind="guardrail_tripwire".

from cognitia.guardrails import (
    ContentLengthGuardrail,
    RegexGuardrail,
    CallerAllowlistGuardrail,
)
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    input_guardrails=[
        ContentLengthGuardrail(max_length=8000),
        RegexGuardrail(patterns=[r"ignore previous instructions"]),
    ],
    output_guardrails=[
        RegexGuardrail(
            patterns=[r"SECRET_\d+"],
            reason="Sensitive data leaked in response",
        ),
    ],
)

Built-in Guardrails

Guardrail Description
ContentLengthGuardrail Rejects text longer than max_length characters (default: 100,000)
RegexGuardrail Rejects text matching any of the given regex patterns
CallerAllowlistGuardrail Rejects requests from session_id not in the allowlist

Custom Guardrails

Implement the Guardrail protocol:

from cognitia.guardrails import GuardrailContext, GuardrailResult

class ToxicityGuardrail:
    async def check(self, ctx: GuardrailContext, text: str) -> GuardrailResult:
        if is_toxic(text):
            return GuardrailResult(passed=False, reason="Toxic content detected")
        return GuardrailResult(passed=True)

Execution Model

  • All guardrails run in parallel via asyncio.gather — N guardrails don't add linear latency
  • First failure stops execution and emits an error event
  • tripwire=True in GuardrailResult marks a hard, non-recoverable failure

Input Filters

Transform messages and system prompt before each LLM call. Filters are applied sequentially in list order.

from cognitia.input_filters import MaxTokensFilter, SystemPromptInjector
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    input_filters=[
        SystemPromptInjector(
            extra_text="Always reply in English.",
            position="prepend",  # or "append"
        ),
        MaxTokensFilter(max_tokens=64_000),
    ],
)

Built-in Filters

Filter Description
MaxTokensFilter Trims older messages to fit within max_tokens budget. Always preserves system prompt and the last message. Token estimation: len(text) / chars_per_token (default 4.0).
SystemPromptInjector Prepends or appends text to the system prompt.

InputFilter Protocol

All filters implement the InputFilter protocol from cognitia.input_filters:

from cognitia.input_filters import InputFilter
from cognitia.runtime.types import Message

class RedactFilter:
    async def filter(
        self, messages: list[Message], system_prompt: str
    ) -> tuple[list[Message], str]:
        cleaned = [redact_pii(m) for m in messages]
        return cleaned, system_prompt

Filters are applied sequentially in list order. Each filter receives the output of the previous one, forming a pipeline. The final (messages, system_prompt) tuple is passed to the LLM call.


Retry / Fallback Policy

Automatic retry with exponential backoff when LLM calls fail.

from cognitia.retry import ExponentialBackoff
from cognitia.runtime.types import RuntimeConfig

config = RuntimeConfig(
    runtime_name="thin",
    retry_policy=ExponentialBackoff(
        max_retries=3,       # up to 3 retries (4 total attempts)
        base_delay=1.0,      # seconds
        max_delay=60.0,      # cap
        jitter=True,         # random factor 0.5-1.5x
    ),
)

Delay Formula

delay = min(base_delay * 2^attempt, max_delay) * uniform(0.5, 1.5)

Model Fallback Chain

Switch to a backup model when the primary fails:

from cognitia.retry import ModelFallbackChain

chain = ModelFallbackChain(models=["gpt-4o", "claude-sonnet-4-20250514", "gemini-2.0-flash"])
next_model = chain.next_model("gpt-4o")  # "claude-sonnet-4-20250514"

Provider Fallback

Switch to an entirely different provider when the primary is down:

from cognitia.retry import ProviderFallback

fb = ProviderFallback(fallback_model="openai:gpt-4o")
# Use fb.fallback_model as the target when the primary provider returns errors

ProviderFallback is a frozen dataclass with a single field (fallback_model: str). It is intended to be used alongside ModelFallbackChain for two-level resilience: first try alternative models within the same provider, then fail over to a different provider entirely.

RetryPolicy Protocol

All retry strategies implement the RetryPolicy protocol:

from cognitia.retry import RetryPolicy

class MyRetryPolicy:
    def should_retry(self, error: Exception, attempt: int) -> tuple[bool, float]:
        """Return (should_retry, delay_seconds). attempt is zero-based."""
        if attempt < 2 and "rate_limit" in str(error):
            return True, 5.0
        return False, 0.0

The attempt parameter is zero-based (0 = first retry candidate). When should_retry returns False, the delay value is ignored.


Data Flow

The complete request pipeline with all safety mechanisms:

User Input
Input Filters (sequential: SystemPromptInjector → MaxTokensFilter → RagInputFilter)
Input Guardrails (parallel, asyncio.gather)
    │  fail → error event, kind="guardrail_tripwire"
    ▼  pass
LLM Call
    │  error → RetryPolicy.should_retry → retry loop or error event
    ▼  success
Output Guardrails (parallel)
    │  fail → error event, kind="guardrail_tripwire"
    ▼  pass
CostTracker.record → check_budget
    │  exceeded → budget_exceeded event (if action="error")
    ▼  ok
Final RuntimeEvent