Web Tools¶

Cognitia provides pluggable web access through two ISP-compliant protocols: search (find information) and fetch (extract page content). Mix and match providers independently.

Architecture¶

HttpxWebProvider (WebProvider)
    ├── search_provider: WebSearchProvider (pluggable)
    │   ├── DuckDuckGoSearchProvider   # no API key, 9 search engines
    │   ├── BraveSearchProvider        # BRAVE_SEARCH_API_KEY
    │   ├── TavilySearchProvider       # TAVILY_API_KEY
    │   └── SearXNGSearchProvider      # self-hosted, no limits
    │
    └── fetch_provider: WebFetchProvider (pluggable)
        ├── (default) httpx + trafilatura  # built-in, no extra deps
        ├── JinaReaderFetchProvider    # JINA_API_KEY, markdown output
        └── Crawl4AIFetchProvider      # Playwright, JS-heavy sites

Protocols¶

WebSearchProvider¶

@runtime_checkable
class WebSearchProvider(Protocol):
    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]: ...

WebFetchProvider¶

@runtime_checkable
class WebFetchProvider(Protocol):
    async def fetch(self, url: str) -> str: ...

SearchResult¶

@dataclass(frozen=True)
class SearchResult:
    title: str
    url: str
    snippet: str

Quick Start¶

Default (httpx only, no search)¶

from cognitia.tools.web_httpx import HttpxWebProvider

web = HttpxWebProvider(timeout=30)
# web.fetch(url) works (httpx + trafilatura)
# web.search(query) returns [] (no search provider)

With DuckDuckGo Search (no API key)¶

from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider

web = HttpxWebProvider(
    timeout=30,
    search_provider=DuckDuckGoSearchProvider(timeout=15),
)

results = await web.search("Python async frameworks", max_results=5)
for r in results:
    print(f"{r.title}: {r.url}")
    print(f"  {r.snippet}")

With Jina Reader Fetch¶

from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.jina import JinaReaderFetchProvider

web = HttpxWebProvider(
    fetch_provider=JinaReaderFetchProvider(api_key="jina_..."),
)

content = await web.fetch("https://example.com/article")
print(content)  # Clean markdown with tables, code, links

Full Setup (search + fetch)¶

from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.tavily import TavilySearchProvider
from cognitia.tools.web_providers.crawl4ai import Crawl4AIFetchProvider

web = HttpxWebProvider(
    timeout=30,
    search_provider=TavilySearchProvider(api_key="tvly-..."),
    fetch_provider=Crawl4AIFetchProvider(timeout=30),
)

Using the Factory¶

Create providers by name (useful for configuration-driven setup):

from cognitia.tools.web_providers.factory import create_search_provider, create_fetch_provider

search = create_search_provider("duckduckgo")
search = create_search_provider("tavily", api_key="tvly-...")
search = create_search_provider("brave", api_key="BSA...")
search = create_search_provider("searxng", base_url="https://searx.example.com")

fetch = create_fetch_provider("default")    # None (use built-in httpx)
fetch = create_fetch_provider("jina", api_key="jina_...")
fetch = create_fetch_provider("crawl4ai")

Search Providers¶

DuckDuckGo¶

Meta-search across 9 engines (Bing, Google, Brave, Yandex, Yahoo, Mojeek, Wikipedia, Grokipedia). No API key required.

from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider

provider = DuckDuckGoSearchProvider(timeout=15)

Parameter	Type	Default	Description
`timeout`	`int`	`15`	Search timeout in seconds

Install: pip install cognitia[web-duckduckgo] Dependency: ddgs Rate limit: None (but may be throttled by DuckDuckGo)

Brave Search¶

Fast, privacy-focused search with free tier (2,000 requests/month).

from cognitia.tools.web_providers.brave import BraveSearchProvider

provider = BraveSearchProvider(api_key="BSA...", timeout=15)

Parameter	Type	Default	Description
`api_key`	`str`	required	Brave Search API key (`BRAVE_SEARCH_API_KEY`)
`timeout`	`int`	`15`	Request timeout in seconds

Install: pip install cognitia[web] (uses httpx) Rate limit: 2,000 req/month (free), higher on paid plans

Tavily¶

AI-optimized search designed specifically for LLM agents. Returns pre-processed, relevant content.

from cognitia.tools.web_providers.tavily import TavilySearchProvider

provider = TavilySearchProvider(api_key="tvly-...")

Parameter	Type	Default	Description
`api_key`	`str`	required	Tavily API key (`TAVILY_API_KEY`)

Install: pip install cognitia[web-tavily] Dependency: tavily-python Rate limit: 1,000 req/month (free)

SearXNG¶

Self-hosted meta-search engine. No API keys, no rate limits, full control.

from cognitia.tools.web_providers.searxng import SearXNGSearchProvider

provider = SearXNGSearchProvider(base_url="https://searx.example.com", timeout=15)

Parameter	Type	Default	Description
`base_url`	`str`	required	URL of your SearXNG instance (`SEARXNG_URL`)
`timeout`	`int`	`15`	Request timeout in seconds

Install: pip install cognitia[web] (uses httpx) Requirements: Self-hosted SearXNG instance with JSON API enabled

Fetch Providers¶

Default (httpx + trafilatura)¶

Built-in fetch using httpx for HTTP requests and trafilatura for content extraction. Falls back to regex-based extraction if trafilatura is not installed.

# No explicit provider needed — it's the default
web = HttpxWebProvider(timeout=30)
content = await web.fetch("https://example.com")

Extracts main content, strips navigation/ads/footers (via trafilatura)
Falls back to regex tag stripping without trafilatura
Content truncated to 50,000 characters

Install: pip install cognitia[web]

Jina Reader¶

Converts any URL to clean, LLM-friendly markdown via Jina AI Reader API. Supports tables, code blocks, LaTeX, 29 languages.

from cognitia.tools.web_providers.jina import JinaReaderFetchProvider

provider = JinaReaderFetchProvider(api_key="jina_...", timeout=30)

Parameter	Type	Default	Description
`api_key`	`str`	required	Jina API key (`JINA_API_KEY`)
`timeout`	`int`	`30`	Request timeout in seconds

Install: pip install cognitia[web-jina] Free tier: 1M tokens

Crawl4AI¶

Playwright-based crawler for JavaScript-heavy sites. Renders pages in a real browser before extracting content.

from cognitia.tools.web_providers.crawl4ai import Crawl4AIFetchProvider

provider = Crawl4AIFetchProvider(timeout=30)

Parameter	Type	Default	Description
`timeout`	`int`	`30`	Crawl timeout in seconds

Install: pip install cognitia[web-crawl4ai] Dependency: crawl4ai (includes Playwright)

Best for: SPAs, dynamic content, sites that require JavaScript rendering.

Comparison¶

Search Providers¶

Provider	API Key	Free Tier	Best For
DuckDuckGo	None	Unlimited	Quick start, no setup
Brave	Required	2,000/month	Fast, privacy-focused
Tavily	Required	1,000/month	AI-optimized results
SearXNG	None	Unlimited	Full control, self-hosted

Fetch Providers¶

Provider	API Key	Best For
Default (httpx)	None	Most websites, static content
Jina Reader	Required	Clean markdown, tables, LaTeX
Crawl4AI	None	JS-heavy sites, SPAs

Using with CognitiaStack¶

from cognitia.bootstrap.stack import CognitiaStack
from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider

web = HttpxWebProvider(
    timeout=30,
    search_provider=DuckDuckGoSearchProvider(),
)

stack = CognitiaStack.create(
    prompts_dir=Path("prompts"),
    skills_dir=Path("skills"),
    project_root=Path("."),
    web_provider=web,
    # ... other config
)
# Agent now has: web_fetch(url) and web_search(query) tools

Writing a Custom Provider¶

Implement one of the protocols:

from cognitia.tools.web_protocols import WebSearchProvider, WebFetchProvider, SearchResult

class MySearchProvider:
    """Custom search provider — just implement the protocol."""

    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        # Your search logic here
        return [
            SearchResult(title="Result", url="https://...", snippet="...")
        ]

class MyFetchProvider:
    """Custom fetch provider — just implement the protocol."""

    async def fetch(self, url: str) -> str:
        # Your fetch logic here
        return "Page content as text or markdown"

Then plug it in:

web = HttpxWebProvider(
    search_provider=MySearchProvider(),
    fetch_provider=MyFetchProvider(),
)

Environment Variables¶

Variable	Provider	Description
`BRAVE_SEARCH_API_KEY`	Brave	API key for Brave Search
`TAVILY_API_KEY`	Tavily	API key for Tavily
`JINA_API_KEY`	Jina	API key for Jina Reader
`SEARXNG_URL`	SearXNG	URL of self-hosted SearXNG instance