Web Tools¶
Cognitia provides pluggable web access through two narrow protocols that follow the Interface Segregation Principle (ISP): search (find information) and fetch (extract page content). Search and fetch providers can be mixed and matched independently.
Architecture¶
HttpxWebProvider (WebProvider)
├── search_provider: WebSearchProvider (pluggable)
│   ├── DuckDuckGoSearchProvider    # no API key, 9 search engines
│   ├── BraveSearchProvider         # BRAVE_SEARCH_API_KEY
│   ├── TavilySearchProvider        # TAVILY_API_KEY
│   └── SearXNGSearchProvider       # self-hosted, no limits
│
└── fetch_provider: WebFetchProvider (pluggable)
    ├── (default) httpx + trafilatura   # built-in, no extra deps
    ├── JinaReaderFetchProvider         # JINA_API_KEY, markdown output
    └── Crawl4AIFetchProvider           # Playwright, JS-heavy sites
Protocols¶
WebSearchProvider¶
from typing import Protocol, runtime_checkable

@runtime_checkable
class WebSearchProvider(Protocol):
    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]: ...
WebFetchProvider¶
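The fetch protocol presumably mirrors the search protocol. The following is a minimal sketch, assuming the single async fetch(url) method that the custom provider example in this document implements; the actual definition in cognitia.tools.web_protocols may differ. StaticFetchProvider is a hypothetical class used only to show that any object with a matching method satisfies the protocol.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class WebFetchProvider(Protocol):
    """Assumed shape: one async method that returns page content as text."""

    async def fetch(self, url: str) -> str: ...


class StaticFetchProvider:
    """Hypothetical provider used only to demonstrate structural typing."""

    async def fetch(self, url: str) -> str:
        return f"content of {url}"


provider = StaticFetchProvider()
# runtime_checkable only verifies that the method exists, not its signature.
print(isinstance(provider, WebFetchProvider))
print(asyncio.run(provider.fetch("https://example.com")))
```

No inheritance is needed: the isinstance check passes because the class has a fetch attribute.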
SearchResult¶
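Elsewhere in this document a result is read through its title, url, and snippet attributes and constructed with those same three keywords. A minimal sketch of such a container, assuming a frozen dataclass with exactly those fields (the real class may be defined differently or carry more fields):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Assumed fields, taken from how results are used in this document."""

    title: str
    url: str
    snippet: str


result = SearchResult(
    title="Example Domain",
    url="https://example.com",
    snippet="This domain is for use in illustrative examples.",
)
print(f"{result.title}: {result.url}")
```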
Quick Start¶
Default (httpx only, no search)¶
from cognitia.tools.web_httpx import HttpxWebProvider
web = HttpxWebProvider(timeout=30)
# web.fetch(url) works (httpx + trafilatura)
# web.search(query) returns [] (no search provider)
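The empty-list behaviour noted in the comments above can be pictured as plain delegation. This is a sketch under stated assumptions, not the library's code: WebFacade and EchoSearch are hypothetical stand-ins, and only the "no provider means an empty list" rule comes from this page.

```python
import asyncio
from typing import Any, Optional


class WebFacade:
    """Hypothetical stand-in for HttpxWebProvider; not the library's code."""

    def __init__(self, search_provider: Optional[Any] = None) -> None:
        self._search_provider = search_provider

    async def search(self, query: str, max_results: int = 5) -> list:
        # No search provider configured: return an empty list instead of raising.
        if self._search_provider is None:
            return []
        return await self._search_provider.search(query, max_results=max_results)


class EchoSearch:
    """Hypothetical provider that returns the query itself as the only result."""

    async def search(self, query: str, max_results: int = 5) -> list:
        return [query][:max_results]


print(asyncio.run(WebFacade().search("python")))
print(asyncio.run(WebFacade(EchoSearch()).search("python")))
```

Callers can therefore treat an empty list as "no results" without first checking how the provider was configured.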
With DuckDuckGo Search (no API key)¶
from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider
web = HttpxWebProvider(
    timeout=30,
    search_provider=DuckDuckGoSearchProvider(timeout=15),
)
results = await web.search("Python async frameworks", max_results=5)
for r in results:
    print(f"{r.title}: {r.url}")
    print(f" {r.snippet}")
With Jina Reader Fetch¶
from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.jina import JinaReaderFetchProvider
web = HttpxWebProvider(
    fetch_provider=JinaReaderFetchProvider(api_key="jina_..."),
)
content = await web.fetch("https://example.com/article")
print(content) # Clean markdown with tables, code, links
Full Setup (search + fetch)¶
from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.tavily import TavilySearchProvider
from cognitia.tools.web_providers.crawl4ai import Crawl4AIFetchProvider
web = HttpxWebProvider(
    timeout=30,
    search_provider=TavilySearchProvider(api_key="tvly-..."),
    fetch_provider=Crawl4AIFetchProvider(timeout=30),
)
Using the Factory¶
Create providers by name (useful for configuration-driven setup):
from cognitia.tools.web_providers.factory import create_search_provider, create_fetch_provider
search = create_search_provider("duckduckgo")
search = create_search_provider("tavily", api_key="tvly-...")
search = create_search_provider("brave", api_key="BSA...")
search = create_search_provider("searxng", base_url="https://searx.example.com")
fetch = create_fetch_provider("default")  # returns None, so HttpxWebProvider uses its built-in httpx fetch
fetch = create_fetch_provider("jina", api_key="jina_...")
fetch = create_fetch_provider("crawl4ai")
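For configuration-driven setup, one option is to split each config entry into a provider name and its keyword arguments before calling the factory. The sketch below is illustrative only: search_factory_args and the "provider" key are hypothetical conventions, and the set of names is copied from the factory calls above.

```python
from typing import Any, Mapping

# Provider names taken from the factory calls above.
KNOWN_SEARCH_PROVIDERS = {"duckduckgo", "tavily", "brave", "searxng"}


def search_factory_args(config: Mapping[str, Any]) -> tuple[str, dict[str, Any]]:
    """Split a config entry into (name, keyword arguments).

    The "provider" key is a hypothetical convention for this sketch.
    """
    options = dict(config)
    name = options.pop("provider", "duckduckgo")
    if name not in KNOWN_SEARCH_PROVIDERS:
        raise ValueError(f"unknown search provider: {name!r}")
    return name, options


name, kwargs = search_factory_args(
    {"provider": "searxng", "base_url": "https://searx.example.com"}
)
print(name, kwargs)
# With the library installed, the result would be passed on as:
#   search = create_search_provider(name, **kwargs)
```

Validating the name up front gives a clear error for a typo in the config file, before any provider is constructed.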
Search Providers¶
DuckDuckGo¶
Meta-search across 9 engines (DuckDuckGo, Bing, Google, Brave, Yandex, Yahoo, Mojeek, Wikipedia, Grokipedia). No API key required.
from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider
provider = DuckDuckGoSearchProvider(timeout=15)
| Parameter | Type | Default | Description |
|---|---|---|---|
| timeout | int | 15 | Search timeout in seconds |
Install: pip install cognitia[web-duckduckgo]
Dependency: ddgs
Rate limit: None (but may be throttled by DuckDuckGo)
Brave Search¶
Fast, privacy-focused search with a free tier (2,000 requests/month).
from cognitia.tools.web_providers.brave import BraveSearchProvider
provider = BraveSearchProvider(api_key="BSA...", timeout=15)
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | Brave Search API key (BRAVE_SEARCH_API_KEY) |
| timeout | int | 15 | Request timeout in seconds |
Install: pip install cognitia[web] (uses httpx)
Rate limit: 2,000 req/month (free), higher on paid plans
Tavily¶
AI-optimized search designed specifically for LLM agents. Returns pre-processed, relevant content.
from cognitia.tools.web_providers.tavily import TavilySearchProvider
provider = TavilySearchProvider(api_key="tvly-...")
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | Tavily API key (TAVILY_API_KEY) |
Install: pip install cognitia[web-tavily]
Dependency: tavily-python
Rate limit: 1,000 req/month (free)
SearXNG¶
Self-hosted meta-search engine. No API keys, no rate limits, full control.
from cognitia.tools.web_providers.searxng import SearXNGSearchProvider
provider = SearXNGSearchProvider(base_url="https://searx.example.com", timeout=15)
| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | str | required | URL of your SearXNG instance (SEARXNG_URL) |
| timeout | int | 15 | Request timeout in seconds |
Install: pip install cognitia[web] (uses httpx)
Requirements: Self-hosted SearXNG instance with JSON API enabled
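To make the JSON API requirement concrete: a SearXNG instance answers /search?q=...&format=json only when json is listed under search.formats in its settings.yml, and typically rejects the request otherwise. The sketch below shows how such a response could be mapped onto title/url/snippet fields. It is not the provider's implementation; build_search_url and parse_results are hypothetical helpers, and the sample payload is invented for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str) -> str:
    """Build a SearXNG search URL that asks for JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def parse_results(payload: dict, max_results: int = 5) -> list[dict]:
    """Map SearXNG result entries to title/url/snippet dictionaries."""
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            # SearXNG puts the text excerpt in the "content" field.
            "snippet": item.get("content", ""),
        }
        for item in payload.get("results", [])[:max_results]
    ]


# Invented sample payload in the shape SearXNG returns.
sample = {
    "results": [
        {"title": "A", "url": "https://a.example", "content": "about A"},
        {"title": "B", "url": "https://b.example"},
    ]
}
print(build_search_url("https://searx.example.com/", "python async"))
print(parse_results(sample, max_results=1))
```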
Fetch Providers¶
Default (httpx + trafilatura)¶
Built-in fetch using httpx for HTTP requests and trafilatura for content extraction. Falls back to regex-based extraction if trafilatura is not installed.
# No explicit provider needed — it's the default
web = HttpxWebProvider(timeout=30)
content = await web.fetch("https://example.com")
- Extracts main content, strips navigation/ads/footers (via trafilatura)
- Falls back to regex tag stripping without trafilatura
- Content truncated to 50,000 characters
Install: pip install cognitia[web]
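The regex fallback and the 50,000-character limit described above could look roughly like the following. This is a sketch under stated assumptions, not the library's code: extract_text and the exact patterns are hypothetical, and regex stripping is far cruder than trafilatura because it keeps navigation and footer text.

```python
import html
import re

MAX_CHARS = 50_000  # truncation limit stated above

# Remove script/style elements together with their contents.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def extract_text(raw_html: str, max_chars: int = MAX_CHARS) -> str:
    """Strip tags with regular expressions, then truncate the result."""
    without_blocks = _SCRIPT_STYLE.sub(" ", raw_html)
    without_tags = _TAG.sub(" ", without_blocks)
    text = html.unescape(without_tags)  # turn &amp; and similar back into characters
    text = _WHITESPACE.sub(" ", text).strip()
    return text[:max_chars]


page = "<html><style>p{}</style><p>Hello &amp; <b>bye</b></p><script>x()</script></html>"
print(extract_text(page))
```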
Jina Reader¶
Converts any URL to clean, LLM-friendly markdown via the Jina AI Reader API. Supports tables, code blocks, LaTeX, and 29 languages.
from cognitia.tools.web_providers.jina import JinaReaderFetchProvider
provider = JinaReaderFetchProvider(api_key="jina_...", timeout=30)
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | Jina API key (JINA_API_KEY) |
| timeout | int | 30 | Request timeout in seconds |
Install: pip install cognitia[web-jina]
Free tier: 1M tokens
Crawl4AI¶
Playwright-based crawler for JavaScript-heavy sites. Renders pages in a real browser before extracting content.
from cognitia.tools.web_providers.crawl4ai import Crawl4AIFetchProvider
provider = Crawl4AIFetchProvider(timeout=30)
| Parameter | Type | Default | Description |
|---|---|---|---|
| timeout | int | 30 | Crawl timeout in seconds |
Install: pip install cognitia[web-crawl4ai]
Dependency: crawl4ai (includes Playwright)
Best for: SPAs, dynamic content, sites that require JavaScript rendering.
Comparison¶
Search Providers¶
| Provider | API Key | Free Tier | Best For |
|---|---|---|---|
| DuckDuckGo | None | Unlimited | Quick start, no setup |
| Brave | Required | 2,000/month | Fast, privacy-focused |
| Tavily | Required | 1,000/month | AI-optimized results |
| SearXNG | None | Unlimited | Full control, self-hosted |
Fetch Providers¶
| Provider | API Key | Best For |
|---|---|---|
| Default (httpx) | None | Most websites, static content |
| Jina Reader | Required | Clean markdown, tables, LaTeX |
| Crawl4AI | None | JS-heavy sites, SPAs |
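Because every fetch provider exposes the same fetch(url) method, the trade-offs in the table can be combined: try a cheap provider first and fall back to a heavier one only when the result looks empty. The sketch below is one possible design, not part of Cognitia; FallbackFetchProvider, the three stand-in providers, and the 200-character threshold are all hypothetical.

```python
import asyncio
from typing import Sequence


class FallbackFetchProvider:
    """Hypothetical wrapper: try providers in order until one returns enough text."""

    def __init__(self, providers: Sequence, min_chars: int = 200) -> None:
        self._providers = list(providers)
        self._min_chars = min_chars

    async def fetch(self, url: str) -> str:
        best = ""
        for provider in self._providers:
            try:
                content = await provider.fetch(url)
            except Exception:
                # A failing provider should not stop the chain.
                continue
            if len(content) >= self._min_chars:
                return content
            if len(content) > len(best):
                best = content  # remember the longest short result
        return best


class BrokenFetch:
    async def fetch(self, url: str) -> str:
        raise RuntimeError("network error")


class ThinFetch:
    async def fetch(self, url: str) -> str:
        return "Loading..."  # what a static fetch might see on a JS-only page


class RichFetch:
    async def fetch(self, url: str) -> str:
        return "rendered " * 50


chain = FallbackFetchProvider([BrokenFetch(), ThinFetch(), RichFetch()])
print(len(asyncio.run(chain.fetch("https://example.com"))))
```

In a real setup the first entry might be the default httpx fetch and the last one Crawl4AI, so the browser is started only for pages that need it.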
Using with CognitiaStack¶
from pathlib import Path

from cognitia.bootstrap.stack import CognitiaStack
from cognitia.tools.web_httpx import HttpxWebProvider
from cognitia.tools.web_providers.duckduckgo import DuckDuckGoSearchProvider
web = HttpxWebProvider(
    timeout=30,
    search_provider=DuckDuckGoSearchProvider(),
)
stack = CognitiaStack.create(
    prompts_dir=Path("prompts"),
    skills_dir=Path("skills"),
    project_root=Path("."),
    web_provider=web,
    # ... other config
)
# Agent now has: web_fetch(url) and web_search(query) tools
Writing a Custom Provider¶
Implement one of the protocols:
from cognitia.tools.web_protocols import WebSearchProvider, WebFetchProvider, SearchResult
class MySearchProvider:
    """Custom search provider: just implement the protocol."""
    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        # Your search logic here
        return [
            SearchResult(title="Result", url="https://...", snippet="...")
        ]
class MyFetchProvider:
    """Custom fetch provider: just implement the protocol."""
    async def fetch(self, url: str) -> str:
        # Your fetch logic here
        return "Page content as text or markdown"
Then plug it in:
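Custom providers are passed through the same search_provider and fetch_provider keywords used in the Quick Start. The sketch below runs without the library installed, so it uses hypothetical stand-ins: StandInWebProvider takes the place of HttpxWebProvider and a local SearchResult dataclass takes the place of the imported one. The commented line shows the assumed real call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SearchResult:  # stand-in for cognitia.tools.web_protocols.SearchResult
    title: str
    url: str
    snippet: str


class MySearchProvider:
    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        return [
            SearchResult(title=f"Result for {query}", url="https://example.com", snippet="...")
        ]


class MyFetchProvider:
    async def fetch(self, url: str) -> str:
        return f"Page content for {url}"


class StandInWebProvider:
    """Hypothetical substitute for HttpxWebProvider, accepting the same two keywords."""

    def __init__(self, search_provider, fetch_provider) -> None:
        self.search_provider = search_provider
        self.fetch_provider = fetch_provider

    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        return await self.search_provider.search(query, max_results=max_results)

    async def fetch(self, url: str) -> str:
        return await self.fetch_provider.fetch(url)


# With the library installed this would presumably be:
#   web = HttpxWebProvider(search_provider=MySearchProvider(), fetch_provider=MyFetchProvider())
web = StandInWebProvider(search_provider=MySearchProvider(), fetch_provider=MyFetchProvider())
results = asyncio.run(web.search("cognitia"))
page = asyncio.run(web.fetch("https://example.com"))
print(results[0].title, "|", page)
```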
Environment Variables¶
| Variable | Provider | Description |
|---|---|---|
| BRAVE_SEARCH_API_KEY | Brave | API key for Brave Search |
| TAVILY_API_KEY | Tavily | API key for Tavily |
| JINA_API_KEY | Jina | API key for Jina Reader |
| SEARXNG_URL | SearXNG | URL of self-hosted SearXNG instance |
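This page does not state whether the providers read these variables on their own, so a cautious approach is to read them explicitly and pass the values to the constructors. A minimal sketch, assuming the variable names from the table and the api_key / base_url keywords from the parameter tables; provider_kwargs is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Variable names from the table above, paired with the constructor keyword each one feeds.
ENV_BY_PROVIDER = {
    "brave": ("BRAVE_SEARCH_API_KEY", "api_key"),
    "tavily": ("TAVILY_API_KEY", "api_key"),
    "jina": ("JINA_API_KEY", "api_key"),
    "searxng": ("SEARXNG_URL", "base_url"),
}


def provider_kwargs(name: str, env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return constructor keywords for a provider, read from the environment."""
    env = os.environ if env is None else env
    if name not in ENV_BY_PROVIDER:
        return {}  # e.g. duckduckgo needs no credentials
    variable, keyword = ENV_BY_PROVIDER[name]
    value = env.get(variable, "").strip()
    if not value:
        raise RuntimeError(f"{variable} must be set to use the {name} provider")
    return {keyword: value}


print(provider_kwargs("searxng", {"SEARXNG_URL": "https://searx.example.com"}))
```

Failing early with the variable name in the message is usually easier to debug than an authentication error from the remote API.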