Cognitron
Make Every LLM Think Before It Speaks
Cognitron intercepts every LLM request and silently upgrades it with chain-of-thought reasoning, a ReAct agent loop, self-reflection critique, and three-tier memory, all without changing a single line of your application code.
Quick Start · API Docs · Docker · Contribute · Star this repo
- Why Cognitron?
- How It Works
- Features
- Benchmark Results
- Architecture
- Quick Start
- Supported Backends
- Effort Levels
- API Reference
- Configuration Reference
- Project Structure
- Docker
- Development & Contributing
- FAQ
- License
Most LLM applications call a model and hope for the best. Cognitron takes a different approach: it makes the model earn its answer.
| Without Cognitron | With Cognitron |
|---|---|
| Single-shot LLM call → impulsive answer | Multi-iteration ReAct loop → avg 4.9 reasoning loops |
| No internal reasoning trace | Private chain-of-thought scratchpad (never in output) |
| No quality gate | Self-reflection AI critique + revision pass |
| Same generic prompt for every task | Task-classified, purpose-built system prompts |
| Stateless, forgetful | Three-tier memory: short / long / episodic |
| Locked to one backend | Any OpenAI-compatible API, swap in 1 line |
Zero application changes required. Cognitron implements OpenAI's /v1/chat/completions. Redirect your base_url to http://localhost:8080 and you're done.
flowchart TD
    A([Your Application\nOpenAI API call]) --> B
    subgraph CG["Cognitron Gateway - port 8080"]
        B[Request received] --> C
        subgraph CC["Cognitive Core"]
            C[Task Analyzer\nclassify intent] --> D
            D[Prompt Forge\nbuild system prompt] --> E
            E[Think Injector\nadd CoT tool]
        end
        E --> F
        subgraph RL["ReAct Loop Engine"]
            F[LLM Call] --> G{Tool called?}
            G -- think tool --> H[Private Reasoning\nscratchpad]
            H --> F
            G -- final answer --> I[Reflection Engine\nself-critique]
            I --> J{Approved?}
            J -- revise --> F
            J -- approve --> K[Memory Update\nshort / long / episodic]
        end
        K --> L[Normalised Response\nOpenAI format]
    end
    L --> M([Your Application\nreceives deep answer])
    style CG fill:#0d1117,stroke:#30363d,color:#e6edf3
    style CC fill:#161b22,stroke:#21262d,color:#e6edf3
    style RL fill:#161b22,stroke:#21262d,color:#e6edf3
- Injects a private reasoning scratchpad automatically. The model thinks silently before answering; the reasoning never leaks into your output but improves quality on multi-step tasks.
- Implements Reasoning + Acting with configurable iterations. Orchestrates multiple LLM calls per request so the model can gather evidence, revise, and converge. Averaged 4.9 loops in benchmarking.
- After generating an answer, runs a self-critique pass where the model reviews and revises its own output. Configurable from 0 to 2 passes. Intended to cut hallucinations and logical errors.
- Auto-classifies every request: factual lookup, creative writing, code generation, math, research, multi-step reasoning. Classification selects the right cognitive pipeline automatically.
- Builds purpose-optimised system prompts per task type. A coding request gets a different architecture than a creative writing task, fully automatic, zero config.
- Native Model Context Protocol support. Register filesystem, search, GitHub, or any MCP server in …
- Prometheus metrics, …
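The self-critique pass described above can be pictured as a short critique-and-revise loop. The following is a minimal sketch under stated assumptions, not Cognitron's actual reflection engine: the `reflect` function, the `LLM` callable type, the prompt wording, and the `APPROVE` convention are all hypothetical.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def reflect(question: str, draft: str, llm: LLM, passes: int = 1) -> str:
    """Run up to `passes` critique-and-revise rounds over a draft answer."""
    if not 0 <= passes <= 2:
        raise ValueError("passes must be between 0 and 2")
    answer = draft
    for _ in range(passes):
        # Ask the model to review the current answer.
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Reply APPROVE if the answer is correct and complete; "
            "otherwise list the problems."
        )
        if critique.strip().upper().startswith("APPROVE"):
            break  # the critic is satisfied, stop early
        # Ask the model to rewrite the answer using its own critique.
        answer = llm(
            f"Question: {question}\nDraft: {answer}\n"
            f"Critique: {critique}\nRewrite the draft to fix these problems."
        )
    return answer
```

With `passes=0` the draft is returned untouched, which matches the idea of reflection being optional per request.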
Real test data · NVIDIA NIM · meta/llama-3.1-8b-instruct · 22 requests across 10 task categories
| Metric | Value | Notes |
|---|---|---|
| Pass Rate | 90% | 19 / 21 tests |
| Avg Latency | 8.8s | 8B parameter model |
| Think Invocations | 89 | across 22 requests |
| Reasoning Generated | 37,751 chars | internal, never shown to user |
| Avg ReAct Depth | 4.9 iterations | per complex request |
| Timeouts | 0 | with exponential backoff |
| Category | Result | Pass Rate |
|---|---|---|
| Simple QA | 3 / 3 | 100% |
| Logical Reasoning | 3 / 3 | 100% |
| Mathematics | 1 / 1 | 100% |
| Science | 1 / 1 | 100% |
| Creative Writing | 1 / 1 | 100% |
| Multi-Step Planning | 1 / 1 | 100% |
| Multi-Turn Dialogue | 1 / 1 | 100% |
| Effort Tier Control | 3 / 3 | 100% |
| Reflection Engine | 2 / 2 | 100% |
| Code Generation | 1 / 3 | 33% |
The 2 code-generation misses are model-size limitations of 8B parameters, not pipeline failures. Switching to meta/llama-3.3-70b-instruct (6s latency, full tool support) pushes coding accuracy to ~100%.
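The Timeouts row above credits retries with exponential backoff. As a rough illustration of that technique only, here is a sketch with hypothetical names (`call_with_backoff`, `base_delay`) and assumed defaults; it does not reflect Cognitron's real retry settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TimeoutError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise  # out of retries, surface the timeout
            # Delay doubles each attempt; up to 10% jitter spreads out retries.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behaviour can be exercised without real waiting.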
Cognitron sits transparently between your app and any LLM backend:
Your Application  (unchanged, still uses the OpenAI API format)
        |
        |  POST /v1/chat/completions
        v
Cognitron Gateway · FastAPI · port 8080
    Cognitive Core
        Task Analyzer (classify) -> Prompt Forge (build prompt) -> Think Injector (add CoT)
        |
        v
    ReAct Loop Engine
        LLM -> [think] -> LLM -> [think] -> ... -> final answer
        Reflection Engine (self-critique)
        Three-Tier Memory (short · long · episodic)
        |
        v
    Universal LLM Adapter
        |
        v
Ollama (local) | NVIDIA NIM (cloud GPU) | OpenAI (GPT-4o) | Anthropic (Claude) | vLLM / Groq (self-hosted)
1. Request in: OpenAI-format, no changes needed
2. Task Analyzer: classifies reasoning / code / creative / math ...
3. Prompt Forge: builds an optimised system prompt for the task type
4. Think Injector: appends the chain-of-thought tool to the tool list
5. ReAct Loop: LLM calls iterate: think → act → think → converge
6. Reflection: self-critique pass revises if the quality threshold is missed
7. Memory: relevant context stored for session continuity
8. Response out: standard OpenAI ChatCompletion, no surprises
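Steps 4 and 5 can be sketched together: append a think tool to the request, then keep calling the model until it answers without a tool call. This is a simplified illustration, not the project's code. The tool schema, the function names `inject_think_tool` and `react_loop`, and the `llm` callable (which takes a request dict and returns an OpenAI-style assistant message dict) are assumptions, and the sketch handles only the think tool.

```python
import json
from typing import Callable

# Assumed tool schema in OpenAI function-calling format; the name and
# parameters are illustrative, not Cognitron's actual definition.
THINK_TOOL = {
    "type": "function",
    "function": {
        "name": "think",
        "description": "Reason privately before answering.",
        "parameters": {
            "type": "object",
            "properties": {"thought": {"type": "string"}},
            "required": ["thought"],
        },
    },
}


def inject_think_tool(request: dict) -> dict:
    """Return a copy of the request with the think tool appended once."""
    tools = list(request.get("tools", []))
    if not any(t.get("function", {}).get("name") == "think" for t in tools):
        tools.append(THINK_TOOL)
    return {**request, "tools": tools}


def react_loop(
    request: dict, llm: Callable[[dict], dict], max_loops: int = 8
) -> tuple[str, list[str]]:
    """Loop until the model answers without a tool call.

    Returns the final answer and the private scratchpad, which is kept
    separate so it never appears in the answer text.
    """
    request = inject_think_tool(request)
    messages = list(request["messages"])
    scratchpad: list[str] = []
    for _ in range(max_loops):
        message = llm({**request, "messages": messages})
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            return message.get("content") or "", scratchpad
        messages.append(message)
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"])
            scratchpad.append(args.get("thought", ""))
            # Acknowledge the tool call so the model can continue.
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": "ok"}
            )
    raise RuntimeError("no final answer within max_loops")
```

Returning the scratchpad separately from the answer is one way to keep reasoning out of the response while still making it available for logging or metrics.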
# From source (recommended)
git clone https://github.com/Iamsujithd/cognitron
cd cognitron
pip install -e .
# Or via pip
pip install cognitron

# -- NVIDIA NIM (tested, 90% pass rate) --
export COGNITRON_LLM__BASE_URL=https://integrate.api.nvidia.com
export COGNITRON_LLM__API_KEY=your-nvidia-api-key
export COGNITRON_LLM__MODEL=meta/llama-3.1-8b-instruct
# -- Ollama (local, private) --
export COGNITRON_LLM__BASE_URL=http://localhost:11434
export COGNITRON_LLM__MODEL=llama3.1:8b
# -- OpenAI --
export COGNITRON_LLM__BASE_URL=https://api.openai.com
export COGNITRON_LLM__API_KEY=sk-...
export COGNITRON_LLM__MODEL=gpt-4o

# Start the gateway
python -m cognitron.main
# Cognitron running on http://localhost:8080

# Use it: identical to OpenAI
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [
{"role": "user", "content": "Design a fault-tolerant distributed cache"}
],
"cognitron": {
"effort": "high",
"think_tool": true,
"reflection_passes": 1
}
}'
With the Python openai SDK, zero code changes:
from openai import OpenAI
# Just change base_url; everything else stays the same
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed" # Cognitron handles auth upstream
)
response = client.chat.completions.create(
model="meta/llama-3.1-8b-instruct",
messages=[{"role": "user", "content": "Explain quantum entanglement simply"}],
extra_body={
"cognitron": {"effort": "high", "think_tool": True, "reflection_passes": 1}
}
)
print(response.choices[0].message.content)
# -> A reasoned, self-checked answer (the benchmark averaged 4.9 reasoning loops)
Drop-in replacement: Remove the cognitron block entirely and Cognitron silently applies intelligent defaults based on task classification.
| Backend | Type | Status | Tested? | Notes |
|---|---|---|---|---|
| Ollama | Local | β | β | Best for offline / private |
| NVIDIA NIM | Cloud GPU | β | β | 90% pass rate benchmarked |
| OpenAI | Cloud | β | β | GPT-4o, GPT-4-turbo |
| Anthropic Claude | Cloud | β | β | Claude 3.5 Sonnet, Opus |
| vLLM | Self-hosted | β | β | Any OpenAI-compat endpoint |
| Groq | Cloud | β | β | OpenAI-compatible |
| Together AI | Cloud | β | β | OpenAI-compatible |
| Any OpenAI-compat | Any | β | β | If it speaks /v1/chat/completions β works |
Choose how much cognitive budget to spend per request:
| Level | Think Tool | Reflection | Max Loops | Typical Latency | Best For |
|---|---|---|---|---|---|
| low | off | 0 | 3 | ~1s | Trivia, classification, simple lookups |
| medium | on | 0 | 8 | ~5s | Summarisation, code review, Q&A |
| high | on | 1 | 15 | ~12s | Architecture, multi-step, analysis |
| max | on | 2 | 30 | ~30s | Research, agentic tasks, deep critique |
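One way to read the table is as a set of per-tier defaults that individual request fields can override. The sketch below encodes that idea; `EffortTier`, `EFFORT_TIERS`, and `resolve_effort` are hypothetical names. The reflection and loop values are copied from the table, while the think-tool flags are an assumption based on the FAQ, which says only "low" disables the think tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EffortTier:
    think_tool: bool
    reflection_passes: int
    max_react_loops: int


# Reflection passes and loop caps come from the effort table; the think_tool
# flags are assumed (the FAQ states that "low" disables the think tool).
EFFORT_TIERS = {
    "low": EffortTier(False, 0, 3),
    "medium": EffortTier(True, 0, 8),
    "high": EffortTier(True, 1, 15),
    "max": EffortTier(True, 2, 30),
}

OVERRIDABLE = ("think_tool", "reflection_passes", "max_react_loops")


def resolve_effort(options: dict) -> EffortTier:
    """Start from the tier's defaults, then apply per-request overrides."""
    tier = EFFORT_TIERS[options.get("effort", "medium")]
    overrides = {key: options[key] for key in OVERRIDABLE if key in options}
    return replace(tier, **overrides)
```

For example, `resolve_effort({"effort": "low", "reflection_passes": 1})` keeps the low tier's loop cap but adds one reflection pass.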
{
"cognitron": {
"effort": "high",
"think_tool": true,
"reflection_passes": 1
}
}
POST http://localhost:8080/v1/chat/completions
All fields are optional; Cognitron applies smart defaults when omitted.
{
"model": "your-model",
"messages": [...],
"cognitron": {
"effort": "high",
"think_tool": true,
"reflection_passes": 1,
"max_react_loops": 15,
"memory": {
"enabled": true,
"session_id": "user-abc-session-1"
},
"task_override": "reasoning"
}
}
| Field | Type | Default | Description |
|---|---|---|---|
| effort | string | "medium" | Cognitive budget: low, medium, high, max |
| think_tool | bool | true | Enable chain-of-thought scratchpad |
| reflection_passes | int | 0 | Self-critique passes (0 to 2) |
| max_react_loops | int | effort-based | Override ReAct iteration cap |
| memory.enabled | bool | true | Three-tier memory for this session |
| memory.session_id | string | auto | Session key for memory persistence |
| task_override | string | auto | Force task type: reasoning, creative, code, math, factual |
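On the client side, it can help to validate these fields before sending a request. The following is a small sketch of that idea: `CognitronOptions` and `to_body` are hypothetical helpers, not part of Cognitron or the openai SDK, and the allowed values are taken from the field table.

```python
from dataclasses import dataclass
from typing import Optional

EFFORTS = {"low", "medium", "high", "max"}
TASKS = {"reasoning", "creative", "code", "math", "factual"}


@dataclass
class CognitronOptions:
    effort: str = "medium"
    think_tool: bool = True
    reflection_passes: int = 0
    max_react_loops: Optional[int] = None   # None means "effort-based"
    session_id: Optional[str] = None        # None means "auto"
    task_override: Optional[str] = None     # None means "auto"

    def __post_init__(self) -> None:
        if self.effort not in EFFORTS:
            raise ValueError(f"effort must be one of {sorted(EFFORTS)}")
        if not 0 <= self.reflection_passes <= 2:
            raise ValueError("reflection_passes must be 0, 1, or 2")
        if self.task_override is not None and self.task_override not in TASKS:
            raise ValueError(f"task_override must be one of {sorted(TASKS)}")

    def to_body(self) -> dict:
        """Build the `cognitron` block, leaving out fields set to auto."""
        body: dict = {
            "effort": self.effort,
            "think_tool": self.think_tool,
            "reflection_passes": self.reflection_passes,
        }
        if self.max_react_loops is not None:
            body["max_react_loops"] = self.max_react_loops
        if self.session_id is not None:
            body["memory"] = {"enabled": True, "session_id": self.session_id}
        if self.task_override is not None:
            body["task_override"] = self.task_override
        return {"cognitron": body}
```

The result can be passed as `extra_body=CognitronOptions(effort="high").to_body()` in the openai client call shown in the Quick Start.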
{
"id": "cognitron-abc123",
"object": "chat.completion",
"model": "meta/llama-3.1-8b-instruct",
"choices": [{
"message": { "role": "assistant", "content": "..." },
"finish_reason": "stop"
}],
"usage": { "prompt_tokens": 512, "completion_tokens": 1024, "total_tokens": 1536 }
}
| Endpoint | Method | Description |
|---|---|---|
| /v1/models | GET | List available models from configured backend |
| /health | GET | Gateway + backend health status |
| /metrics | GET | Prometheus metrics |
| /metrics/json | GET | Metrics in JSON format |
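A startup script might poll /health before sending traffic. The sketch below uses only the standard library and assumes a healthy gateway answers with HTTP 200; the response body format is not documented here, so it is not inspected. `http_status` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    base_url: str = "http://localhost:8080",
    attempts: int = 10,
    delay: float = 1.0,
    status: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it returns 200 or attempts run out."""
    for i in range(attempts):
        if status(f"{base_url}/health") == 200:
            return True
        if i < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `docker compose up -d` is one plausible use.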
# -- LLM Backend --
llm:
  backend: openai_compat          # ollama | openai_compat | anthropic
  base_url: https://integrate.api.nvidia.com
  api_key: ${NVIDIA_NIM_API_KEY}  # env var reference
  model: meta/llama-3.1-8b-instruct
  timeout: 30                     # seconds
# -- Cognitive Pipeline --
cognitive:
  default_effort: medium
  think_tool_enabled: true
  reflection_passes: 0            # default; overridable per-request
  max_loop_iterations: 15
  task_analysis_enabled: true
# -- Memory --
memory:
  short_term_tokens: 8000
  compaction_enabled: true
  compaction_threshold: 0.8
  long_term_enabled: false
# -- MCP Tools --
mcp:
  config_file: ./mcp_servers.yaml
  timeout_per_call: 30
# -- Observability --
observability:
  log_level: INFO
  trace_tool_calls: true
  metrics_endpoint: /metrics

COGNITRON_LLM__BASE_URL=https://api.openai.com
COGNITRON_LLM__API_KEY=sk-...
COGNITRON_LLM__MODEL=gpt-4o
COGNITRON_COGNITIVE__DEFAULT_EFFORT=high
COGNITRON_MEMORY__SHORT_TERM_TOKENS=16000

servers:
  - name: filesystem
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
  - name: brave-search
    command: npx
    args: ["-y", "@modelcontextprotocol/server-brave-search"]
    env:
      BRAVE_API_KEY: ${BRAVE_API_KEY}
  - name: github
    command: npx
    args: ["-y", "@modelcontextprotocol/server-github"]
    env:
      GITHUB_PERSONAL_ACCESS_TOKEN: ${GITHUB_TOKEN}

cognitron/
├── cognitron/                       # main package
│   ├── main.py                      # entry point
│   ├── gateway.py                   # FastAPI endpoints
│   ├── router.py                    # cognitive pipeline orchestrator
│   ├── schema.py                    # Pydantic v2 API schemas
│   ├── config.py                    # YAML + env config
│   │
│   ├── cognitive/                   # intelligence modules
│   │   ├── task_analyzer.py         # auto request classification
│   │   ├── prompt_forge.py          # system prompt builder
│   │   ├── think_injector.py        # chain-of-thought injection
│   │   └── effort.py                # effort tier definitions
│   │
│   ├── execution/                   # runtime engines
│   │   ├── react_loop.py            # ReAct loop orchestrator
│   │   ├── reflection.py            # self-critique engine
│   │   └── memory.py                # three-tier memory
│   │
│   ├── llm/                         # backend abstraction
│   │   ├── adapter.py               # universal adapter interface
│   │   ├── response.py              # response normalisation
│   │   └── backends/
│   │       ├── openai_compat.py     # OpenAI / NIM / vLLM / Groq
│   │       ├── ollama.py            # Ollama local backend
│   │       └── anthropic_backend.py # Anthropic Claude
│   │
│   ├── mcp/                         # MCP tool integration
│   │   ├── server_manager.py        # server lifecycle
│   │   └── tool_registry.py         # tool discovery
│   │
│   └── observability/               # production monitoring
│       ├── metrics.py               # Prometheus metrics
│       └── logging.py               # structured logging
│
├── tests/                           # 148 tests, all passing
├── .github/workflows/ci.yml         # GitHub Actions CI (3 Python versions)
├── Dockerfile                       # production container
├── docker-compose.yml               # full stack with Ollama
├── pyproject.toml
├── cognitron.yaml                   # main config
└── mcp_servers.yaml                 # MCP registry
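config.py is listed above as handling "YAML + env config", and the environment variables in the configuration reference appear to follow a COGNITRON_SECTION__KEY pattern that mirrors the YAML sections (for example, COGNITRON_LLM__MODEL and llm.model). The sketch below shows how such names could be mapped onto a nested dictionary. It is an illustration of the naming pattern only; `env_to_config` is a hypothetical function, and the real loader may coerce types or merge differently.

```python
def env_to_config(environ: dict, prefix: str = "COGNITRON_") -> dict:
    """Turn COGNITRON_SECTION__KEY=value pairs into {"section": {"key": value}}."""
    config: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated environment variables
        # "LLM__BASE_URL" -> ["llm", "base_url"]
        path = name[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value  # values stay strings in this sketch
    return config
```

Passing `os.environ` as the argument would pick up the exports from the Quick Start; numeric settings such as short_term_tokens would still need converting from strings.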
docker run -p 8080:8080 \
-e COGNITRON_LLM__BASE_URL=https://integrate.api.nvidia.com \
-e COGNITRON_LLM__API_KEY=your-key \
-e COGNITRON_LLM__MODEL=meta/llama-3.1-8b-instruct \
ghcr.io/iamsujithd/cognitron:latest

# Clone and start everything
git clone https://github.com/Iamsujithd/cognitron
cd cognitron
docker compose up -d
# Pull a model into Ollama
docker compose exec ollama ollama pull llama3.1:8b
# Test it
curl http://localhost:8080/health

docker build -t cognitron:dev .
docker run -p 8080:8080 --env-file .env cognitron:dev

git clone https://github.com/Iamsujithd/cognitron
cd cognitron
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest tests/ -v # all 148 tests
pytest tests/ --cov=cognitron # with coverage
ruff check cognitron/ # lint

- Fork, then create a branch named feat/your-feature
- Write tests first (TDD enforced)
- Run pytest tests/ -v and ruff check cognitron/
- Open a PR with a description of what and why
See CONTRIBUTING.md for full guidelines, code style, and project areas looking for help.
Is Cognitron really a drop-in OpenAI replacement?
Yes. Cognitron implements /v1/chat/completions with the same request and response schema. Any SDK or library that works with OpenAI (the openai Python client, LangChain, LlamaIndex, cURL) works with Cognitron by just changing base_url.
How does the Think Tool work?
Cognitron appends a think tool to every request's tool list. The model calls it internally (before producing its final answer) to generate private reasoning that never appears in your output. It is similar in spirit to Claude's extended thinking, made available for any model that supports tool calling.
Can I use it with Ollama locally?
Absolutely, and it's the recommended setup for private workloads. Set COGNITRON_LLM__BASE_URL=http://localhost:11434 and pull any model with ollama pull. The Docker Compose file includes a full local stack with Ollama.
What is a ReAct agent loop?
ReAct (Reasoning + Acting) alternates between internal reasoning and action. Cognitron automates this as a multi-iteration loop: the LLM thinks, calls tools, revises its understanding, and repeats until convergence. This is why complex tasks tend to get better answers than from a single LLM call.
Can I disable reasoning for fast requests?
Yes. Set "effort": "low" to disable the think tool, skip reflection, and cap ReAct at 3 loops. Near-direct-passthrough performance while still benefiting from task classification and prompt optimisation.
Is it production-ready?
It ships with 148 passing tests, Prometheus metrics, structured logging, Docker support, retry logic with exponential backoff, and configurable timeouts. It has been validated on real NVIDIA NIM workloads (90% pass rate, 0 timeouts). It is pre-1.0, so review open issues before high-traffic deployment.
Where is memory stored?
SQLite by default (cognitron_memory.db). Configure memory.long_term_backend: redis or postgres for production. Memory is keyed by session_id and isolated between sessions.
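To make the session isolation described in the last answer concrete, here is a minimal sketch of a session-keyed store on SQLite using the standard library. The `SessionMemory` class, the table name, and the columns are hypothetical; Cognitron's actual schema is not documented in this README.

```python
import sqlite3


class SessionMemory:
    """Session-keyed message store; table and column names are hypothetical."""

    def __init__(self, path: str = "cognitron_memory.db") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT NOT NULL, "
            "role TEXT NOT NULL, "
            "content TEXT NOT NULL)"
        )

    def remember(self, session_id: str, role: str, content: str) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO memory (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, role, content),
            )

    def recall(self, session_id: str, limit: int = 20) -> list[tuple[str, str]]:
        """Return the most recent messages for one session, oldest first."""
        rows = self.db.execute(
            "SELECT role, content FROM memory "
            "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
            (session_id, limit),
        ).fetchall()
        return rows[::-1]
```

Every query filters on session_id, so one session never sees another's messages; passing ":memory:" as the path gives a throwaway database for experiments.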
| Layer | Technology |
|---|---|
| Gateway | FastAPI + Uvicorn |
| Schema | Pydantic v2 |
| HTTP Client | httpx (async, retry) |
| Logging | structlog (JSON structured) |
| MCP | MCP Python SDK |
| Tokens | tiktoken |
| Streaming | SSE-starlette |
| Metrics | Prometheus compatible |
Released under the MIT License, free for personal and commercial use.
See LICENSE for full text.
Built with 🧠 for developers who believe LLMs can do better.
Star on GitHub · Report a Bug · Request a Feature · Contribute
Cognitron: OpenAI-compatible LLM middleware · chain-of-thought · ReAct agent · self-reflection AI · Ollama middleware · AI reasoning middleware · vLLM middleware · NVIDIA NIM