Local-first AI engineering infrastructure for autonomous repository work.
Forge scans a codebase, builds repository intelligence, routes work through specialist local models, writes structured artifacts, applies patches, runs validation, repairs failures, and converges on a result without depending on cloud inference.
Forge is not a prompt wrapper. It is a systems project for local autonomous software engineering: inference lifecycle control, model routing, repository cognition, retrieval, artifacts, patch execution, validation, repair, Git safety, and an operator control plane.
Most AI coding systems hide the hard parts behind a hosted model endpoint. Forge works on the harder local systems problem:
- How do you coordinate multiple specialist coding models on one workstation GPU?
- How do you give a model enough repository context without dumping the whole repo into a prompt?
- How do you turn model output into structured, reviewable, testable patches?
- How do you prevent a model from declaring success when tests or acceptance criteria fail?
- How do you keep repository context, memory, artifacts, and inference inside infrastructure you control?
The central constraint is deliberate: one active heavyweight model runtime at a time. Forge swaps local vLLM runtimes through coder, synthesis, and judge stages so a single machine can run a multi-model engineering workflow without keeping every model resident in VRAM.
| Area | What Forge Implements |
|---|---|
| Local inference | vLLM-backed OpenAI-compatible chat completions and streaming responses |
| Runtime orchestration | Sequential runtime swap, launch, health checks, process metadata, shutdown, and process-group cleanup |
| Model routing | Role-aware model registry for coder, synthesizer, retry, architecture, and judge roles |
| Repository intelligence | Scanner, Tree-sitter AST parsing, symbol graph, AST-aware chunking, embeddings, Qdrant, BM25, hybrid retrieval |
| Context engineering | Priority-based context assembly and token-budget trimming |
| Agent workflow | Courtroom pipeline with PRIMARY_CODER, DEEPSEEK_SYNTH, and JUDGE stages |
| Patch execution | Patch parsing, file writing, sandboxing primitives, pytest execution, failure feedback, repair loops |
| Validation | Patch validation, test gates, acceptance checks, convergence decisions, execution-aware judging |
| Git safety | Status, diff, staged diff, changed files, untracked files, branch/worktree abstractions |
| Observability | Structured runtime artifacts, run history, logs, request tracing, replay and artifact summary modules |
Forge is a Python 3.12 backend with a tracked Control Center screenshot and local model assets.
| Surface | Evidence |
|---|---|
| Backend | backend/ contains 172 Python files and about 14.5k LOC |
| Tests | tests/ contains 73 test files covering runtime, retrieval, orchestration, patches, validation, and GitOps |
| API | backend/app.py, backend/api/routes/chat.py, backend/api/schemas/chat.py |
| Runtime | backend/runtime/runtime_launcher.py, runtime_shutdown.py, runtime_swap_engine.py, runtime_process.py |
| Repo intelligence | backend/repointel/service.py, scanner.py, ast/parser.py, retrieval.py, vector_store.py, graph.py |
| Benchmarks | benchmarks/repointel_benchmark.py, benchmarks/multi_agent_runtime_benchmark.py |
| Config | .env.example, backend/config/settings.py |
| Operator UI asset | frontend/screenshots/forge-control-center-1920.png |
```mermaid
flowchart TD
    Operator[Operator Objective] --> Control[Forge Control Center]
    Control --> API[FastAPI Gateway]
    API --> Runtime[Multi-Agent Runtime]
    API --> RepoIntel[Repository Intelligence Engine]
    API --> Chat[OpenAI-Compatible Chat API]
    RepoIntel --> Scanner[Repository Scanner]
    Scanner --> Parser[Tree-sitter Parser]
    Parser --> Graph[Symbol Graph]
    Parser --> Chunker[AST-Aware Chunker]
    Chunker --> Embeddings[Sentence-Transformer Embeddings]
    Embeddings --> Qdrant[Local Qdrant Vector Store]
    Chunker --> BM25[BM25 Lexical Index]
    Qdrant --> Retrieval[Hybrid Retrieval]
    BM25 --> Retrieval
    Graph --> Context[Context Builder]
    Retrieval --> Context
    Runtime --> Courtroom[Autonomous Courtroom]
    Courtroom --> Swap[Runtime Swap Engine]
    Swap --> VLLM[vLLM OpenAI-Compatible Runtime]
    Courtroom --> Artifacts[Artifact Store]
    Artifacts --> Patch[Patch Parser and Writer]
    Patch --> Validation[Tests and Validation]
    Validation --> Repair[Repair Loop]
    Repair --> Courtroom
    Validation --> Git[Git Diff and Review]
```
Design principles:
- Runtime lifecycle is separate from cognition.
- Repository intelligence is built before code generation.
- Context is assembled under a budget, not blindly appended.
- Model output is persisted as artifacts before it becomes code.
- Tests and acceptance gates drive convergence.
- Git state remains inspectable before commit or rollback.
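The "assembled under a budget" principle can be illustrated with a minimal sketch. `ContextItem`, `estimate_tokens`, and `assemble_context` are hypothetical names for illustration, not Forge's actual context builder, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    """Hypothetical unit of context; not a Forge class."""
    name: str
    text: str
    priority: int  # lower number = more important


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real system would
    # count tokens with the target model's tokenizer instead.
    return max(1, len(text) // 4)


def assemble_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the highest-priority items that fit within the token budget."""
    selected: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost > budget:
            continue  # skip this item; a smaller, lower-priority one may still fit
        selected.append(item)
        used += cost
    return selected
```

Skipping an oversized item rather than stopping at it is one possible policy; a stricter builder could stop at the first item that does not fit so that priority order is never bypassed.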
Forge is organized as cooperating subsystems rather than a single agent loop.
| Subsystem | Primary Files | Responsibility |
|---|---|---|
| API gateway | `backend/app.py`, `backend/api/routes/chat.py` | FastAPI app, health checks, OpenAI-compatible chat route |
| LLM service | `backend/llm/service.py`, `engine.py`, `prompting.py`, `decoding.py` | Prompt rendering, vLLM generation, streaming, usage accounting |
| Model registry/router | `backend/llm/registry.py`, `router.py` | Role-to-model registration and generation dispatch |
| Runtime lifecycle | `backend/runtime/runtime_*` | Launch, health, swap, shutdown, runtime metadata |
| Courtroom runtime | `backend/runtime/autonomous_courtroom.py`, `convergence_loop.py` | Multi-role reasoning and convergence |
| Repository intelligence | `backend/repointel/*` | Scan, parse, chunk, embed, retrieve, plan |
| Patch and execution | `backend/runtime/patch_*`, `execution_*`, `pytest_runner.py` | Parse model output, write files, run commands, classify failures |
| Artifacts and replay | `backend/runtime/artifact_*`, `replay_context.py` | Persist, load, summarize, compress, replay model artifacts |
| Git operations | `backend/runtime/gitops.py`, `git_diff.py`, `worktrees.py` | Status, diffs, staging, worktree/repository safety |
Forge can assign different local models to different engineering responsibilities instead of forcing one model to act as coder, reviewer, repair engine, and judge.
```mermaid
flowchart LR
    Objective[Objective] --> Router[Model Router]
    Router --> Coder[primary_coder]
    Router --> Synth[repo_synthesizer]
    Router --> Retry[retry_engine]
    Router --> Architect[architecture_coder]
    Router --> Judge[judge]
    Coder --> ArtifactA[Implementation Artifact]
    Synth --> ArtifactB[Risk and Design Critique]
    Retry --> ArtifactC[Repair Proposal]
    Architect --> ArtifactD[Architecture-Aware Patch]
    Judge --> Verdict[Acceptance Verdict]
```
The canonical courtroom path currently uses:
- PRIMARY_CODER: implementation artifact generation.
- DEEPSEEK_SYNTH: architecture and risk critique.
- JUDGE: convergence and acceptance decision.
Forge uses three local model slots because the autonomous coding loop needs role separation without turning the runtime into an unbounded model zoo. The names below are the served model names passed to the local OpenAI-compatible vLLM server in backend/runtime/autonomous_courtroom.py.
| Courtroom role | Served model name | Base model | Local path | Why this model is used |
|---|---|---|---|---|
| PRIMARY_CODER | qwen-primary | Qwen/Qwen3.5-35B-A3B | models/qwen-primary | Primary implementation model. It is the default code-writing model because Qwen3.5 provides strong long-context reasoning, tool-use behavior, and coding capability for repository-scale edits. |
| DEEPSEEK_SYNTH | deepseek-synth | deepseek-ai/deepseek-coder-33b-instruct | models/deepseek-synth | Independent synthesis and critique model. It gives the loop a different coder-family prior for architectural review, failure analysis, and challenging the primary implementation before a verdict is made. |
| JUDGE | qwen-judge | Qwen/QwQ-32B | models/qwen-judge | Dedicated reasoning judge. It keeps final acceptance separate from the implementation model, reducing self-approval bias when evaluating tests, risks, and convergence. |
Forge limits the default runtime to these three models because each role maps to a concrete engineering responsibility: generate, critique, and decide. Adding more models to the default loop would increase swap time, artifact volume, and validation latency without improving the core control loop, unless the new model owns a distinct production responsibility.
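As an illustration of the role-to-model mapping in the table above, here is a minimal registry sketch. The `Role`, `ModelSlot`, `REGISTRY`, and `resolve` names are hypothetical and are not taken from `backend/llm/registry.py`; only the served names, base models, and paths come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY_CODER = "PRIMARY_CODER"
    DEEPSEEK_SYNTH = "DEEPSEEK_SYNTH"
    JUDGE = "JUDGE"


@dataclass(frozen=True)
class ModelSlot:
    """Hypothetical record for one courtroom model slot."""
    served_name: str
    base_model: str
    local_path: str


# Values copied from the courtroom role table; the structure is an assumption.
REGISTRY: dict[Role, ModelSlot] = {
    Role.PRIMARY_CODER: ModelSlot(
        "qwen-primary", "Qwen/Qwen3.5-35B-A3B", "models/qwen-primary"
    ),
    Role.DEEPSEEK_SYNTH: ModelSlot(
        "deepseek-synth",
        "deepseek-ai/deepseek-coder-33b-instruct",
        "models/deepseek-synth",
    ),
    Role.JUDGE: ModelSlot("qwen-judge", "Qwen/QwQ-32B", "models/qwen-judge"),
}


def resolve(role: Role) -> ModelSlot:
    """Return the model slot for a role, failing loudly if none is registered."""
    try:
        return REGISTRY[role]
    except KeyError:
        raise LookupError(f"no model registered for role {role.name}") from None
```

Keeping the mapping in one table-like structure means a fourth role has to be added explicitly, which matches the intent of keeping the default loop small.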
Forge's runtime strategy is built for a workstation constraint: multiple large models, one active GPU runtime.
```mermaid
sequenceDiagram
    autonumber
    participant Run as Autonomous Run
    participant Swap as Runtime Swap Engine
    participant Launch as Runtime Launcher
    participant Health as Runtime Health
    participant VLLM as vLLM Server
    participant GPU as Single GPU
    Run->>Swap: activate PRIMARY_CODER
    Swap->>Launch: launch qwen-primary
    Launch->>GPU: allocate VRAM
    Health->>VLLM: wait for readiness
    Run->>VLLM: chat completion
    VLLM-->>Run: coder artifact
    Run->>Swap: shutdown active runtime
    Swap->>GPU: release VRAM
    Run->>Swap: activate DEEPSEEK_SYNTH
    Swap->>Launch: launch deepseek-synth
    Health->>VLLM: wait for readiness
    Run->>VLLM: critique artifact
    Run->>Swap: shutdown active runtime
    Run->>Swap: activate JUDGE
    Swap->>Launch: launch qwen-judge
    Health->>VLLM: wait for readiness
    Run->>VLLM: verdict artifact
    Run->>Swap: shutdown active runtime
```
Runtime primitives:
- RuntimeProcess: role, model path, served model name, port, PID, PGID, launch state.
- RuntimeLauncher: starts a vLLM OpenAI-compatible server and writes role-specific logs.
- RuntimeShutdown: terminates the whole process group so child CUDA workers are cleaned up.
- RuntimeSwapEngine: guarantees shutdown-before-launch and preserves swap history.
- LocalInference: calls the active local OpenAI-compatible endpoint and normalizes model output.
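The shutdown-before-launch guarantee can be sketched in a few lines. `SwapEngineSketch` is a hypothetical illustration, not the `RuntimeSwapEngine` implementation: the `launch` and `shutdown` callables stand in for the real launcher and shutdown primitives, and the history format is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SwapEngineSketch:
    """Hypothetical stand-in that keeps at most one runtime active."""
    launch: Callable[[str], None]    # stand-in for starting a vLLM server
    shutdown: Callable[[str], None]  # stand-in for process-group shutdown
    active: Optional[str] = None
    history: list[tuple[str, str]] = field(default_factory=list)

    def activate(self, served_name: str) -> None:
        """Make `served_name` the only active runtime."""
        if self.active == served_name:
            return  # already resident; avoid a pointless reload
        self.release()  # always free the GPU before launching the next model
        self.launch(served_name)
        self.active = served_name
        self.history.append(("launch", served_name))

    def release(self) -> None:
        """Shut down the active runtime, if any."""
        if self.active is None:
            return
        self.shutdown(self.active)
        self.history.append(("shutdown", self.active))
        self.active = None
```

Because `activate` calls `release` before `launch`, two heavyweight runtimes never overlap, and the recorded history makes each swap auditable after a run.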
```mermaid
flowchart TD
    A[Create Objective] --> B[Bind Repository]
    B --> C[Scan Repository]
    C --> D[Parse AST and Build Symbol Graph]
    D --> E[Index Chunks in Qdrant and BM25]
    E --> F[Build Context Package]
    F --> G[Plan Execution]
    G --> H[Run PRIMARY_CODER]
    H --> I[Persist Artifact]
    I --> J[Run DEEPSEEK_SYNTH]
    J --> K[Persist Critique]
    K --> L[Run JUDGE]
    L --> M{Accepted?}
    M -- no --> N[Repair or Refine Objective]
    N --> H
    M -- yes --> O[Parse Patch]
    O --> P[Write Files]
    P --> Q[Run Tests]
    Q --> R{Validation Passes?}
    R -- no --> N
    R -- yes --> S[Show Diff and Artifacts]
```
Forge builds local repository intelligence before prompting the coding model.
```mermaid
flowchart LR
    Repo[Repository Root] --> Scan[Scanner]
    Scan --> Ignore[Gitignore and Ignore Rules]
    Ignore --> Files[RepoFile Manifest]
    Files --> AST[Tree-sitter AST]
    AST --> Symbols[Symbol Extraction]
    AST --> Chunks[AST-Aware Chunks]
    Symbols --> Graph[Symbol Graph]
    Chunks --> Embed[Embedding Service]
    Embed --> Vector[Qdrant Vector Store]
    Chunks --> Lexical[BM25 Index]
    Vector --> Hybrid[Hybrid Retrieval]
    Lexical --> Hybrid
    Graph --> Context[Context Builder]
    Hybrid --> Context
    Context --> Plan[Planning Layer]
```
Implemented repository intelligence components:
- RepositoryScanner: file discovery, ignore handling, incremental manifest.
- TreeSitterAstEngine: Python, JavaScript, TypeScript, Go, Rust parser support through Tree-sitter packages.
- SymbolGraphEngine: file and symbol relationships.
- AstAwareChunker: symbol-aware context chunks.
- EmbeddingService: local sentence-transformer embeddings.
- QdrantVectorStore: local persisted vector collections.
- BM25Index: lexical search for exact code terms.
- HybridRetrievalEngine: vector + lexical + overlap reranking.
- ContextBuilder and PlanningLayer: context packaging and execution planning.
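To show how a vector ranking and a lexical ranking can be merged into one result list, the sketch below uses reciprocal rank fusion. This is a commonly used fusion technique swapped in for illustration; it is not necessarily what `HybridRetrievalEngine` does, and the function name and the `k = 60` constant are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so chunks that rank
    well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids from a vector search and a BM25 search.
vector_hits = ["a", "b", "c"]
lexical_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

Rank-based fusion avoids having to normalize cosine similarities against BM25 scores, which live on different scales.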
```mermaid
flowchart TD
    Objective[Objective] --> Context[Repository Context]
    Context --> CoderPrompt[Coder Prompt]
    CoderPrompt --> Coder[PRIMARY_CODER]
    Coder --> CoderArtifact[Coder Artifact]
    CoderArtifact --> SynthPrompt[Synthesis Prompt]
    SynthPrompt --> Synth[DEEPSEEK_SYNTH]
    Synth --> Critique[Critique Artifact]
    CoderArtifact --> JudgePrompt[Judge Prompt]
    Critique --> JudgePrompt
    Tests[Test and Repair History] --> JudgePrompt
    JudgePrompt --> Judge[JUDGE]
    Judge --> Verdict{Converged?}
    Verdict -- no --> Refine[Refined Objective]
    Refine --> CoderPrompt
    Verdict -- yes --> Patch[Patch Parser and Writer]
    Patch --> Validation[Validation Gates]
```
Each role produces inspectable artifacts. Forge can replay, summarize, compress, merge, and query artifacts instead of relying on hidden chain state.
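The coder, synthesis, and judge cycle can be reduced to a small control loop. This is a sketch under stated assumptions: `run_courtroom` and `Verdict` are hypothetical names, the three roles are plain callables rather than model calls, and artifacts are kept in memory instead of being persisted to disk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical judge output."""
    converged: bool
    feedback: str = ""


def run_courtroom(
    objective: str,
    coder: Callable[[str], str],
    synth: Callable[[str], str],
    judge: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> tuple[bool, list[dict]]:
    """Run coder -> synth -> judge until the judge accepts or rounds run out."""
    artifacts: list[dict] = []
    current = objective
    for round_no in range(1, max_rounds + 1):
        patch = coder(current)
        critique = synth(patch)
        verdict = judge(patch, critique)
        # Record every round so the run can be inspected afterwards.
        artifacts.append(
            {
                "round": round_no,
                "patch": patch,
                "critique": critique,
                "converged": verdict.converged,
            }
        )
        if verdict.converged:
            return True, artifacts
        # Refine the objective with the judge's feedback for the next round.
        current = f"{objective}\n\nJudge feedback: {verdict.feedback}"
    return False, artifacts
```

The bounded `max_rounds` keeps a run from looping forever when the judge never accepts; the caller decides what to do with a non-converged result.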
Forge exposes a local chat completion route:
`POST /v1/chat/completions`

Supported request fields include:

- model
- request_id
- agent_id
- messages
- temperature
- max_tokens
- top_p
- stream
The API validates request shape with Pydantic and supports both JSON responses and SSE streaming.
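A request with these fields can also be assembled from Python using only the standard library. `build_chat_request` is a hypothetical helper, and the base URL and model name are placeholders; the sketch only builds the request object and does not send it.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    prompt: str,
    temperature: float = 0.2,
    stream: bool = False,
) -> urllib.request.Request:
    """Build a POST request for the local chat completion route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; adjust to the locally running server and model.
request = build_chat_request(
    "http://localhost:8000", "deepseek-coder", "Explain this repository"
)
```

Passing the request to `urllib.request.urlopen(request)` would send it. Because the API rejects unexpected fields, the payload should carry only the supported request fields listed above.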
Example:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "model": "deepseek-coder",
    "messages": [{"role": "user", "content": "Explain this repository"}],
    "temperature": 0.2
  }'
```

Current supported path: local workstation.
Requirements:
- Python 3.12.
- CUDA-capable machine for vLLM-backed local inference.
- Local model directories or Hugging Face model access.
- Enough disk for model weights and local vector indexes.
Install:
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
```

Run API:

```bash
uvicorn backend.app:app --host 0.0.0.0 --port 8000
```

Run tests:

```bash
pytest -q
```

Run repository intelligence benchmark:

```bash
python benchmarks/repointel_benchmark.py /path/to/repo "authentication flow"
```

Run multi-agent runtime benchmark:

```bash
python benchmarks/multi_agent_runtime_benchmark.py --task-count 10
```

The current repository does not include Docker, Compose, or Kubernetes deployment assets. Those are roadmap items, not current README claims.
Forge is not presented as a managed SaaS. It is a local infrastructure codebase with production-oriented primitives:
- Explicit runtime process metadata and lifecycle state.
- Process-group shutdown for vLLM child workers.
- Pydantic API validation with forbidden extra fields.
- Structured request tracing and usage accounting.
- Local Qdrant persistence for retrieval state.
- Repository-scoped ignore rules and incremental index state.
- Patch validation before writing.
- Pytest execution inside a controlled working directory.
- Git status, diff, staged diff, changed files, and untracked files APIs.
- Artifact persistence, replay, compression, and summary modules.
- Convergence, retry, recovery, and repair abstractions.
- Broad test suite across runtime, repository intelligence, orchestration, patches, validation, artifacts, and GitOps.
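Process-group shutdown, listed above, can be sketched with the standard library on POSIX systems. `launch_in_own_group` and `shutdown_group` are hypothetical helpers rather than Forge's `RuntimeLauncher` and `RuntimeShutdown`, and the grace period is an assumed value.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(argv: list[str]) -> subprocess.Popen:
    """Start a child as the leader of a new session and process group (POSIX)."""
    return subprocess.Popen(argv, start_new_session=True)


def shutdown_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Terminate the child's whole process group and return its exit code."""
    if proc.poll() is not None:
        return proc.returncode  # already exited; nothing to signal
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)  # polite request to the entire group
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # force-kill anything still running
        return proc.wait()


if __name__ == "__main__":
    # A sleeping Python process stands in for a model server here.
    child = launch_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", shutdown_group(child))
```

Signalling the group instead of a single PID matters when a server forks worker processes: killing only the parent can leave workers alive and still holding GPU memory.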
Forge is designed around controllable local performance rather than hosted API throughput.
| Characteristic | Implementation |
|---|---|
| GPU memory pressure | Sequential one-runtime-at-a-time autoswap |
| Inference throughput | vLLM serving with prefix caching and configurable max sequences |
| Context window pressure | Priority-based context budget manager |
| Retrieval latency | Local Qdrant plus in-process BM25 index |
| Repository indexing | Incremental manifest, AST parsing, chunking, embeddings, vector upsert |
| Runtime observability | Runtime logs, PID/PGID metadata, swap history, request traces |
| Benchmarking | Separate scripts for repository intelligence and multi-agent runtime throughput |
Exact throughput depends on GPU, model size, quantization, context length, and repository size. The repo includes benchmark scripts so measurements can be produced on the target hardware instead of guessed in documentation.
Forge's security posture is local-first and review-first:
- Repository content, embeddings, vector store, artifacts, and model calls can stay on local infrastructure.
- No cloud model provider is required by the core local runtime path.
- API schemas reject unexpected fields.
- Runtime shutdown targets process groups to reduce orphan worker risk.
- Patch outputs are parsed and validated before application.
- Test and validation gates feed convergence decisions.
- Git diffs remain visible before commit.
- Worktree and workspace abstractions are present for repository isolation.
This is not a substitute for sandboxing untrusted code. Treat model-generated code and test execution as privileged local operations unless additional isolation is added.
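One check that fits the "validated before application" step is confirming that every patch target stays inside the repository root. The sketch below is an assumption about what such a check could look like; `is_safe_target` is a hypothetical helper, not Forge's patch validator, and it does not replace sandboxing.

```python
from pathlib import Path


def is_safe_target(repo_root: Path, relative_path: str) -> bool:
    """Return True only if the path resolves to a location inside repo_root."""
    root = repo_root.resolve()
    # Joining an absolute path discards the root, and ".." segments are
    # collapsed by resolve(), so both escape attempts end up outside the root.
    candidate = (root / relative_path).resolve()
    return candidate != root and candidate.is_relative_to(root)
```

A patch writer could call this for each file in a parsed patch and reject the whole patch if any target fails, before any file is touched.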
| Layer | Stack |
|---|---|
| API | FastAPI, Pydantic, Uvicorn-compatible ASGI |
| Inference | vLLM, OpenAI-compatible chat completions, transformers tokenizer |
| Runtime | subprocess process groups, runtime metadata, local HTTP inference |
| Retrieval | Qdrant local store, sentence-transformers, BM25 |
| Parsing | Tree-sitter for Python, JavaScript, TypeScript, Go, Rust |
| Agent runtime | Multi-agent orchestration, courtroom roles, convergence loops |
| Validation | pytest, execution policy, patch validation, Git diffs |
| Persistence | filesystem artifacts, local .forge state, local model directories |
| Testing | pytest, pytest-asyncio |
Near term:
- Stronger runtime health checks around vLLM readiness and model registry validation.
- More complete benchmark reporting with hardware profiles.
- Improved patch explainability and artifact diff views.
- Better visual validation for generated frontend work.
- Contributor guide and issue templates.
Medium term:
- Docker and Compose deployment.
- Multi-GPU scheduling.
- More robust runtime placement and port management.
- Richer repository graph queries.
- Operator approval gates before file writes.
- Frontend source tracked alongside screenshot assets.
Long term:
- Team-shared local memory stores.
- Distributed local runtimes.
- Enterprise local deployment patterns.
- Reproducible autonomous engineering benchmark suite.
- Hybrid local/cloud routing where policy allows it.
High-impact contribution areas:
- Runtime lifecycle reliability.
- Repository intelligence and retrieval quality.
- Patch parsing and validation.
- Test coverage for failure and repair loops.
- Benchmarking with reproducible hardware profiles.
- Operator UI source and interaction design.
- Documentation for local model setup.
Development loop:
```bash
pip install -e ".[dev]"
pytest -q
python scripts/smoke_test_repo_intel.py
python scripts/smoke_test_multi_agent_runtime.py
```

Before opening a PR, include:
- The problem being solved.
- The subsystem touched.
- Tests or benchmark commands run.
- Any model/runtime assumptions.
- Any repository safety implications.
Useful autonomous software engineering is not one model call. It is infrastructure:
- model serving
- runtime lifecycle control
- repository understanding
- context retrieval
- structured artifacts
- validation gates
- repair loops
- Git safety
- operator visibility
Forge is an implementation of that thesis for local AI engineering.
