A production-grade Agentic AI orchestrator built with .NET 10 and Semantic Kernel 1.74 — featuring a full RAG pipeline, autonomous reasoning loops, stateful multi-turn chat, an embedded React UI, enterprise security, and cloud-ready deployment infrastructure.
StackPilot is a fully engineered AI backend demonstrating senior-level system design across the entire AI application stack: from raw text ingestion through hybrid retrieval, autonomous tool-using agents, stateful conversation memory, observability, and enterprise governance — all served through an embedded React web UI that ships inside the same single binary.
It covers every layer a production AI system needs, not just "call an LLM and return the answer."
- Go to Releases and download the zip for your OS
- Unzip it
- Open `.env` in any text editor and set your OpenAI API key:

  ```
  OpenAI__ApiKey=sk-your-key-here
  OpenAI__ModelId=gpt-4o-mini
  OpenAI__EmbeddingModelId=text-embedding-3-small
  ```

- Double-click `StackPilot.Api.exe` (Windows) or run `./StackPilot.Api` (Linux/macOS)
- Open http://localhost:5050; the UI loads automatically
Windows SmartScreen warning: Click More info → Run anyway. This is expected for unsigned open-source binaries.
```mermaid
graph TB
    User([User])
    subgraph UI [Embedded React UI - served from wwwroot]
        Chat[Chat Tab]
        Playground[Playground Tab]
        Logs[Log Ingestion Tab]
        AgentTab[Agent Tab]
    end
    subgraph API [ASP.NET Core Minimal API]
        Auth[JWT Auth + RBAC]
        Guard[Guardrail Service]
        PII[PII Masking]
        Tenant[Tenant Middleware]
    end
    subgraph RAG [RAG Pipeline]
        direction TB
        Ingest[Ingestion - SK TextChunker]
        Embed[Embedding - text-embedding-3-small]
        Store[Vector Store - IVectorStore]
        Hybrid[Hybrid Search - Vector + Keyword + RRF]
        Rerank[Reranker]
        Threshold[Score Threshold]
        Compress[Context Compressor]
        Cache[Semantic Cache]
        RagSvc[RagService]
    end
    subgraph Agent [Agentic Layer]
        direction TB
        Plugins[SK Plugins - StackPilot / LogSearch / GitHub]
        AgentSvc[AgentService - RunAsync + SolveAsync]
        Loop[Think-Act-Observe Loop - MaxIterations=5]
    end
    subgraph Memory [Stateful Memory]
        direction TB
        ChatHist[Chat History - IChatHistoryStore]
        Window[Sliding Window - MaxMessages=10]
        Summarise[Conversation Summariser]
        SemMem[Semantic Memory - Vector-stored summaries]
    end
    subgraph Async [Async Infrastructure]
        direction TB
        Queue[IJobQueue - Channel / Service Bus]
        Worker[IngestionWorker - BackgroundService]
        Status[Job Status API]
    end
    subgraph Obs [Observability]
        direction TB
        OTel[OpenTelemetry Tracing]
        Latency[Latency Tracker]
        Budget[Token Budget]
        Health[Health Checks]
        Audit[Audit Log]
    end
    LLM([OpenAI - gpt-4o / gpt-4o-mini])
    User --> UI --> API
    Auth --> Guard --> Tenant
    Tenant --> RAG
    Tenant --> Agent
    Tenant --> Memory
    PII --> Store
    Ingest --> Embed --> Store
    Store --> Hybrid --> Rerank --> Threshold --> Compress --> RagSvc
    Cache --> RagSvc
    RagSvc --> LLM
    Plugins --> AgentSvc --> Loop --> LLM
    ChatHist --> Window --> Summarise --> SemMem
    Memory --> LLM
    Queue --> Worker --> Store
    Worker --> Status
    RagSvc --> Latency
    RagSvc --> Budget
    AgentSvc --> Audit
    OTel --> Health
```
- Text Chunking via SK `TextChunker` with configurable token size and overlap (`IngestionOptions`)
- Embeddings Pipeline using OpenAI `text-embedding-3-small` via `IEmbeddingGenerator<string, Embedding<float>>` with 96-chunk batching to stay within OpenAI's 300k token/request limit
- Hybrid Search combining cosine vector similarity + TF keyword scoring fused with Reciprocal Rank Fusion (RRF, k=60)
- Score Thresholding: configurable quality gate rejects low-confidence chunks before prompting
- Prompt Optimisation: 6-rule anti-hallucination system instruction, forbidden phrases, fully configurable
- Response Caching: `IResponseCache` / `MemoryResponseCache`, Redis-swap-ready
- OpenAPI + Scalar UI at `/scalar/v1`: all endpoints typed, tagged, and described
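The Reciprocal Rank Fusion step named in the hybrid search bullet can be illustrated with a short TypeScript sketch. This is not StackPilot's `HybridSearchService`; the function name `fuseWithRrf` and the chunk IDs are hypothetical, and only `k = 60` is taken from the list above. Each chunk scores the sum of `1 / (k + rank)` over every ranking it appears in.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
// where rank is 1-based. k = 60 matches the value quoted above.
function fuseWithRrf(rankings: string[][], k: number = 60): string[] {
  const scores: Record<string, number> = {};
  rankings.forEach((ranking) => {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  });
  // Highest fused score first; ties broken alphabetically for determinism.
  return Object.keys(scores).sort((a, b) => {
    const diff = (scores[b] || 0) - (scores[a] || 0);
    return diff !== 0 ? diff : a.localeCompare(b);
  });
}

// Hypothetical chunk IDs: one ranking from vector search, one from keyword search.
const vectorRanking = ["chunk-a", "chunk-b", "chunk-c"];
const keywordRanking = ["chunk-b", "chunk-d", "chunk-a"];
const fusedRanking = fuseWithRrf([vectorRanking, keywordRanking]);
console.log(fusedRanking); // ["chunk-b", "chunk-a", "chunk-d", "chunk-c"]
```

`chunk-b` wins because it ranks highly in both lists, even though it tops only one of them. Rewarding agreement between retrievers is the property that motivates fusing the two rankings.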
- Native Functions: `[KernelFunction]` plugins for system status, log search (live vector store), and the live GitHub API
- Automatic Tool Selection: `FunctionChoiceBehavior.Auto()` lets the LLM choose tools
- Think → Act → Observe Loop: explicit ReAct implementation with `MaxIterations` safety cap
- Full Reasoning Trace: every step (thought / tool / observation) returned in `AgentResponse`
- Stateful Multi-Turn Chat: `IChatHistoryStore`, sliding window, LLM-driven summarisation
- Semantic Memory: past conversation summaries embedded and scoped by `userId`
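To make the Think → Act → Observe loop and its `MaxIterations` cap concrete, here is a minimal TypeScript sketch. It is an illustration under stated assumptions, not the code in `AgentService.cs`: the `Planner` type stands in for the LLM, and `solve`, `TraceStep`, `PlannerDecision`, and the `searchLogs` tool are hypothetical names.

```typescript
// One entry in the reasoning trace: the thought, the tool called (if any),
// and the observation that came back.
interface TraceStep {
  thought: string;
  tool?: string;
  observation?: string;
}

// Hypothetical stand-in for the LLM: given the goal and the trace so far,
// it either picks a tool or returns a final answer.
interface PlannerDecision {
  thought: string;
  tool?: string;
  toolInput?: string;
  finalAnswer?: string;
}

type Planner = (goal: string, trace: TraceStep[]) => PlannerDecision;
type ToolSet = Record<string, (input: string) => string>;

function solve(
  goal: string,
  planner: Planner,
  tools: ToolSet,
  maxIterations: number = 5
): { answer: string; trace: TraceStep[] } {
  const trace: TraceStep[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const decision = planner(goal, trace); // Think
    if (decision.finalAnswer !== undefined) {
      trace.push({ thought: decision.thought });
      return { answer: decision.finalAnswer, trace };
    }
    const toolName = decision.tool || "";
    const tool = tools[toolName];
    const observation = tool
      ? tool(decision.toolInput || "") // Act
      : "Unknown tool: " + toolName;
    trace.push({ thought: decision.thought, tool: toolName, observation }); // Observe
  }
  // Safety cap reached without a final answer.
  return { answer: "Stopped after " + maxIterations + " iterations.", trace };
}

// Scripted planner: search once, then answer.
const demoTools: ToolSet = {
  searchLogs: (query) => "1 match for '" + query + "': ERROR voltage spike on channel 3",
};
const scriptedPlanner: Planner = (_goal, trace) =>
  trace.length === 0
    ? { thought: "I should search the logs.", tool: "searchLogs", toolInput: "ERROR" }
    : { thought: "I have enough information.", finalAnswer: "Found a voltage spike error." };

const agentRun = solve("Find errors in the logs", scriptedPlanner, demoTools);
console.log(agentRun.answer, agentRun.trace.length); // "Found a voltage spike error." 2
```

The cap matters when the planner never converges: the loop returns the partial trace instead of spending tokens indefinitely.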
- Async Ingestion Queue: `IJobQueue<T>` over `System.Threading.Channels`, `BackgroundService` worker, job status polling
- Semantic Reranker: retrieve 20 chunks, rerank to top 5 via LLM scoring (`IReranker`)
- Metadata Filtering: `SearchFilter` (TenantId, Source, AfterDate) on all search paths
- SHA-256 Deduplication: prevents storing identical chunks twice
- Context Compression: long contexts summarised before the LLM call to reduce token cost
- OpenTelemetry Tracing: AspNetCore + HTTP instrumentation, OTLP-swappable
- Per-Stage Latency Tracking: Retrieval, Reranking, LLM Inference as structured log events
- Token Budget Alerts: cost estimate per query with configurable `MaxCostPerQueryUsd`
- Semantic Cache: cosine similarity check against cached query embeddings (threshold: 0.97)
- JWT Authentication + RBAC: Bearer token auth; `Admin` / `User` / `Reader` roles
- Soft Multi-Tenancy: `X-Tenant-Id` header; all vector records tagged with `tenantId`
- Prompt Injection Guardrails: static pattern detection on all user-facing inputs
- PII Masking: email, SSN, phone, credit card regex applied before embedding
- Tamper-Resistant Audit Log: every query and agent decision recorded
- Health Checks: `/health` and `/health/detail` with per-component status
- Dockerfile + docker-compose: multi-stage .NET 10 build; API + Redis in one command
- GitHub Actions CI: build + 158 tests on every push; automatic release on every PR merge to `main`
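The semantic cache bullet describes a cosine similarity check against cached query embeddings with a 0.97 threshold. The TypeScript sketch below shows that check in isolation. It is an assumption-laden illustration rather than the contents of `SemanticCache.cs`: `lookupCachedAnswer` and `CachedAnswer` are hypothetical names, and the three-dimensional vectors are toy stand-ins for real 1536-dimension embeddings.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] || 0;
    const y = b[i] || 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CachedAnswer {
  embedding: number[];
  answer: string;
}

// Returns the best cached answer whose query embedding is at least
// `threshold` similar to the new query, or undefined on a miss.
function lookupCachedAnswer(
  entries: CachedAnswer[],
  queryEmbedding: number[],
  threshold: number = 0.97
): string | undefined {
  let bestScore = -1;
  let bestAnswer: string | undefined = undefined;
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    if (!entry) continue;
    const score = cosineSimilarity(entry.embedding, queryEmbedding);
    if (score >= threshold && score > bestScore) {
      bestScore = score;
      bestAnswer = entry.answer;
    }
  }
  return bestAnswer;
}

// Toy embeddings for illustration only.
const cacheEntries: CachedAnswer[] = [
  { embedding: [1, 0, 0], answer: "Voltage spike on channel 3." },
];
console.log(lookupCachedAnswer(cacheEntries, [0.99, 0.05, 0])); // near-duplicate query: hit
console.log(lookupCachedAnswer(cacheEntries, [0.8, 0.6, 0]));   // similarity 0.8: miss (undefined)
```

A high threshold trades hit rate for safety: a query at similarity 0.8 is treated as a different question and goes through the full pipeline.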
- Embedded React UI: Chat, Playground, Log Ingestion, and Agent tabs compiled into `wwwroot/` and served directly from the binary via `UseStaticFiles()` + SPA fallback
- Double-click to launch: reads `.env` from the directory next to the exe; Kestrel bound to port 5050 via `appsettings.json`; no terminal or environment variable setup required
- Log Ingestion tab: drag-and-drop or paste log files; immediate (sync) or background (async) ingestion with live job status polling; ingestion history persists across tab navigation
- Per-source file deletion: each ingested file is tagged with `metadata["source"]`; the trash icon on any history entry removes only that file's chunks from the vector store via `DELETE /store/by-source/{source}`
- Auto-release on PR merge: `release.yml` triggers on every push to `main`; version is `MAJOR.MINOR` (from `VERSION` file) + GitHub run number as patch; builds self-contained single-file binaries for Linux, Windows, and macOS in parallel, then publishes a GitHub Release with all three zips attached
- Chat history persistence: React Context keeps chat state alive across tab switches; fixed stale-closure bug that caused user messages to disappear from the UI
See Quick Start above. No SDK or terminal required.
Prerequisites: .NET 10 SDK, Node.js 20+, an OpenAI API key
```bash
# 1. Build the React UI
cd StackPilot.UI
npm install
npm run build   # outputs to StackPilot.Api/wwwroot/
cd ..

# 2. Configure secrets
cd StackPilot.Api
dotnet user-secrets set "OpenAI:ApiKey" "sk-..."
dotnet user-secrets set "OpenAI:ModelId" "gpt-4o-mini"
dotnet user-secrets set "OpenAI:EmbeddingModelId" "text-embedding-3-small"

# 3. Run
dotnet run
```

| URL | Purpose |
|---|---|
| http://localhost:5050 | Embedded React UI |
| http://localhost:5050/health | Health check |
| http://localhost:5050/scalar/v1 | Interactive API docs (dev only) |
| http://localhost:5050/dashboard | Vector store debug view |

```bash
export OPENAI_API_KEY=sk-...
docker-compose up --build
```

```bash
dotnet test
# 158 tests, 0 failures
```

Stateful multi-turn conversation with semantic memory. Chat history persists when switching tabs. Use New session to start fresh; change the User field to test per-user memory isolation.
Use this tab to ask questions about logs you've uploaded — the agent automatically searches your ingested content.
Upload log files (drag-and-drop or file picker), paste raw text, or mix both. Choose Immediate (sync) or Background (async) mode. The ingestion history lists every uploaded file with its chunk count and timestamp.
To remove a file from the vector store: click the 🗑️ icon on any history entry and confirm. Only that file's chunks are deleted — everything else is untouched.
To wipe everything: use the Clear Vector Store button at the bottom of the page. This also clears the ingestion history.
Recommended workflow: Clear → upload your file → ask questions in Chat or Playground.
Direct access to the RAG /ask endpoint and the /store endpoint. Good for testing retrieval quality and inspecting which source chunks were used in an answer.
Runs the autonomous Think→Act→Observe reasoning loop. The agent uses tools (vector store search, system status, GitHub API) to break down complex goals into steps. The full reasoning trace is shown after each run.
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| `POST` | `/store` | Chunk → embed → persist text (optional source tag) | None |
| `POST` | `/store/async` | Async ingestion via background queue → 202 + jobId | None |
| `POST` | `/store/deduped` | Ingest with SHA-256 deduplication | None |
| `GET` | `/store` | List all vector store records | None |
| `DELETE` | `/store` | Remove all records from the vector store | None |
| `DELETE` | `/store/by-source/{source}` | Remove all chunks tagged with a specific source | None |
| `GET` | `/jobs/{jobId}` | Poll async ingestion job status | None |
| `POST` | `/ask` | Full RAG pipeline: retrieve → rerank → compress → LLM | None |
| `POST` | `/ask/stream` | SSE streaming RAG answer | None |
| `POST` | `/search` | Hybrid vector + keyword search with metadata filter | None |
| `POST` | `/agent` | Single-shot agent with automatic tool selection | None |
| `POST` | `/agent/solve` | Autonomous Think→Act→Observe reasoning loop with trace | None |
| `POST` | `/chat` | Stateful multi-turn chat with sliding window + memory | None |
| `GET` | `/chat/{sessionId}` | Retrieve full session message history | None |
| `POST` | `/auth/token` | Issue JWT for testing (dev only) | None |
| `POST` | `/evaluate` | Run 10-question RAG accuracy test set | Admin |
| `GET` | `/audit` | Retrieve audit log entries | Admin |
| `GET` | `/health` | Liveness health check | None |
| `GET` | `/health/detail` | Detailed per-component health (JSON) | None |
| `GET` | `/dashboard` | Internal vector store debug UI | None |
```
StackPilot/
├── StackPilot.Api/
│   ├── Async/                    # Queue, Worker, Job Status
│   ├── Deployment/               # Cloud deployment guide
│   ├── Extensions/               # DI registration
│   ├── Middleware/               # Global exception handler
│   ├── Observability/            # OTel, Latency, Token Budget
│   ├── Persistence/              # SQLite chat history + audit log
│   ├── Plugins/                  # SK native function plugins (live vector store)
│   ├── Resilience/               # Polly retry + circuit breaker
│   ├── Search/                   # IReranker
│   ├── Security/                 # JWT, RBAC, Tenancy, Guardrails, PII, Audit
│   ├── Storage/                  # IVectorStore + QdrantVectorStore
│   ├── wwwroot/                  # Compiled React UI (generated by npm run build)
│   ├── Program.cs                # Minimal API endpoints
│   ├── RagService.cs             # Core RAG orchestrator
│   ├── AgentService.cs           # Agentic reasoning loop
│   ├── ChatService.cs            # Stateful multi-turn chat
│   ├── HybridSearchService.cs    # RRF fusion search
│   ├── VectorStore.cs            # In-memory vector store
│   ├── SemanticCache.cs          # Embedding-similarity cache
│   ├── appsettings.json          # Kestrel port 5050 + all config sections
│   ├── .env.example              # Template for double-click launch
│   ├── Dockerfile
│   └── docker-compose.yml
├── StackPilot.UI/                # React 18 + Vite + Tailwind CSS
│   ├── src/
│   │   ├── api/client.ts         # Typed API client
│   │   ├── components/           # Sidebar navigation
│   │   ├── context/AppState.tsx  # Shared state (chat, ingestion history)
│   │   └── pages/                # Chat, Playground, Logs, Agent
│   └── vite.config.ts            # Dev proxy → localhost:5050
├── StackPilot.Api.Tests/
│   ├── 158 tests across 24 test files
│   └── StackPilotApiFactory.cs   # WebApplicationFactory with DI stubs
├── VERSION                       # MAJOR.MINOR for auto-release versioning
└── .github/
    ├── workflows/ci.yml          # Build + test on every push
    └── workflows/release.yml     # Auto-release on every merge to main
```
| Workflow | Trigger | What it does |
|---|---|---|
| CI (`ci.yml`) | Every push to `main`, `features`, `phase-*`; every PR | `dotnet build` + `dotnet test` (158 tests) |
| Release (`release.yml`) | Every merge to `main` | Builds React UI, publishes self-contained binaries for Linux / Windows / macOS, creates a GitHub Release with all three zips |
Versioning: the VERSION file contains MAJOR.MINOR (e.g. 1.1). The GitHub run number is appended as the patch, producing tags like v1.1.42. To bump the major or minor version, edit VERSION and merge.
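The versioning rule above can be written out as a small function. This TypeScript sketch only illustrates the arithmetic of the tag; it is not the script in `release.yml`, and `buildReleaseTag` is a hypothetical name.

```typescript
// Builds a release tag from the VERSION file contents (MAJOR.MINOR) and the
// CI run number, which becomes the patch component.
function buildReleaseTag(versionFileContents: string, runNumber: number): string {
  const majorMinor = versionFileContents.trim();
  if (!/^\d+\.\d+$/.test(majorMinor)) {
    throw new Error("VERSION must contain MAJOR.MINOR, got: " + majorMinor);
  }
  if (runNumber < 0 || Math.floor(runNumber) !== runNumber) {
    throw new Error("Run number must be a non-negative integer, got: " + runNumber);
  }
  return "v" + majorMinor + "." + runNumber;
}

console.log(buildReleaseTag("1.1\n", 42)); // v1.1.42
```

Because the run number only ever increases, every merge produces a new, higher patch version without anyone editing `VERSION`.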
| Decision | Chosen | Why |
|---|---|---|
| Search strategy | Hybrid RRF (vector + keyword) | Neither alone is sufficient: keyword catches exact terms, vectors catch semantics; RRF fusion outperforms either individually |
| Plugin isolation | `ILlmService` + `IAgentService` interfaces | Decouples RAG and agent logic from Semantic Kernel, giving full test coverage with zero OpenAI dependency |
| Memory layers | Sliding window + session store + semantic memory | Mirrors human cognition: working memory (window), short-term (session), long-term (vector-embedded summaries) |
| Queue abstraction | `IJobQueue<T>` over `Channel<T>` | Azure Service Bus / RabbitMQ swap requires one new class; no changes to worker or endpoint code |
| Vector store abstraction | `IVectorStore` interface | Azure AI Search / Qdrant drop-in; all consumers remain unchanged |
| Source tagging | `metadata["source"]` on every chunk | Enables per-file deletion without tracking chunk IDs in the frontend; works for both sync and async ingestion |
| UI delivery | React SPA compiled into `wwwroot/`, served by Kestrel | No separate web server, no CORS configuration, no deploy step; the binary is the full product |
| Single-file publish | `PublishSingleFile=true` + `Environment.ProcessPath` | `AppContext.BaseDirectory` points to the temp extraction dir in single-file apps; `ProcessPath` finds the actual exe dir so `.env` and `wwwroot` are resolved correctly |
| Semantic cache threshold | 0.97 cosine similarity | Conservative: avoids returning a cached answer for a subtly different question; tunable per use case |
| Reranker default | `PassThroughReranker` | Zero token cost by default; `LlmReranker` activated by DI swap when quality > cost matters |
| Auth scope | JWT on admin endpoints only | Integration tests run without auth headers; production hardens all endpoints at the API Gateway layer |
Cost and latency are controlled at every layer — each expensive operation (embedding, retrieval, LLM call) has a cheaper path in front of it.
| Technique | Where | What it saves |
|---|---|---|
| Two-tier caching | Semantic cache (cosine ≥ 0.97) + keyword response cache, checked before retrieval | Skips retrieval + embedding + LLM entirely on a hit, so the most expensive path is avoided completely |
| Embedding batching | `EmbeddingService`, batch size 96 | One API call per 96 chunks instead of one-per-chunk; stays under OpenAI's 300k-token/request cap regardless of chunk size |
| Context compression | Triggers only when context > 2000 chars; output capped at 500 words | Cuts prompt tokens on large contexts, and only pays the compression cost when it's actually needed |
| Score thresholding | Configurable gate drops low-relevance chunks before prompting | Fewer tokens to the LLM and higher answer quality |
| Reranker is pass-through by default | `PassThroughReranker` (zero LLM cost); `LlmReranker` is an opt-in DI swap | Quality-vs-cost becomes a config decision, not a hard-coded tax |
| Token budget guard | Estimates cost per query (~4 chars/token), warns above $0.05 | Cost visibility without blocking the request |
| Sliding window + summarisation | Chat caps context at 10 messages, compresses the older half at 20 | Bounded per-turn token cost regardless of conversation length |
| SHA-256 deduplication | `/store/deduped`, 64-bit text hash | Never embeds or stores identical content twice |
| Early-exit reranker | Skips work entirely when candidate count ≤ topK | No wasted LLM call |
| Async ingestion | `Channel<T>`-backed queue + background worker | Large files return a 202 + jobId immediately instead of blocking the request |
| SSE streaming | `/ask/stream` | Time-to-first-token instead of waiting for the full answer |
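The embedding batching row comes down to splitting the chunk list into groups of 96 before calling the embeddings API. A minimal TypeScript sketch follows, assuming a hypothetical helper `toBatches`; the real `EmbeddingService` is C# and may be structured differently. At the default of 300 tokens per chunk, one full batch is roughly 96 × 300 = 28,800 tokens, well below the 300k-token request limit cited above.

```typescript
// Splits items into consecutive batches of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number = 96): T[][] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// 200 placeholder chunks need three embedding requests instead of 200.
const sampleChunks: string[] = [];
for (let i = 0; i < 200; i++) {
  sampleChunks.push("chunk " + i);
}
const chunkBatches = toBatches(sampleChunks);
console.log(chunkBatches.map((batch) => batch.length)); // [96, 96, 8]
```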
Every tunable below is bound from appsettings.json (or environment variables) unless marked hardcoded. Production overrides live in appsettings.Production.json.
| Component | Setting | Default | Notes |
|---|---|---|---|
| Kestrel | HTTP port | `5050` | Bound in `appsettings.json` so published binaries always listen here |
| Ingestion | `MaxTokensPerChunk` | `300` | SK `TextChunker` chunk size |
| Ingestion | `OverlapTokens` | `50` | Overlap between adjacent chunks |
| Embeddings | Batch size | `96` | hardcoded; OpenAI `text-embedding-3-small`, 1536 dims |
| Hybrid search | RRF constant `k` | `60` | SIGIR 2009 standard; fetches topK×2 from each retriever |
| RAG | `ScoreThreshold` | `0.0` | `0.0` = disabled; `0.75` recommended for production |
| Reranker | Fetch pool | `topK × 4` | Only when a reranker is active |
| Context compression | Trigger | `> 2000 chars` | hardcoded; output capped at 500 words |
| Response cache | `Enabled / TTL` | `false / 300s` | On in Production; Redis-swap-ready |
| Semantic cache | `SimilarityThreshold` | `0.97` | Cosine; on in Production |
| Agent | `MaxIterations` | `5` | Hard cap on the Think→Act→Observe loop |
| Agent | `MaxTokensPerStep` | `2000` | Per solve step |
| Chat | `MaxMessages` | `10` | Sliding-window context size |
| Chat | `SummarisationThreshold` | `20` | Compresses the older half into a system summary |
| Token budget | `MaxCostPerQueryUsd` | `$0.05` | Warns only, non-blocking |
| Token budget | `CostPerThousandTokens` | `$0.002` | gpt-4o-mini rate |
| Guardrails | Max input length | `4000 chars` | hardcoded; plus 13 injection patterns |
| Resilience | Retry attempts | `3` | Exponential backoff from 1s |
| Resilience | Circuit breaker | `50% over 30s, break 15s` | Min 3 requests to evaluate |
| Rate limiting | Per-IP fixed window | `60 req/min` | Returns 429 on breach |
| Request size | Max body | `10 MB` | Kestrel limit |
| JWT | `ExpiryMinutes` | `60` | HMAC-SHA256 |
| Qdrant | `VectorSize` | `1536` | Cosine distance; opt-in via `Qdrant:Endpoint` |
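The two Chat settings in the table interact: `SummarisationThreshold` bounds what is stored, and `MaxMessages` bounds what is sent on each turn. The TypeScript sketch below shows one plausible reading. The function names, the `Summariser` callback (standing in for the LLM-driven summariser), and the choice to trigger at "greater than or equal to" the threshold are all assumptions, not details confirmed by `ChatService.cs`.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical stand-in for the LLM call that summarises older messages.
type Summariser = (messages: ChatMessage[]) => string;

// Once the stored history reaches the threshold, replace the older half
// with a single system summary message.
function compactHistory(
  messages: ChatMessage[],
  summarise: Summariser,
  summarisationThreshold: number = 20
): ChatMessage[] {
  if (messages.length < summarisationThreshold) {
    return messages;
  }
  const half = Math.floor(messages.length / 2);
  const older = messages.slice(0, half);
  const recent = messages.slice(half);
  const summary: ChatMessage = { role: "system", content: summarise(older) };
  return [summary].concat(recent);
}

// Only the most recent `maxMessages` entries are sent to the model.
function contextWindow(messages: ChatMessage[], maxMessages: number = 10): ChatMessage[] {
  return messages.slice(Math.max(0, messages.length - maxMessages));
}

// 20 alternating user/assistant messages trigger one compaction.
const storedMessages: ChatMessage[] = [];
for (let i = 0; i < 20; i++) {
  storedMessages.push({ role: i % 2 === 0 ? "user" : "assistant", content: "message " + i });
}
const stubSummariser: Summariser = (older) => "Summary of " + older.length + " earlier messages";
const compacted = compactHistory(storedMessages, stubSummariser);
console.log(compacted.length, contextWindow(compacted).length); // 11 10
```

Together the two limits keep per-turn prompt size bounded no matter how long the session runs.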
```bash
# 1. Store a log file with a source tag
curl -X POST http://localhost:5050/store \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-01-15 ERROR PowerController voltage spike on channel 3", "source": "power-controller.log"}'

# 2. Ask a grounded question
curl -X POST http://localhost:5050/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What errors occurred in the power controller?", "topK": 3}'

# 3. Delete all chunks from a specific file
curl -X DELETE http://localhost:5050/store/by-source/power-controller.log

# 4. Run the autonomous reasoning agent
curl -X POST http://localhost:5050/agent/solve \
  -H "Content-Type: application/json" \
  -d '{"goal": "Check system health and search logs for any errors"}'

# 5. Start a stateful conversation
curl -X POST http://localhost:5050/chat \
  -H "Content-Type: application/json" \
  -d '{"sessionId": "demo-1", "userId": "farhad", "message": "What issues did you find in the logs?"}'
```

Documented deliberately: these are the real boundaries of the current design, each with a clear path forward (most already on the roadmap):
- In-memory defaults are single-instance. The job queue, job-status store, and semantic cache live in process memory, so they don't survive a restart and aren't shared across instances. SQLite (chat/audit) and Qdrant (vectors) cover persistence; the queue would need Azure Service Bus for true multi-instance scale-out. The `IJobQueue<T>` abstraction is already in place for that swap.
- Token estimation is approximate (`text.Length / 4`), not a real tokenizer. It's accurate enough for budget warnings, not for hard billing.
- Auth covers admin endpoints only. By design, production enforces auth at an API Gateway; the app itself gates only `/evaluate` and `/audit`.
- The ingestion worker is sequential with no per-job timeout. A very large embed can block the queue. Scale-out today means registering more workers, not automatic partitioning.
- Tenant isolation is application-level filtering, not database-enforced row security.
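The token estimation caveat is easier to judge with the arithmetic written out. The TypeScript sketch below combines the `text.Length / 4` heuristic with the documented defaults (`CostPerThousandTokens` = $0.002, `MaxCostPerQueryUsd` = $0.05). The names `estimateTokens`, `checkBudget`, and `BudgetOptions` are hypothetical, and counting prompt plus completion together is an assumption about what the real guard measures.

```typescript
interface BudgetOptions {
  costPerThousandTokens: number;
  maxCostPerQueryUsd: number;
}

// Defaults taken from the configuration table in this README.
const defaultBudget: BudgetOptions = { costPerThousandTokens: 0.002, maxCostPerQueryUsd: 0.05 };

// Rough heuristic: about 4 characters per token (integer division).
function estimateTokens(text: string): number {
  return Math.floor(text.length / 4);
}

// Estimates the cost of one query and flags it when it exceeds the budget.
// The flag is advisory only: nothing here blocks the request.
function checkBudget(prompt: string, completion: string, options: BudgetOptions = defaultBudget) {
  const tokens = estimateTokens(prompt) + estimateTokens(completion);
  const estimatedCostUsd = (tokens / 1000) * options.costPerThousandTokens;
  return { tokens, estimatedCostUsd, overBudget: estimatedCostUsd > options.maxCostPerQueryUsd };
}

// 8000 prompt chars + 2000 completion chars = 2500 estimated tokens, about $0.005.
const samplePrompt = new Array(8001).join("x");
const sampleCompletion = new Array(2001).join("y");
const budgetReport = checkBudget(samplePrompt, sampleCompletion);
console.log(budgetReport.tokens, budgetReport.overBudget); // 2500 false
```

With these defaults the warning only fires above 25,000 estimated tokens (roughly 100,000 characters), so the error margin of the heuristic rarely changes the outcome.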
- Phase 1: Production RAG Engine (Days 1–14)
- Phase 2: Agentic Reasoning & Memory (Days 15–35)
- Phase 3: Engineering & Scale (Days 36–70)
- SQLite persistence for ChatHistory and AuditLog
- Qdrant vector store (HTTP REST, Docker-compose included)
- Polly retry + circuit breaker for all LLM calls
- Global exception handler (no stack trace leaks)
- Rate limiting (60 req/min per IP)
- CORS policy (configurable origins)
- Request size limits (10 MB max body)
- Dev-only endpoints gated behind `IsDevelopment()`
- Embedded React UI (Chat, Playground, Logs, Agent)
- Double-click to launch: `.env` next to exe, no terminal needed
- Log ingestion tab with drag-and-drop, sync/async modes, per-file deletion
- Auto-release on PR merge to `main` (Linux / Windows / macOS binaries)
- Swap `InMemoryJobQueue` → Azure Service Bus (for multi-instance deployments)
- Add OTLP exporter → Jaeger / Azure Monitor
- Apply JWT auth to all endpoints at the API Gateway layer
Farhad Shariatzadeh — Senior AI Systems Engineer
Built as a structured programme to demonstrate production-grade AI backend engineering with .NET 10 and Semantic Kernel.