StackPilot

A production-grade Agentic AI orchestrator built with .NET 10 and Semantic Kernel 1.74 — featuring a full RAG pipeline, autonomous reasoning loops, stateful multi-turn chat, an embedded React UI, enterprise security, and cloud-ready deployment infrastructure.

What is StackPilot?

StackPilot is a fully engineered AI backend demonstrating senior-level system design across the entire AI application stack: from raw text ingestion through hybrid retrieval, autonomous tool-using agents, stateful conversation memory, observability, and enterprise governance — all served through an embedded React web UI that ships inside the same single binary.

It covers every layer a production AI system needs, not just "call an LLM and return the answer."

Quick Start — no install, no terminal needed

1. Go to Releases and download the zip for your OS.
2. Unzip it.
3. Open .env in any text editor and set your OpenAI API key:

       OpenAI__ApiKey=sk-your-key-here
       OpenAI__ModelId=gpt-4o-mini
       OpenAI__EmbeddingModelId=text-embedding-3-small

4. Double-click StackPilot.Api.exe (Windows) or run ./StackPilot.Api (Linux/macOS).
5. Open http://localhost:5050. The UI loads automatically.

Windows SmartScreen warning: Click More info → Run anyway. This is expected for unsigned open-source binaries.

Architecture

graph TB
    User([User])

    subgraph UI [Embedded React UI - served from wwwroot]
        Chat[Chat Tab]
        Playground[Playground Tab]
        Logs[Log Ingestion Tab]
        AgentTab[Agent Tab]
    end

    subgraph API [ASP.NET Core Minimal API]
        Auth[JWT Auth + RBAC]
        Guard[Guardrail Service]
        PII[PII Masking]
        Tenant[Tenant Middleware]
    end

    subgraph RAG [RAG Pipeline]
        direction TB
        Ingest[Ingestion - SK TextChunker]
        Embed[Embedding - text-embedding-3-small]
        Store[Vector Store - IVectorStore]
        Hybrid[Hybrid Search - Vector + Keyword + RRF]
        Rerank[Reranker]
        Threshold[Score Threshold]
        Compress[Context Compressor]
        Cache[Semantic Cache]
        RagSvc[RagService]
    end

    subgraph Agent [Agentic Layer]
        direction TB
        Plugins[SK Plugins - StackPilot / LogSearch / GitHub]
        AgentSvc[AgentService - RunAsync + SolveAsync]
        Loop[Think-Act-Observe Loop - MaxIterations=5]
    end

    subgraph Memory [Stateful Memory]
        direction TB
        ChatHist[Chat History - IChatHistoryStore]
        Window[Sliding Window - MaxMessages=10]
        Summarise[Conversation Summariser]
        SemMem[Semantic Memory - Vector-stored summaries]
    end

    subgraph Async [Async Infrastructure]
        direction TB
        Queue[IJobQueue - Channel / Service Bus]
        Worker[IngestionWorker - BackgroundService]
        Status[Job Status API]
    end

    subgraph Obs [Observability]
        direction TB
        OTel[OpenTelemetry Tracing]
        Latency[Latency Tracker]
        Budget[Token Budget]
        Health[Health Checks]
        Audit[Audit Log]
    end

    LLM([OpenAI - gpt-4o / gpt-4o-mini])

    User --> UI --> API
    Auth --> Guard --> Tenant
    Tenant --> RAG
    Tenant --> Agent
    Tenant --> Memory
    PII --> Store

    Ingest --> Embed --> Store
    Store --> Hybrid --> Rerank --> Threshold --> Compress --> RagSvc
    Cache --> RagSvc
    RagSvc --> LLM

    Plugins --> AgentSvc --> Loop --> LLM
    ChatHist --> Window --> Summarise --> SemMem
    Memory --> LLM

    Queue --> Worker --> Store
    Worker --> Status

    RagSvc --> Latency
    RagSvc --> Budget
    AgentSvc --> Audit
    OTel --> Health

Feature Highlights

Phase 1 — Production RAG Engine

- Text Chunking via SK TextChunker with configurable token size and overlap (IngestionOptions)
- Embeddings Pipeline using OpenAI text-embedding-3-small via IEmbeddingGenerator<string, Embedding<float>> with 96-chunk batching to stay within OpenAI's 300k token/request limit
- Hybrid Search combining cosine vector similarity + TF keyword scoring, fused with Reciprocal Rank Fusion (RRF, k=60)
- Score Thresholding: configurable quality gate rejects low-confidence chunks before prompting
- Prompt Optimisation: 6-rule anti-hallucination system instruction, forbidden phrases, fully configurable
- Response Caching: IResponseCache / MemoryResponseCache, Redis-swap-ready
- OpenAPI + Scalar UI at /scalar/v1: all endpoints typed, tagged, and described
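
As an illustration of the fusion step in Hybrid Search, here is a minimal TypeScript sketch of Reciprocal Rank Fusion with k=60. It is not the project's HybridSearchService; the function name and the chunk ids are hypothetical, and it assumes each retriever returns an ordered list of ids.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
// where rank is the 1-based position of d in that ranking.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Hypothetical chunk ids: one ranking from vector search, one from keyword search.
const vectorHits = ["c1", "c2", "c3"];
const keywordHits = ["c3", "c1", "c4"];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused); // ["c1", "c3", "c2", "c4"]
```

Chunks that appear in both rankings (c1, c3) rise above chunks found by only one retriever, which is the behaviour the hybrid approach relies on.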

Phase 2 — Agentic Reasoning & Memory

- Native Functions: [KernelFunction] plugins for system status, log search (live vector store), and the live GitHub API
- Automatic Tool Selection: FunctionChoiceBehavior.Auto() lets the LLM choose tools
- Think → Act → Observe Loop: explicit ReAct implementation with MaxIterations safety cap
- Full Reasoning Trace: every step (thought / tool / observation) returned in AgentResponse
- Stateful Multi-Turn Chat: IChatHistoryStore, sliding window, LLM-driven summarisation
- Semantic Memory: past conversation summaries embedded and scoped by userId
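
The sliding window and summarisation behaviour can be sketched as follows. This is a simplified TypeScript illustration using the documented defaults (MaxMessages=10, SummarisationThreshold=20), not the ChatService code. The type names and the `summarise` callback are hypothetical stand-ins for the LLM-driven summariser, and triggering when the history reaches (rather than exceeds) the threshold is an assumption.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

const MAX_MESSAGES = 10;
const SUMMARISATION_THRESHOLD = 20;

// Keep only the most recent messages for the prompt.
function slidingWindow(messages: ChatMessage[], maxMessages: number = MAX_MESSAGES): ChatMessage[] {
  return messages.slice(-maxMessages);
}

// Once the stored history reaches the threshold, replace its older half
// with a single system message produced by the summariser.
function compactHistory(
  messages: ChatMessage[],
  summarise: (older: ChatMessage[]) => string,
  threshold: number = SUMMARISATION_THRESHOLD,
): ChatMessage[] {
  if (messages.length < threshold) return messages;
  const half = Math.floor(messages.length / 2);
  const summary: ChatMessage = { role: "system", content: summarise(messages.slice(0, half)) };
  return [summary, ...messages.slice(half)];
}

const transcript: ChatMessage[] = [];
for (let i = 1; i <= 20; i++) {
  transcript.push({ role: i % 2 === 1 ? "user" : "assistant", content: `message ${i}` });
}

// Stub summariser standing in for the LLM call.
const compacted = compactHistory(transcript, (older) => `Summary of ${older.length} earlier messages`);
console.log(compacted.length); // 11: one summary plus the newer 10 messages
console.log(slidingWindow(transcript).length); // 10
```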

Phase 3 — Engineering & Scale

- Async Ingestion Queue: IJobQueue<T> over System.Threading.Channels, BackgroundService worker, job status polling
- Semantic Reranker: retrieve 20 chunks, rerank to top 5 via LLM scoring (IReranker)
- Metadata Filtering: SearchFilter (TenantId, Source, AfterDate) on all search paths
- SHA-256 Deduplication: prevents storing identical chunks twice
- Context Compression: long contexts summarised before the LLM call to reduce token cost
- OpenTelemetry Tracing: AspNetCore + HTTP instrumentation, OTLP-swappable
- Per-Stage Latency Tracking: Retrieval, Reranking, LLM Inference as structured log events
- Token Budget Alerts: cost estimate per query with configurable MaxCostPerQueryUsd
- Semantic Cache: cosine similarity check against cached query embeddings (threshold: 0.97)
- JWT Authentication + RBAC: Bearer token auth; Admin / User / Reader roles
- Soft Multi-Tenancy: X-Tenant-Id header; all vector records tagged with tenantId
- Prompt Injection Guardrails: static pattern detection on all user-facing inputs
- PII Masking: email, SSN, phone, credit card regex applied before embedding
- Tamper-Resistant Audit Log: every query and agent decision recorded
- Health Checks: /health and /health/detail with per-component status
- Dockerfile + docker-compose: multi-stage .NET 10 build; API + Redis in one command
- GitHub Actions CI: build + 158 tests on every push; automatic release on every PR merge to main
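
To make the Semantic Cache item concrete, the sketch below shows one way a cosine-similarity lookup with a 0.97 threshold could work. It is a TypeScript illustration under stated assumptions, not the project's SemanticCache class: the `CacheEntry` shape, the function names, and the two-dimensional embeddings are hypothetical (real embeddings have 1536 dimensions).

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  embedding: number[];
  answer: string;
}

// Return the answer of the most similar cached query, but only when the
// similarity clears the threshold; otherwise report a miss.
function lookupCachedAnswer(
  entries: CacheEntry[],
  queryEmbedding: number[],
  threshold: number = 0.97,
): string | undefined {
  let best: CacheEntry | undefined;
  let bestScore = -1;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, queryEmbedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== undefined && bestScore >= threshold ? best.answer : undefined;
}

const cachedAnswers: CacheEntry[] = [{ embedding: [1, 0], answer: "Voltage spike on channel 3" }];
console.log(lookupCachedAnswer(cachedAnswers, [0.99, 0.05])); // hit: similarity is about 0.9987
console.log(lookupCachedAnswer(cachedAnswers, [0.7, 0.7])); // miss: similarity is about 0.7071
```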

Phase 4 — Embedded UI & Production Delivery

- Embedded React UI: Chat, Playground, Log Ingestion, and Agent tabs compiled into wwwroot/ and served directly from the binary via UseStaticFiles() + SPA fallback
- Double-click to launch: reads .env from the directory next to the exe; Kestrel bound to port 5050 via appsettings.json; no terminal or environment variable setup required
- Log Ingestion tab: drag-and-drop or paste log files; immediate (sync) or background (async) ingestion with live job status polling; ingestion history persists across tab navigation
- Per-source file deletion: each ingested file is tagged with metadata["source"]; the trash icon on any history entry removes only that file's chunks from the vector store via DELETE /store/by-source/{source}
- Auto-release on PR merge: release.yml triggers on every push to main; version is MAJOR.MINOR (from VERSION file) + GitHub run number as patch; builds self-contained single-file binaries for Linux, Windows, and macOS in parallel, then publishes a GitHub Release with all three zips attached
- Chat history persistence: React Context keeps chat state alive across tab switches; fixed a stale-closure bug that caused user messages to disappear from the UI

How to Run

Option A — Download a release (easiest)

See Quick Start above. No SDK or terminal required.

Option B — Run from source

Prerequisites: .NET 10 SDK, Node.js 20+, an OpenAI API key

# 1. Build the React UI
cd StackPilot.UI
npm install
npm run build        # outputs to StackPilot.Api/wwwroot/
cd ..

# 2. Configure secrets
cd StackPilot.Api
dotnet user-secrets set "OpenAI:ApiKey"            "sk-..."
dotnet user-secrets set "OpenAI:ModelId"           "gpt-4o-mini"
dotnet user-secrets set "OpenAI:EmbeddingModelId"  "text-embedding-3-small"

# 3. Run
dotnet run

| URL | Purpose |
| --- | --- |
| `http://localhost:5050` | Embedded React UI |
| `http://localhost:5050/health` | Health check |
| `http://localhost:5050/scalar/v1` | Interactive API docs (dev only) |
| `http://localhost:5050/dashboard` | Vector store debug view |

Option C — Run with Docker

export OPENAI_API_KEY=sk-...
docker-compose up --build

Run tests

dotnet test
# 158 tests, 0 failures

UI Guide

Chat tab

Stateful multi-turn conversation with semantic memory. Chat history persists when switching tabs. Use New session to start fresh; change the User field to test per-user memory isolation.

Use this tab to ask questions about logs you've uploaded — the agent automatically searches your ingested content.

Log Ingestion tab

Upload log files (drag-and-drop or file picker), paste raw text, or mix both. Choose Immediate (sync) or Background (async) mode. The ingestion history lists every uploaded file with its chunk count and timestamp.

To remove a file from the vector store: click the 🗑️ icon on any history entry and confirm. Only that file's chunks are deleted — everything else is untouched.

To wipe everything: use the Clear Vector Store button at the bottom of the page. This also clears the ingestion history.

Recommended workflow: Clear → upload your file → ask questions in Chat or Playground.

Playground tab

Direct access to the RAG /ask endpoint and the /store endpoint. Good for testing retrieval quality and inspecting which source chunks were used in an answer.

Agent tab

Runs the autonomous Think→Act→Observe reasoning loop. The agent uses tools (vector store search, system status, GitHub API) to break down complex goals into steps. The full reasoning trace is shown after each run.
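
The shape of that loop can be sketched in TypeScript as below. This is a simplified, synchronous illustration, not the AgentService implementation: the `decide` callback is a hypothetical stand-in for the LLM call, the tool registry and type names are invented for the example, and only the iteration cap of 5 comes from the documented MaxIterations default.

```typescript
type Decision =
  | { kind: "act"; thought: string; tool: string; input: string }
  | { kind: "finish"; thought: string; answer: string };

interface TraceStep {
  thought: string;
  tool?: string;
  observation?: string;
}

type Tool = (input: string) => string;

function solve(
  goal: string,
  decide: (goal: string, trace: TraceStep[]) => Decision,
  tools: Record<string, Tool>,
  maxIterations: number = 5,
): { answer: string | null; trace: TraceStep[] } {
  const trace: TraceStep[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const decision = decide(goal, trace); // Think
    if (decision.kind === "finish") {
      trace.push({ thought: decision.thought });
      return { answer: decision.answer, trace };
    }
    const tool = tools[decision.tool];
    const observation = tool ? tool(decision.input) : `Unknown tool: ${decision.tool}`; // Act
    trace.push({ thought: decision.thought, tool: decision.tool, observation }); // Observe
  }
  return { answer: null, trace }; // safety cap reached without a final answer
}

// Scripted decisions in place of a model: search once, then answer.
const tools: Record<string, Tool> = {
  searchLogs: (query) => `1 match for "${query}"`,
};
const result = solve(
  "Find voltage errors",
  (_goal, trace) =>
    trace.length === 0
      ? { kind: "act", thought: "I should search the logs.", tool: "searchLogs", input: "voltage" }
      : { kind: "finish", thought: "I have enough evidence.", answer: "One voltage error was logged." },
  tools,
);
console.log(result.answer);
console.log(result.trace.length); // 2 steps: one tool call, one final thought
```

Returning the trace alongside the answer is what lets a UI show each thought, tool call, and observation after the run.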

API Reference

| Method | Endpoint | Description | Auth |
| --- | --- | --- | --- |
| `POST` | `/store` | Chunk → embed → persist text (optional `source` tag) | None |
| `POST` | `/store/async` | Async ingestion via background queue → 202 + jobId | None |
| `POST` | `/store/deduped` | Ingest with SHA-256 deduplication | None |
| `GET` | `/store` | List all vector store records | None |
| `DELETE` | `/store` | Remove all records from the vector store | None |
| `DELETE` | `/store/by-source/{source}` | Remove all chunks tagged with a specific source | None |
| `GET` | `/jobs/{jobId}` | Poll async ingestion job status | None |
| `POST` | `/ask` | Full RAG pipeline: retrieve → rerank → compress → LLM | None |
| `POST` | `/ask/stream` | SSE streaming RAG answer | None |
| `POST` | `/search` | Hybrid vector + keyword search with metadata filter | None |
| `POST` | `/agent` | Single-shot agent with automatic tool selection | None |
| `POST` | `/agent/solve` | Autonomous Think→Act→Observe reasoning loop with trace | None |
| `POST` | `/chat` | Stateful multi-turn chat with sliding window + memory | None |
| `GET` | `/chat/{sessionId}` | Retrieve full session message history | None |
| `POST` | `/auth/token` | Issue JWT for testing (dev only) | None |
| `POST` | `/evaluate` | Run 10-question RAG accuracy test set | Admin |
| `GET` | `/audit` | Retrieve audit log entries | Admin |
| `GET` | `/health` | Liveness health check | None |
| `GET` | `/health/detail` | Detailed per-component health (JSON) | None |
| `GET` | `/dashboard` | Internal vector store debug UI | None |

Project Structure

StackPilot/
├── StackPilot.Api/
│   ├── Async/                   # Queue, Worker, Job Status
│   ├── Deployment/              # Cloud deployment guide
│   ├── Extensions/              # DI registration
│   ├── Middleware/              # Global exception handler
│   ├── Observability/           # OTel, Latency, Token Budget
│   ├── Persistence/             # SQLite chat history + audit log
│   ├── Plugins/                 # SK native function plugins (live vector store)
│   ├── Resilience/              # Polly retry + circuit breaker
│   ├── Search/                  # IReranker
│   ├── Security/                # JWT, RBAC, Tenancy, Guardrails, PII, Audit
│   ├── Storage/                 # IVectorStore + QdrantVectorStore
│   ├── wwwroot/                 # Compiled React UI (generated by npm run build)
│   ├── Program.cs               # Minimal API endpoints
│   ├── RagService.cs            # Core RAG orchestrator
│   ├── AgentService.cs          # Agentic reasoning loop
│   ├── ChatService.cs           # Stateful multi-turn chat
│   ├── HybridSearchService.cs   # RRF fusion search
│   ├── VectorStore.cs           # In-memory vector store
│   ├── SemanticCache.cs         # Embedding-similarity cache
│   ├── appsettings.json         # Kestrel port 5050 + all config sections
│   ├── .env.example             # Template for double-click launch
│   ├── Dockerfile
│   └── docker-compose.yml
├── StackPilot.UI/               # React 18 + Vite + Tailwind CSS
│   ├── src/
│   │   ├── api/client.ts        # Typed API client
│   │   ├── components/          # Sidebar navigation
│   │   ├── context/AppState.tsx # Shared state (chat, ingestion history)
│   │   └── pages/               # Chat, Playground, Logs, Agent
│   └── vite.config.ts           # Dev proxy → localhost:5050
├── StackPilot.Api.Tests/
│   ├── 158 tests across 24 test files
│   └── StackPilotApiFactory.cs  # WebApplicationFactory with DI stubs
├── VERSION                      # MAJOR.MINOR for auto-release versioning
└── .github/
    ├── workflows/ci.yml         # Build + test on every push
    └── workflows/release.yml    # Auto-release on every merge to main

CI / CD

| Workflow | Trigger | What it does |
| --- | --- | --- |
| CI (`ci.yml`) | Every push to `main`, `features`, `phase-*`; every PR | `dotnet build` + `dotnet test` (158 tests) |
| Release (`release.yml`) | Every merge to `main` | Builds React UI, publishes self-contained binaries for Linux / Windows / macOS, creates a GitHub Release with all three zips |

Versioning: the VERSION file contains MAJOR.MINOR (e.g. 1.1). The GitHub run number is appended as the patch, producing tags like v1.1.42. To bump the major or minor version, edit VERSION and merge.
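
The tag scheme amounts to a simple string composition. The TypeScript sketch below illustrates it; the function name is hypothetical and the real logic lives in release.yml, so treat this as a description of the rule rather than the workflow's code.

```typescript
// Compose a release tag from the VERSION file contents (MAJOR.MINOR)
// and the CI run number, which becomes the patch component.
function releaseTag(versionFileContents: string, runNumber: number): string {
  const majorMinor = versionFileContents.trim();
  if (!/^\d+\.\d+$/.test(majorMinor)) {
    throw new Error(`VERSION must contain MAJOR.MINOR, got "${majorMinor}"`);
  }
  if (!Number.isInteger(runNumber) || runNumber < 0) {
    throw new Error("runNumber must be a non-negative integer");
  }
  return `v${majorMinor}.${runNumber}`;
}

console.log(releaseTag("1.1\n", 42)); // v1.1.42
```

Because the run number only ever increases, patch versions stay monotonic without anyone editing VERSION between releases.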

Key Engineering Decisions

| Decision | Chosen | Why |
| --- | --- | --- |
| Search strategy | Hybrid RRF (vector + keyword) | Neither alone is sufficient: keyword catches exact terms, vectors catch semantics; RRF fusion outperforms either individually |
| Plugin isolation | `ILlmService` + `IAgentService` interfaces | Decouples RAG and agent logic from Semantic Kernel, giving full test coverage with zero OpenAI dependency |
| Memory layers | Sliding window + session store + semantic memory | Mirrors human cognition: working memory (window), short-term (session), long-term (vector-embedded summaries) |
| Queue abstraction | `IJobQueue<T>` over `Channel<T>` | Azure Service Bus / RabbitMQ swap requires one new class; no changes to worker or endpoint code |
| Vector store abstraction | `IVectorStore` interface | Azure AI Search / Qdrant drop-in; all consumers remain unchanged |
| Source tagging | `metadata["source"]` on every chunk | Enables per-file deletion without tracking chunk IDs in the frontend; works for both sync and async ingestion |
| UI delivery | React SPA compiled into `wwwroot/`, served by Kestrel | No separate web server, no CORS configuration, no deploy step; the binary is the full product |
| Single-file publish | `PublishSingleFile=true` + `Environment.ProcessPath` | `AppContext.BaseDirectory` can point to the temp extraction dir when a single-file app self-extracts; `ProcessPath` finds the actual exe dir so `.env` and `wwwroot` are resolved correctly |
| Semantic cache threshold | 0.97 cosine similarity | Conservative: avoids returning a cached answer for a subtly different question; tunable per use case |
| Reranker default | `PassThroughReranker` | Zero token cost by default; `LlmReranker` activated by DI swap when quality matters more than cost |
| Auth scope | JWT on admin endpoints only | Integration tests run without auth headers; production hardens all endpoints at the API Gateway layer |

How Efficiency Is Engineered

Cost and latency are controlled at every layer — each expensive operation (embedding, retrieval, LLM call) has a cheaper path in front of it.

| Technique | Where | What it saves |
| --- | --- | --- |
| Two-tier caching | Semantic cache (cosine ≥ 0.97) + keyword response cache, checked before retrieval | Skips retrieval + embedding + LLM entirely on a hit, so the most expensive path is avoided completely |
| Embedding batching | `EmbeddingService`, batch size 96 | One API call per 96 chunks instead of one-per-chunk; at the default 300-token chunk size a batch is roughly 96 × 300 = 28,800 tokens, well under OpenAI's 300k-token/request cap |
| Context compression | Triggers only when context > 2000 chars; output capped at 500 words | Cuts prompt tokens on large contexts, and only pays the compression cost when it's actually needed |
| Score thresholding | Configurable gate drops low-relevance chunks before prompting | Fewer tokens to the LLM and higher answer quality |
| Reranker is pass-through by default | `PassThroughReranker` (zero LLM cost); `LlmReranker` is an opt-in DI swap | Quality-vs-cost becomes a config decision, not a hard-coded tax |
| Token budget guard | Estimates cost per query (~4 chars/token), warns above $0.05 | Cost visibility without blocking the request |
| Sliding window + summarisation | Chat caps context at 10 messages, compresses the older half at 20 | Bounded per-turn token cost regardless of conversation length |
| SHA-256 deduplication | `/store/deduped`, SHA-256 text hash (256-bit digest, 64 hex characters) | Never embeds or stores identical content twice |
| Early-exit reranker | Skips work entirely when candidate count ≤ topK | No wasted LLM call |
| Async ingestion | `Channel<T>`-backed queue + background worker | Large files return a `202` + jobId immediately instead of blocking the request |
| SSE streaming | `/ask/stream` | Time-to-first-token instead of waiting for the full answer |
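
The embedding batching row can be illustrated with a small TypeScript helper that splits chunks into groups of 96, so each group maps to one embeddings request. The helper name and the sample chunk strings are hypothetical; this is not the EmbeddingService code.

```typescript
// Split items into consecutive batches of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number = 96): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 200 hypothetical chunks need three requests instead of 200.
const chunks = Array.from({ length: 200 }, (_, i) => `chunk-${i}`);
const batches = toBatches(chunks);
console.log(batches.map((b) => b.length)); // [96, 96, 8]
```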

Configuration & Constants Reference

Every tunable below is bound from appsettings.json (or environment variables) unless marked hardcoded. Production overrides live in appsettings.Production.json.

| Component | Setting | Default | Notes |
| --- | --- | --- | --- |
| Kestrel | HTTP port | `5050` | Bound in `appsettings.json` so published binaries always listen here |
| Ingestion | `MaxTokensPerChunk` | `300` | SK `TextChunker` chunk size |
| Ingestion | `OverlapTokens` | `50` | Overlap between adjacent chunks |
| Embeddings | Batch size | `96` | hardcoded; OpenAI `text-embedding-3-small`, 1536 dims |
| Hybrid search | RRF constant `k` | `60` | SIGIR 2009 standard; fetches topK×2 from each retriever |
| RAG | `ScoreThreshold` | `0.0` | `0.0` = disabled; `0.75` recommended for production |
| Reranker | Fetch pool | `topK × 4` | Only when a reranker is active |
| Context compression | Trigger | `> 2000 chars` | hardcoded; output capped at 500 words |
| Response cache | `Enabled` / TTL | `false` / `300s` | On in `Production`; Redis-swap-ready |
| Semantic cache | `SimilarityThreshold` | `0.97` | Cosine; on in `Production` |
| Agent | `MaxIterations` | `5` | Hard cap on the Think→Act→Observe loop |
| Agent | `MaxTokensPerStep` | `2000` | Per solve step |
| Chat | `MaxMessages` | `10` | Sliding-window context size |
| Chat | `SummarisationThreshold` | `20` | Compresses the older half into a system summary |
| Token budget | `MaxCostPerQueryUsd` | `$0.05` | Warns only, non-blocking |
| Token budget | `CostPerThousandTokens` | `$0.002` | gpt-4o-mini rate |
| Guardrails | Max input length | `4000 chars` | hardcoded; plus 13 injection patterns |
| Resilience | Retry attempts | `3` | Exponential backoff from 1s |
| Resilience | Circuit breaker | `50%` over `30s`, break `15s` | Min 3 requests to evaluate |
| Rate limiting | Per-IP fixed window | `60 req/min` | Returns `429` on breach |
| Request size | Max body | `10 MB` | Kestrel limit |
| JWT | `ExpiryMinutes` | `60` | HMAC-SHA256 |
| Qdrant | `VectorSize` | `1536` | Cosine distance; opt-in via `Qdrant:Endpoint` |
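
As a worked example of the two token budget settings, the TypeScript sketch below estimates cost from character count and flags queries above the limit. It assumes the estimate is applied to a single block of text and uses floor division to mirror an integer `text.Length / 4`; the function names are hypothetical and this is not the project's budget service.

```typescript
// Rough token estimate: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.floor(text.length / 4);
}

// Estimated cost in USD at a flat per-thousand-token rate.
function estimateCostUsd(text: string, costPerThousandTokens: number = 0.002): number {
  return (estimateTokens(text) / 1000) * costPerThousandTokens;
}

// The guard only signals; the caller decides to log a warning, not to block.
function exceedsBudget(costUsd: number, maxCostPerQueryUsd: number = 0.05): boolean {
  return costUsd > maxCostPerQueryUsd;
}

const prompt = "x".repeat(8000); // 8000 chars, so about 2000 tokens
const cost = estimateCostUsd(prompt);
console.log(estimateTokens(prompt)); // 2000
console.log(cost.toFixed(4)); // "0.0040"
console.log(exceedsBudget(cost)); // false
```

At these defaults a warning needs more than 25,000 estimated tokens (over 100,000 characters) in one query, which shows how much headroom the $0.05 limit leaves for typical prompts.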

Full Demo

# 1. Store a log file with a source tag
curl -X POST http://localhost:5050/store \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-01-15 ERROR PowerController voltage spike on channel 3", "source": "power-controller.log"}'

# 2. Ask a grounded question
curl -X POST http://localhost:5050/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What errors occurred in the power controller?", "topK": 3}'

# 3. Delete all chunks from a specific file
curl -X DELETE http://localhost:5050/store/by-source/power-controller.log

# 4. Run the autonomous reasoning agent
curl -X POST http://localhost:5050/agent/solve \
  -H "Content-Type: application/json" \
  -d '{"goal": "Check system health and search logs for any errors"}'

# 5. Start a stateful conversation
curl -X POST http://localhost:5050/chat \
  -H "Content-Type: application/json" \
  -d '{"sessionId": "demo-1", "userId": "farhad", "message": "What issues did you find in the logs?"}'

Known Limitations

Documented deliberately — these are the real boundaries of the current design, each with a clear path forward (most already on the roadmap):

- In-memory defaults are single-instance. The job queue, job-status store, and semantic cache live in process memory, so they don't survive a restart or share across instances. SQLite (chat/audit) and Qdrant (vectors) cover persistence; the queue would need Azure Service Bus for true multi-instance scale-out. The IJobQueue<T> abstraction is already in place for that swap.
- Token estimation is approximate (text.Length / 4), not a real tokenizer. It's accurate enough for budget warnings, not for hard billing.
- Auth covers admin endpoints only. By design, production enforces auth at an API Gateway; the app itself gates only /evaluate and /audit.
- The ingestion worker is sequential with no per-job timeout. A very large embed can block the queue. Scale-out today means registering more workers, not automatic partitioning.
- Tenant isolation is application-level filtering, not database-enforced row security.
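
To show what application-level filtering means in practice, here is a minimal TypeScript sketch of a metadata filter with the three documented fields (TenantId, Source, AfterDate). The record shape, field casing, and function names are assumptions for illustration and do not reflect the actual SearchFilter type.

```typescript
interface VectorRecord {
  id: string;
  tenantId: string;
  source: string;
  createdAt: Date;
}

interface SearchFilter {
  tenantId?: string;
  source?: string;
  afterDate?: Date;
}

// A record passes only if it satisfies every field that the filter sets.
function matchesFilter(record: VectorRecord, filter: SearchFilter): boolean {
  if (filter.tenantId !== undefined && record.tenantId !== filter.tenantId) return false;
  if (filter.source !== undefined && record.source !== filter.source) return false;
  if (filter.afterDate !== undefined && record.createdAt.getTime() <= filter.afterDate.getTime()) return false;
  return true;
}

function applyFilter(records: VectorRecord[], filter: SearchFilter): VectorRecord[] {
  return records.filter((record) => matchesFilter(record, filter));
}

const records: VectorRecord[] = [
  { id: "1", tenantId: "acme", source: "power-controller.log", createdAt: new Date("2024-01-15T00:00:00Z") },
  { id: "2", tenantId: "acme", source: "api.log", createdAt: new Date("2024-02-01T00:00:00Z") },
  { id: "3", tenantId: "globex", source: "power-controller.log", createdAt: new Date("2024-02-01T00:00:00Z") },
];

const visible = applyFilter(records, { tenantId: "acme", afterDate: new Date("2024-01-31T00:00:00Z") });
console.log(visible.map((r) => r.id)); // ["2"]
```

The isolation holds only as long as every query path applies the filter, which is the trade-off the limitation above describes.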

Roadmap

Author

Farhad Shariatzadeh — Senior AI Systems Engineer
Built as a structured programme to demonstrate production-grade AI backend engineering with .NET 10 and Semantic Kernel.
