Merged
245 changes: 245 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,245 @@
# CLAUDE.md — DocsContext Development Guide

## Project Overview

DocsContext is a GraphRAG-powered documentation search tool written in Go. It indexes documents (PDF, DOCX, TXT, MD, web pages) into a knowledge graph with entity extraction, community detection, and vector embeddings, then answers queries using a combination of graph search and vector similarity.
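To make the "vector similarity" half of the query path concrete, here is a minimal sketch of cosine-similarity ranking over chunk embeddings. It is an illustration only: the function names `cosine` and `topK` are hypothetical, and the actual scoring in `internal/search/` is not shown in this guide and may differ.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two vectors. It returns 0 when the
// lengths differ, the vectors are empty, or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the indices of the k chunks most similar to the query,
// best match first. (Hypothetical helper, not part of DocsContext.)
func topK(query []float32, chunks [][]float32, k int) []int {
	idx := make([]int, len(chunks))
	scores := make([]float64, len(chunks))
	for i, c := range chunks {
		idx[i] = i
		scores[i] = cosine(query, c)
	}
	sort.SliceStable(idx, func(a, b int) bool { return scores[idx[a]] > scores[idx[b]] })
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	chunks := [][]float32{{1, 0}, {0, 1}, {1, 1}}
	fmt.Println(topK([]float32{1, 0.1}, chunks, 2))
}
```

The vectors are `[]float32` to match what `Provider.Embed` returns later in this guide; the sums are accumulated in `float64` to limit rounding error.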

## Build & Test

```bash
go build ./...
go test ./...
go run . --help
```

## Architecture

```
cmd/            CLI commands (cobra): index, serve, search, version
internal/
  api/          REST API handlers
  chunker/      Text splitting into overlapping chunks
  community/    Louvain community detection + summarization
  config/       Viper-based YAML config loading
  crawler/      Web page crawler
  embedder/     Batched text → vector embedding
  extractor/    LLM-based entity/relationship/claims extraction
  llm/          LLM provider abstraction (Azure OpenAI, Ollama)
  loader/       Document loaders (PDF, DOCX, TXT, MD, web)
  mcp/          Model Context Protocol server
  pipeline/     5-phase GraphRAG indexing pipeline
  search/       Query engine (local + global search)
  store/        SQLite storage layer
```

## Supported LLM Providers

Only **Azure OpenAI** and **Ollama** are supported. HuggingFace was removed.

## Recent Changes (already committed)

The following improvements are already committed to the branch `claude/fix-codecontext-config-DR15O`:

1. **Config fix**: Loads config from `~/.docscontext/` (lowercase) and supports both `.yaml` and `.yml`
2. **HuggingFace removal**: Dropped HuggingFace provider, config struct, and defaults
3. **GraphRAG quality improvements** (aligned with Microsoft GraphRAG):
   - **Gleanings**: Multi-pass entity extraction in `internal/extractor/entities.go` (configurable via `indexing.max_gleanings`, default: 1)
   - **Improved extraction prompt**: Few-shot examples, 10 entity types, weight guidance, implicit relationship extraction
   - **Entity name normalization**: Case-insensitive dedup in `internal/pipeline/pipeline.go`
   - **Relationship deduplication**: By (source, target, predicate) in pipeline
   - **Fixed Louvain modularity formula**: Correct ΔQ calculation in `internal/community/louvain.go`
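As a sanity check on the ΔQ item, the sketch below compares the closed-form gain for moving an isolated node into a community against a brute-force modularity difference on a small graph. The helper names `modularity` and `gainAdd` are hypothetical and not taken from `louvain.go`. The closed form used here is ΔQ = k_i_in/m − sigma_tot·k_i/(2m²); the committed code scores candidates with exactly half of that value, which picks the same best community because only comparisons between gains matter.

```go
package main

import "fmt"

// modularity computes Q = Σ_c [ Σ_in/(2m) − (Σ_tot/(2m))² ] for a dense,
// symmetric adjacency matrix and a node → community assignment.
func modularity(adj [][]float64, comm []int) float64 {
	var twoM float64 // sum of all matrix entries equals 2m
	in := map[int]float64{}
	tot := map[int]float64{}
	for i := range adj {
		for j := range adj[i] {
			w := adj[i][j]
			twoM += w
			tot[comm[i]] += w
			if comm[i] == comm[j] {
				in[comm[i]] += w // each internal edge is counted twice
			}
		}
	}
	if twoM == 0 {
		return 0
	}
	var q float64
	for c, t := range tot {
		q += in[c]/twoM - (t/twoM)*(t/twoM)
	}
	return q
}

// gainAdd is the closed-form ΔQ for moving an isolated node i into community C:
// kiIn = weight from i into C, sigmaTot = total degree of C (without i),
// ki = degree of i, m = total edge weight.
func gainAdd(kiIn, sigmaTot, ki, m float64) float64 {
	return kiIn/m - sigmaTot*ki/(2*m*m)
}

func main() {
	// Unit-weight triangle 0-1-2; node 2 starts in its own community.
	adj := [][]float64{{0, 1, 1}, {1, 0, 1}, {1, 1, 0}}
	before := modularity(adj, []int{0, 0, 1})
	after := modularity(adj, []int{0, 0, 0})
	fmt.Printf("brute force: %.6f\n", after-before)
	fmt.Printf("closed form: %.6f\n", gainAdd(2, 4, 2, 3))
}
```

A check of this shape could be turned into a unit test for `internal/community`, under the assumption that the package exposes (or can expose) a way to compute modularity for a given assignment.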

## Remaining Task: langchaingo Integration

Replace the custom HTTP-based LLM provider implementations with [langchaingo](https://github.com/tmc/langchaingo) (v0.1.14+).

### Why

The current `internal/llm/azure.go` and `internal/llm/ollama.go` contain about 250 lines of hand-written HTTP client code. langchaingo provides maintained client implementations for both providers, so this project no longer has to own the request, response, and error-handling code itself.

### Step 1: Add langchaingo dependency

```bash
go get github.com/tmc/langchaingo@latest
```

### Step 2: Rewrite `internal/llm/provider.go`

Keep the existing `Provider` interface unchanged. Replace the implementations with a single `lcProvider` struct that wraps langchaingo:

```go
package llm

import (
	"context"
	"fmt"

	"github.com/RandomCodeSpace/docscontext/internal/config"
	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/ollama"
	"github.com/tmc/langchaingo/llms/openai"
)

// lcProvider adapts langchaingo to our Provider interface.
type lcProvider struct {
	llm     llms.Model
	emb     embeddings.Embedder
	name    string
	modelID string
}

func (p *lcProvider) Name() string    { return p.name }
func (p *lcProvider) ModelID() string { return p.modelID }

func (p *lcProvider) Complete(ctx context.Context, prompt string, opts ...Option) (string, error) {
	o := applyOptions(opts)
	callOpts := []llms.CallOption{
		llms.WithMaxTokens(o.maxTokens),
		llms.WithTemperature(o.temperature),
	}
	if o.jsonMode {
		callOpts = append(callOpts, llms.WithJSONMode())
	}
	return llms.GenerateFromSinglePrompt(ctx, p.llm, prompt, callOpts...)
}

func (p *lcProvider) Embed(ctx context.Context, text string) ([]float32, error) {
	return p.emb.EmbedQuery(ctx, text)
}

func (p *lcProvider) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
	return p.emb.EmbedDocuments(ctx, texts)
}
```

#### Ollama factory:

```go
func newOllamaProvider(cfg *config.LLMConfig) (Provider, error) {
	chatLLM, err := ollama.New(
		ollama.WithServerURL(cfg.Ollama.BaseURL),
		ollama.WithModel(cfg.Ollama.ChatModel),
	)
	if err != nil {
		return nil, fmt.Errorf("ollama chat LLM: %w", err)
	}
	embedLLM, err := ollama.New(
		ollama.WithServerURL(cfg.Ollama.BaseURL),
		ollama.WithModel(cfg.Ollama.EmbedModel),
	)
	if err != nil {
		return nil, fmt.Errorf("ollama embed LLM: %w", err)
	}
	emb, err := embeddings.NewEmbedder(embedLLM)
	if err != nil {
		return nil, fmt.Errorf("ollama embedder: %w", err)
	}
	return &lcProvider{llm: chatLLM, emb: emb, name: "ollama", modelID: cfg.Ollama.EmbedModel}, nil
}
```

#### Azure factory:

```go
func newAzureProvider(cfg *config.LLMConfig) (Provider, error) {
	chatLLM, err := openai.New(
		openai.WithBaseURL(cfg.Azure.Endpoint),
		openai.WithToken(cfg.Azure.APIKey),
		openai.WithAPIVersion(cfg.Azure.APIVersion),
		openai.WithAPIType(openai.APITypeAzure),
		openai.WithModel(cfg.Azure.ChatModel),
		openai.WithEmbeddingModel(cfg.Azure.EmbedModel),
	)
	if err != nil {
		return nil, fmt.Errorf("azure openai LLM: %w", err)
	}
	emb, err := embeddings.NewEmbedder(chatLLM)
	if err != nil {
		return nil, fmt.Errorf("azure openai embedder: %w", err)
	}
	return &lcProvider{llm: chatLLM, emb: emb, name: "azure", modelID: cfg.Azure.EmbedModel}, nil
}
```

### Step 3: Delete old implementations

```bash
rm internal/llm/azure.go internal/llm/ollama.go
```

### Step 4: Replace `internal/chunker/chunker.go` with langchaingo textsplitter

```go
package chunker

import (
	"unicode/utf8"

	"github.com/tmc/langchaingo/textsplitter"
)

// Chunk is one piece of a split document.
type Chunk struct {
	Index   int
	Content string
	Tokens  int // estimated, see estimateTokens
}

type Chunker struct {
	splitter textsplitter.RecursiveCharacter
}

func New(chunkSize, chunkOverlap int) *Chunker {
	return &Chunker{
		splitter: textsplitter.NewRecursiveCharacter(
			textsplitter.WithChunkSize(chunkSize),
			textsplitter.WithChunkOverlap(chunkOverlap),
			textsplitter.WithSeparators([]string{"\n\n", "\n", ". ", " ", ""}),
		),
	}
}

func (c *Chunker) Split(text string) []Chunk {
	parts, err := c.splitter.SplitText(text)
	if err != nil {
		// Fall back to a single chunk holding the whole text.
		return []Chunk{{Index: 0, Content: text, Tokens: estimateTokens(text)}}
	}
	chunks := make([]Chunk, len(parts))
	for i, p := range parts {
		chunks[i] = Chunk{Index: i, Content: p, Tokens: estimateTokens(p)}
	}
	return chunks
}

// estimateTokens approximates the token count as one token per four runes.
func estimateTokens(text string) int {
	return utf8.RuneCountInString(text) / 4
}
```

### Step 5: No changes needed for embedder

`internal/embedder/embedder.go` delegates to `Provider.EmbedBatch()` — it works as-is since the `Provider` interface is unchanged.

### Step 6: Build and verify

```bash
go mod tidy
go build ./...
go test ./...
```

### Important langchaingo API notes

- `llms.GenerateFromSinglePrompt()` — sends a single prompt and returns the text response
- `embeddings.NewEmbedder(client)` — wraps any LLM with `CreateEmbedding()` into an `Embedder`
- `embeddings.Embedder.EmbedDocuments()` returns `[][]float32` (not float64)
- `embeddings.Embedder.EmbedQuery()` returns `[]float32`
- Ollama's `LLM` and OpenAI's `LLM` both implement `CreateEmbedding(ctx, []string) ([][]float32, error)`
- OpenAI package supports Azure via `openai.WithAPIType(openai.APITypeAzure)`
- OpenAI package supports separate embedding model via `openai.WithEmbeddingModel()`

## Code Style

- Use `slog` for logging with emoji prefixes (📄 ✅ ⚠️ ❌ 🔗 🧩 💾 🌐 ⏭️ ⚙️)
- Error wrapping: `fmt.Errorf("context: %w", err)`
- Concurrency: use semaphore channels (`make(chan struct{}, N)`) for limiting parallelism
- Config: Viper with `mapstructure` tags, env prefix `DocsContext`
9 changes: 2 additions & 7 deletions config.example.yaml
@@ -1,7 +1,7 @@
data_dir: ~/.DocsContext/data # stores DocsContext.db

llm:
provider: ollama # azure | ollama | huggingface
provider: ollama # azure | ollama

azure:
endpoint: https://myresource.openai.azure.com
@@ -15,19 +15,14 @@ llm:
chat_model: llama3.2
embed_model: nomic-embed-text

huggingface:
base_url: http://localhost:8000 # TGI local endpoint
api_key: ${HF_API_KEY}
chat_model: mistralai/Mistral-7B-Instruct-v0.3
embed_model: sentence-transformers/all-MiniLM-L6-v2

indexing:
chunk_size: 512
chunk_overlap: 50
batch_size: 20
workers: 4
extract_graph: true
extract_claims: true
max_gleanings: 1 # gleaning passes for entity extraction (0=single pass)

community:
min_community_size: 3
65 changes: 42 additions & 23 deletions internal/community/louvain.go
@@ -6,7 +6,7 @@ import (

// Graph represents an undirected weighted graph for community detection.
type Graph struct {
Nodes []string
Nodes []string
nodeIndex map[string]int
Edges []Edge
adjMatrix [][]float64 // dense for simplicity
@@ -61,6 +61,16 @@ func (g *Graph) NodeIndex(id string) (int, bool) {

// Louvain runs the Louvain community detection algorithm.
// Returns a map from node index → community ID (integer).
//
// Uses the standard modularity gain formula:
//
// ΔQ = [k_i_in / (2m)] - [sigma_tot * k_i / (2m)²]
//
// where:
// - k_i_in = sum of edge weights from node i to nodes in community C
// - sigma_tot = sum of all edge weights incident to nodes in community C
// - k_i = weighted degree of node i
// - m = total edge weight of the graph
func Louvain(g *Graph, maxIter int) []int {
n := len(g.Nodes)
if n == 0 {
@@ -77,44 +87,63 @@ func Louvain(g *Graph, maxIter int) []int {
return comm
}

m := g.totalWeight // total edge weight
m2 := 2.0 * m // 2m, used frequently

// Precompute node degrees
degree := make([]float64, n)
for i := 0; i < n; i++ {
degree[i] = g.nodeDegree(i)
}

// Community total degree (sigma_tot): sum of degrees of all nodes in community
sigmaTot := make(map[int]float64, n)
for i := 0; i < n; i++ {
sigmaTot[comm[i]] += degree[i]
}

improved := true
for iter := 0; iter < maxIter && improved; iter++ {
improved = false
// Random order
order := rand.Perm(n)
for _, i := range order {
bestComm := comm[i]
bestGain := 0.0
ki := degree[i]
oldComm := comm[i]

// Neighbor communities
// Compute weights from node i to each neighboring community
neighborComms := map[int]float64{}
for j := 0; j < n; j++ {
if g.adjMatrix[i][j] > 0 {
neighborComms[comm[j]] += g.adjMatrix[i][j]
}
}

// Current community weight (excluding i)
ki := g.nodeDegree(i)
// Remove node i from its current community for gain calculation
sigmaTot[oldComm] -= ki

// Remove i from current community
oldComm := comm[i]
comm[i] = -1
// Gain of removing node i from its current community
kiOld := neighborComms[oldComm] // edges from i to old community (after removal)
removeLoss := kiOld/m2 - (sigmaTot[oldComm]*ki)/(m2*m2)

for c, w := range neighborComms {
// Modularity gain (simplified)
sigmaC := g.communityDegree(comm, c)
gain := w - (ki*sigmaC)/(2*g.totalWeight)
for c, kiIn := range neighborComms {
// Gain of adding node i to community c
addGain := kiIn/m2 - (sigmaTot[c]*ki)/(m2*m2)
gain := addGain - removeLoss
if gain > bestGain {
bestGain = gain
bestComm = c
}
}

// Move node i to best community
comm[i] = bestComm
sigmaTot[bestComm] += ki

if bestComm != oldComm {
improved = true
}
comm[i] = bestComm
}
}

@@ -140,16 +169,6 @@ func (g *Graph) nodeDegree(i int) float64 {
return d
}

func (g *Graph) communityDegree(comm []int, c int) float64 {
var d float64
for i, ci := range comm {
if ci == c {
d += g.nodeDegree(i)
}
}
return d
}

// HierarchicalLouvain runs Louvain at multiple levels.
// Returns a slice of levels, each level is a map nodeID → communityLabel.
func HierarchicalLouvain(g *Graph, maxLevels, maxIter int) [][]int {