30 changes: 0 additions & 30 deletions .github/workflows/pair-reviewer.yml

This file was deleted.

33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,39 @@ All notable changes to VecGrep are documented here.

---

## [Unreleased]

### Added

- **Knowledge graph index** — `index_graph` builds a structural code graph from
any indexed codebase using tree-sitter AST extraction (no LLM required).
Extracts files, functions, classes, and methods as nodes; `contains`, `calls`,
`imports`, and `inherits` as directed edges. Graph is persisted as
`graph.json` alongside the vector index in `~/.vecgrep/<project>/`.

- **`search_graph` MCP tool** — keyword search over node labels (function names,
class names, file names). Returns matching nodes with kind, source location,
and connectivity degree.

- **`graph_neighbors` MCP tool** — given a node ID or label, returns its
direct structural neighbourhood: callers, callees, imports, contained members, and
inheritance edges. Supports `depth` up to 4 hops.

- **`hybrid_search` MCP tool** — blends vector similarity and graph proximity
into a single ranked result list. Score formula:
`α × vector_score + (1−α) × graph_score`. Both inputs are normalised to
`[0, 1]`. Requires both `index_codebase` and `index_graph` to have been run;
degrades gracefully to pure vector search if the graph index is absent.

- **`networkx>=3.2` dependency** — used for graph construction, BFS traversal,
and JSON serialisation via `networkx.readwrite.json_graph`.

- **`tree-sitter==0.21.3` pin** — pins tree-sitter to the version compatible
with `tree-sitter-languages 1.10.x` to prevent silent extraction failures
caused by the 0.22+ API break.

---

## [1.8.0] — 2026-05-19

### Added
93 changes: 89 additions & 4 deletions README.md
@@ -8,6 +8,48 @@ Cursor-style semantic code search as an MCP plugin for Claude Code.

Instead of grepping 50 files and sending 30,000 tokens to Claude, VecGrep returns the top 8 semantically relevant code chunks (~1,600 tokens). That's a **~95% token reduction** for codebase queries.

## Benchmarks

Measured on the VecGrep codebase itself (5 source files, ~26k tokens raw).

### Token usage per query

| Mode | Avg tokens returned | vs raw read | Savings |
|---|---|---|---|
| Raw file read (baseline) | 26,009 | — | — |
| `search_code` (top_k=8) | ~3,007 | 11.6% | **88%** |
| `hybrid_search` (top_k=8) | ~3,324 | 12.8% | **87%** |
| `search_graph` (limit=8) | ~47 | 0.2% | **>99%** |

`search_graph` returns structured node metadata only (name, kind, file, line range) — no source code — so it's ultra-cheap for structural questions ("where is X defined?", "what calls Y?").
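The percentages in the table follow directly from the token counts. A minimal sketch of that arithmetic, using only the figures quoted above (the helper name `share_of_raw` is illustrative and not part of VecGrep):

```python
# Token counts copied from the table above.
RAW_TOKENS = 26_009
RETURNED = {"search_code": 3_007, "hybrid_search": 3_324, "search_graph": 47}


def share_of_raw(tokens: int, raw: int = RAW_TOKENS) -> float:
    """Fraction of the raw-read token count that a mode returns."""
    return tokens / raw


for mode, tokens in RETURNED.items():
    share = share_of_raw(tokens)
    print(f"{mode}: {share:.1%} of raw read, {1 - share:.1%} saved")
```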

### Query latency (median, 5 runs)

| Mode | Latency |
|---|---|
| `search_graph` | ~3ms |
| `hybrid_search` | ~76ms |
| `search_code` | ~83ms |

`search_graph` is ~30× faster than vector search — pure in-memory graph traversal, no embedding model call.

### Result correctness (structural queries)

For name-based structural queries, pure vector search can rank documentation (CHANGELOG, README) above source code. The graph index fixes this:

| Query | `search_code` #1 | `hybrid_search` #1 |
|---|---|---|
| "VectorStore search method" | [WRONG] CHANGELOG.md | [OK] store.py |
| "GraphStore build" | [WRONG] CHANGELOG.md | [OK] server.py |
| "embedding provider factory" | [OK] embedder.py | [OK] embedder.py |
| "AST chunking tree-sitter" | [OK] chunker.py | [OK] chunker.py |

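When the query directly names a known symbol, the graph score (`graph_score: 1.00`) contributes its full `(1 - alpha)` weight to the blended score, which is enough to outrank the misleading vector matches in the table above.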

> **Rule of thumb:** use `search_code` for semantic/behaviour queries, `search_graph` for structural/navigation queries, `hybrid_search` when you need both.

---

## How it works

1. **Chunk** — Parses source files with tree-sitter to extract semantic units (functions, classes, methods)
@@ -55,6 +97,9 @@ You don't trigger VecGrep manually - Claude decides when to call the tools based
| "How does authentication work in this codebase?" | `search_code` |
| "Find where database connections are set up" | `search_code` |
| "How many files are indexed?" | `get_index_status` |
| "Build a knowledge graph of my project" | `index_graph` |
| "What calls the VectorStore.search method?" | `search_graph` + `graph_neighbors` |
| "Find code structurally related to authentication" | `hybrid_search` |

**Typical first-time flow:**

@@ -119,6 +164,46 @@ Index status for: /path/to/myproject
Dimensions: 384
```

### `index_graph(path, force=False)`

Build a structural knowledge graph from the codebase using tree-sitter AST extraction. No LLM required — extracts files, functions, classes, and methods as nodes; `contains`, `calls`, `imports`, and `inherits` as directed edges. Independent of the vector index.

```
index_graph("/path/to/myproject")
# → "Graph built: 496 nodes, 1251 edges, 35 files processed."
```
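To make the node and edge kinds concrete, here is a rough sketch of the same idea for a single Python file. It is not VecGrep's implementation: it swaps tree-sitter for Python's standard `ast` module so that it runs without extra dependencies, and the `build_graph` name, the `file::Class::method` ID scheme, and the plain dict/tuple representation are all assumptions made for illustration. Call and inheritance targets are left as unresolved names.

```python
import ast


def build_graph(path, source):
    """Return (nodes, edges) for one Python file.

    nodes: {node_id: {"kind", "label", "file", "line"}}
    edges: list of (source_id, relation, target) tuples
    """
    nodes = {path: {"kind": "file", "label": path, "file": path, "line": 1}}
    edges = []

    def add_node(parent_id, stmt, kind):
        node_id = f"{parent_id}::{stmt.name}"  # hypothetical ID scheme
        nodes[node_id] = {"kind": kind, "label": stmt.name,
                          "file": path, "line": stmt.lineno}
        edges.append((parent_id, "contains", node_id))
        return node_id

    def visit(parent_id, body, in_class):
        for stmt in body:
            if isinstance(stmt, ast.ClassDef):
                class_id = add_node(parent_id, stmt, "class")
                for base in stmt.bases:
                    if isinstance(base, ast.Name):
                        edges.append((class_id, "inherits", base.id))
                visit(class_id, stmt.body, in_class=True)
            elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if in_class else "function"
                func_id = add_node(parent_id, stmt, kind)
                # Record every call made anywhere inside the function body.
                for node in ast.walk(stmt):
                    if isinstance(node, ast.Call):
                        callee = node.func
                        if isinstance(callee, ast.Name):
                            edges.append((func_id, "calls", callee.id))
                        elif isinstance(callee, ast.Attribute):
                            edges.append((func_id, "calls", callee.attr))
            elif isinstance(stmt, ast.Import):
                for alias in stmt.names:
                    edges.append((path, "imports", alias.name))
            elif isinstance(stmt, ast.ImportFrom) and stmt.module:
                edges.append((path, "imports", stmt.module))

    visit(path, ast.parse(source).body, in_class=False)
    return nodes, edges


SAMPLE = """\
import os

class A(Base):
    def f(self):
        g()

def g():
    return os.getcwd()
"""

nodes, edges = build_graph("m.py", SAMPLE)
print(f"Graph built: {len(nodes)} nodes, {len(edges)} edges")
```

A real extractor would also resolve call targets to node IDs and handle nested functions; both are skipped here to keep the sketch short.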

### `search_graph(query, path, limit=20)`

Keyword search over node labels (function names, class names, file names). Returns structural nodes with source location and connectivity degree. Cheap to call: about 47 tokens returned and ~3 ms latency per query in the benchmarks above.

```
search_graph("VectorStore", "/path/to/myproject")
# → [1] CLASS VectorStore (score: 1.00, degree: 39)
# src/vecgrep/store.py:49-352
```
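A keyword search of this kind can be sketched in a few lines. The version below is an illustration only: the data layout, the `search_nodes` name, and the scoring rule (1.0 for an exact label match, otherwise the matched fraction of the label) are assumptions, since the diff does not say how VecGrep scores partial matches.

```python
def search_nodes(nodes, edges, query, limit=20):
    """Case-insensitive substring search over node labels.

    nodes: {node_id: {"kind": str, "label": str}}
    edges: list of (source_id, relation, target_id) tuples
    """
    needle = query.strip().lower()
    if not needle:
        return []

    # Degree = number of edges touching the node, in either direction.
    degree = {}
    for src, _relation, dst in edges:
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1

    hits = []
    for node_id, attrs in nodes.items():
        label = attrs["label"].lower()
        if needle == label:
            score = 1.0
        elif needle in label:
            score = len(needle) / len(label)  # assumed partial-match score
        else:
            continue
        hits.append({"id": node_id, "kind": attrs["kind"],
                     "score": score, "degree": degree.get(node_id, 0)})

    # Best score first; ties broken by connectivity, then by ID for stability.
    hits.sort(key=lambda h: (-h["score"], -h["degree"], h["id"]))
    return hits[:limit]


# Toy graph for demonstration; not real VecGrep output.
NODES = {
    "store.py": {"kind": "file", "label": "store.py"},
    "store.py::VectorStore": {"kind": "class", "label": "VectorStore"},
    "store.py::VectorStore::search": {"kind": "method", "label": "search"},
}
EDGES = [
    ("store.py", "contains", "store.py::VectorStore"),
    ("store.py::VectorStore", "contains", "store.py::VectorStore::search"),
]

for hit in search_nodes(NODES, EDGES, "VectorStore"):
    print(hit["kind"].upper(), hit["id"], hit["score"], hit["degree"])
```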

### `graph_neighbors(node_id, path, depth=1)`

Return the structural neighbourhood of any node — callers, callees, imports, contained methods, and inheritance edges. Use `search_graph` first to find the node ID.

```
graph_neighbors("VectorStore", "/path/to/myproject", depth=1)
# → Callers (18): _get_store, migrate_project, test fixtures...
# Contains (18): search, add_chunks, replace_file_chunks...
```
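The neighbourhood lookup is essentially a breadth-first traversal bounded by `depth`. The sketch below shows one way to do that with the standard library; the CHANGELOG says VecGrep uses `networkx` for BFS, so the plain-dict adjacency, the `neighbors` helper, and the choice to clamp `depth` to 4 rather than reject larger values are assumptions for illustration.

```python
from collections import deque

MAX_DEPTH = 4  # the CHANGELOG states depth is supported up to 4 hops


def neighbors(edges, start, depth=1):
    """Breadth-first walk over edges in both directions.

    Returns (node, relation, direction, hops) tuples, where direction is
    "in" for edges pointing at the visited node (e.g. callers) and "out"
    for edges leaving it (e.g. callees, contained members).
    """
    depth = max(1, min(depth, MAX_DEPTH))  # clamping is an assumption

    adjacency = {}
    for src, relation, dst in edges:
        adjacency.setdefault(src, []).append((relation, dst, "out"))
        adjacency.setdefault(dst, []).append((relation, src, "in"))

    hops = {start: 0}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        if hops[node] == depth:
            continue  # do not expand past the requested depth
        for relation, other, direction in adjacency.get(node, []):
            if other in hops:
                continue
            hops[other] = hops[node] + 1
            found.append((other, relation, direction, hops[other]))
            queue.append(other)
    return found


# Toy edges for demonstration; not real VecGrep output.
EDGES = [
    ("_get_store", "calls", "VectorStore"),
    ("VectorStore", "contains", "VectorStore.search"),
    ("VectorStore.search", "calls", "embed"),
]

for row in neighbors(EDGES, "VectorStore", depth=2):
    print(row)
```

Each node is reported once, with the edge through which it was first reached.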

### `hybrid_search(query, path, top_k=8, alpha=0.6, min_score=0.0)`

Vector similarity search re-ranked by graph proximity. Final score = `alpha * vector_score + (1 - alpha) * graph_score`. Fixes cases where documentation ranks above source code on pure embedding similarity.

```
hybrid_search("VectorStore search method", "/path/to/myproject", alpha=0.6)
# → [1] src/vecgrep/store.py:292-320 (blended: 0.70, vec: 0.49, graph: 1.00)
```

Requires both `index_codebase` and `index_graph` to have been run. Degrades gracefully to pure vector search if the graph index is absent.
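The blending formula and the fallback can be sketched as follows. This is an illustration under stated assumptions, not VecGrep's code: `blend` and `hybrid_rank` are hypothetical helpers, scores are assumed to be already normalised to `[0, 1]`, a chunk with no graph match is assumed to get a graph score of 0.0, and the 0.55 vector score for `CHANGELOG.md` is a made-up value chosen to reproduce the ranking problem described above. The 0.49 and 1.00 inputs come from the example output.

```python
def blend(vector_score, graph_score, alpha=0.6):
    """alpha * vector_score + (1 - alpha) * graph_score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * vector_score + (1 - alpha) * graph_score


def hybrid_rank(vector_hits, graph_scores=None, alpha=0.6, top_k=8, min_score=0.0):
    """Rank chunks by blended score.

    vector_hits:  {chunk_id: vector_score}
    graph_scores: {chunk_id: graph_score}, or None when no graph index exists
    """
    ranked = []
    for chunk_id, vec in vector_hits.items():
        if graph_scores is None:
            score = vec  # no graph index: fall back to pure vector ranking
        else:
            score = blend(vec, graph_scores.get(chunk_id, 0.0), alpha)
        if score >= min_score:
            ranked.append((chunk_id, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


vector_hits = {
    "CHANGELOG.md": 0.55,                 # hypothetical value
    "src/vecgrep/store.py:292-320": 0.49,  # from the example output
}
graph_scores = {"src/vecgrep/store.py:292-320": 1.0}

print(hybrid_rank(vector_hits, graph_scores))  # store.py ranks first
print(hybrid_rank(vector_hits, None))          # CHANGELOG.md ranks first
```

With `alpha=0.6` the store.py chunk scores 0.6 × 0.49 + 0.4 × 1.00 = 0.694, close to the 0.70 in the example output; the small gap is presumably due to the displayed inputs being rounded.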

## Configuration

VecGrep can be tuned via environment variables:
@@ -217,7 +302,7 @@ The embedding model used by VecGrep is [`all-MiniLM-L6-v2-code-search-512`](http

| | |
|---|---|
| **Questions** | [Start a Q&A discussion](https://github.com/VecGrep/VecGrep/discussions/new?category=q-a) |
| 💡 **Ideas** | [Share an idea](https://github.com/VecGrep/VecGrep/discussions/new?category=ideas) |
| 🚀 **Show & Tell** | [Share how you use VecGrep](https://github.com/VecGrep/VecGrep/discussions/new?category=show-and-tell) |
| 🐛 **Bugs** | [Open an issue](https://github.com/VecGrep/VecGrep/issues/new) |
| ? **Questions** | [Start a Q&A discussion](https://github.com/VecGrep/VecGrep/discussions/new?category=q-a) |
| + **Ideas** | [Share an idea](https://github.com/VecGrep/VecGrep/discussions/new?category=ideas) |
| > **Show & Tell** | [Share how you use VecGrep](https://github.com/VecGrep/VecGrep/discussions/new?category=show-and-tell) |
| ! **Bugs** | [Open an issue](https://github.com/VecGrep/VecGrep/issues/new) |
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -23,6 +23,8 @@ dependencies = [
"lancedb>=0.6,<1.0",
"pyarrow>=14.0",
"watchdog>=4.0,<5.0",
"networkx>=3.2",
"tree-sitter==0.21.3",
]

[project.urls]