Local-first lexical code search for repositories, delivered as one TypeScript core with three thin interfaces:
codesiftCLI@codesift/coreSDK@codesift/mcpserver
M5 interfaces are complete on top of the completed M3/M4 slices: streamable HTTP MCP transport with optional bearer token, opt-in cloud embedding providers (Voyage, OpenAI) with secret-scan/redaction, repo config + provider resolution + guided --rebuild, and a frozen SDK surface with typedoc.
Implemented today:
- repository scan with
.gitignore/.codesiftignoresupport - TS/JS structural chunking via the TypeScript compiler API
- Python structural definition chunking
- M3 parser-quality structural scanners + symbols for Go, Java, Ruby, and Rust
- accepted M3 decision: no tree-sitter WASM dependency yet; deterministic scanners avoid native/postinstall grammar pain and are hardened with comment/string masking plus real-world fixtures
- heading-aware Markdown chunks and top-level key/section chunks for JSON, YAML, and TOML
- fallback line-window chunking for other supported text files
- SQLite-backed local index with FTS and lazy
sqlite-vecloading - end-to-end
index,search,sym,grep,status, andcleanCLI flows - manifest-diff incremental
sync()/indexupdates for changed, touched, and removed files - content-addressed embedding cache so delete+add/rename cases with identical code reuse embeddings
- stable chunk ids plus on-demand chunk/range reads from disk
- real MCP stdio transport with
search_code,find_symbol,grep_code,read_chunk, andindex_status - token-budgeted compact search results (
maxTokens/max_tokens), overlap dedupe, single-best identifier answers, query-centered snippets, reason tags, and stale-hit annotations - oversized structural chunk splitting, nested ignore-file handling, default vendor ignores, generated/minified code down-ranking, generated result annotations, and generated counts in status
- real
status().stalefrom mtime/size manifest drift plus git branch/HEAD drift status().sync/ MCPindex_statuscrash-state metadata for running, failed, and aborted syncs- shadow database sync writes with atomic index-file swap so failed rebuilds keep the previous index readable
- native
fs.watch-basedwatch()/codesift index --watchwith a safety poll fallback, refreshing through the same incremental path - daemon-backed
codesift mcpshim: the CLI fast-path starts/proxies to a long-lived local daemon so repo handles and MCP routing are amortized across agent sessions - real streamable HTTP MCP transport (
codesift serve) over the same tool registry, binding127.0.0.1by default with an optional constant-time bearer token - opt-in cloud embedding providers
voyage-code-3andopenai-text-embedding-3-small(lazy key reads, no egress until an embed runs) behind the sameEmbeddingProviderinterface - secret-scan + redaction gate on the cloud document-embed path (
--allow-secrets); the local default path never egresses .codesift/config.json+codesift config get|setprovider/model/ignore/allowSecrets, with provider resolution precedence and guided--rebuildon a provider switch- frozen
@codesift/coreSDK surface with a typedoc reference (pnpm run docs) and a documented quickstart proven by a test - pinned-OSS + local M3 fixture eval harness with paired tokens-to-resolution plus stdio cold-start latency vs ripgrep and a checked-in loss budget
For the current one-call structural-search moat, including diagrams of the query path and efficiency model, see docs/moat.md.
Still intentionally deferred to later milestones:
- production-default learned embedding provider (cloud providers ship opt-in in M5; a default learned/local model is M6)
- optional future tree-sitter WASM migration if bundled grammars can be added without install pain
- sqlite-vec
vec0virtual-table vector arm (M6; gated on a default-supported learned provider — seePLAN.md§12.8/§12.11) - broader M6-quality golden sets and learned-vector ranking work
| Environment | Node | Notes |
|---|---|---|
macOS (macos-latest) |
20.x, 22.x | GitHub Actions coverage |
Ubuntu (ubuntu-latest) |
20.x, 22.x | glibc coverage |
Alpine Linux (node:<version>-alpine) |
20.x, 22.x | musl coverage + packed-install smoke test |
Windows (windows-latest) |
20.x, 22.x | GitHub Actions coverage |
- Telemetry: none.
- Default local
index/search/sym/statusflows are covered by offline / zero-egress CI checks. - Cloud embedding is opt-in only (explicit provider config + an API key env var); content is secret-scanned and refuses to send without
--allow-secrets, which redacts first. .codesift/self-installs a local.gitignorewith*on first open/index so the index never shows up ingit status.- If
sqlite-vecis unavailable, lexical and symbol queries still work; vector search reports degraded mode instead of failing at repo open.
- Default ignored directories include
vendor/,third_party/, and__generated__/; nested.gitignoreand.codesiftignorefiles are honored. - Generated files are detected from path patterns (
*.generated.*,*.gen.*,*.pb.go,*_pb2.py,*_pb.rb,*.designer.*), header markers (@generated,Code generated by,DO NOT EDIT, etc.), and minified shape (average nonblank line length >300 or any line >2000 chars). - Generated files outside ignored directories are indexed, not dropped: search down-ranks them, result formatters annotate them, and
statusreports generated file/chunk counts. - Oversized structural chunks split into bounded overlapping windows before indexing; Markdown headings and JSON/YAML/TOML top-level keys become named chunks.
packages/
core/ @codesift/core
cli/ codesift
mcp/ @codesift/mcp
eval/ private eval harness
Use Node 22 locally (.nvmrc) on macOS arm64 so better-sqlite3 can use its published prebuilds.
If you normally run with ignore-scripts=true, this repo overrides it via .npmrc because the native SQLite deps must run their install hooks.
pnpm install
pnpm build
pnpm test
node packages/cli/dist/bin.js index .
node packages/cli/dist/bin.js search "where is the sqlite database opened" -k 5
node packages/cli/dist/bin.js grep -e "SqliteRepo" --path 'packages/core/**'
node packages/cli/dist/bin.js sym SqliteRepoAfter indexing a repo, point an MCP client at the stdio command:
codesift mcp /path/to/repoRouting policy for agents: find_symbol for identifiers/definitions, grep_code for exact strings or regex, and search_code for behavior/concept queries. Keep host grep as fallback, not the first tool. search_code is compact by default and accepts max_tokens for strict context budgets.
Cold-start note: codesift mcp is now a small stdio shim that starts or reuses a local codesift daemon, then proxies MCP JSON-RPC to it. The daemon exits after an idle timeout and can be pinned with CODESIFT_DAEMON_SOCKET / CODESIFT_DAEMON_IDLE_MS when tests need isolation.
codesift serve /path/to/repo --port 7345 --token <bearer> # streamable HTTP MCP, binds 127.0.0.1serve exposes the same five tools over the MCP streamable-HTTP transport. It binds 127.0.0.1 by default; pass --host to widen and --token to require Authorization: Bearer <token> (compared in constant time). Use it for a second process/machine on a trusted network — multi-repo/team auth remains post-MVP.
Local lexical search is the zero-config default and never leaves the machine. To use a learned cloud provider:
export VOYAGE_API_KEY=... # or OPENAI_API_KEY
codesift config set provider voyage-code-3 # or openai-text-embedding-3-small
codesift index . --rebuild # rebuild with the new provider's vectorsBefore any cloud send, indexed content is secret-scanned: a detected secret aborts the sync unless you pass index --allow-secrets, which sends a redacted copy instead. API keys are read lazily and only on the embed path; the default local flows stay zero-egress (enforced by pnpm run test:offline). Config precedence: explicit SDK providerId > CODESIFT_EMBEDDING_PROVIDER env > .codesift/config.json > local default.
See docs/sdk.md for the frozen @codesift/core quickstart; pnpm run docs generates the full typedoc API reference into docs/api/.
pnpm build
pnpm test
pnpm typecheck
pnpm run test:offline
pnpm --filter @codesift/eval run bench
pnpm run test:smoke-install
pnpm run docs
pnpm run ciSee PLAN.md for the full product plan.