feat(chunker): migrate to official tree-sitter via cgo #80
Open
dvcdsys wants to merge 2 commits into
Replace the pure-Go gotreesitter parser with the official tree-sitter (github.com/tree-sitter/go-tree-sitter v0.25) compiled via cgo. gotreesitter produced ERROR trees and ~650x slowdowns on large valid TypeScript files (e.g. vscode editorOptions.ts: 8.5s -> ERROR vs official 13ms -> 0 errors) and had a C `enum` GLR regression; the official parser fixes both.

Grammars: 25 languages are consumed as upstream Go modules (their bindings/go package compiles parser.c via cgo from the module cache, so the C stays out of this repo). Six holdouts with no usable Go binding are vendored in-tree under internal/chunker/tsgrammars/: markdown, objc, scss, solidity, r (binding omits its external scanner) and sql (no committed generated parser.c). See vendor.sh.

- chunker.go: new ts API (Kind() vs Type(lang), ParseWithOptions progress callback for the wall-clock deadline replacing the deprecated cancellation flag, tree/parser Close()); registry factories wrap the C TSLanguage via ts.NewLanguage; languages without a vendored/module grammar degrade to sliding-window instead of erroring.
- sql node map updated to the DerekStride grammar (create_table/create_function rather than the *_statement names the old grammar exposed).
- C enum regression test un-skipped (now passes); all 31 languages validated by TestRegistry_NodeNamesMatchAST.

Build stays a static binary (cgo links the grammar C statically); CPU image remains distroless/static. Binary grows ~41MB -> ~78MB from the compiled grammar tables.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chunker now uses cgo (tree-sitter grammars are C), so the images can no
longer build with CGO_ENABLED=0. Both Dockerfiles already use the musl-based
golang:1.25-alpine builder, so add gcc+musl-dev and link statically against
musl (-linkmode external -extldflags -static, tags osusergo netgo). The result
is still a loader-free static binary:
- CPU image stays on distroless/static-debian12 (verified: `file` reports
"statically linked", container boots, /health returns {"status":"ok"}, all 31
chunker languages register).
- CUDA image stays on distroless/cc-debian13; the static cix-server runs there
unchanged (its glibc is only for the llama-server sidecar).
No runtime/behaviour change for operators — same base images, same healthcheck.
The binary grows ~41MB -> ~78MB from the compiled grammar tables.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Replaces the pure-Go `gotreesitter` parser with the official tree-sitter (`github.com/tree-sitter/go-tree-sitter` v0.25) compiled via cgo.

Why
`gotreesitter` produced ERROR trees and ~650× slowdowns on large valid TypeScript files and had a C `enum` GLR regression. Measured on vscode files:
- `editorOptions.ts` (250 KB): 8.5s -> ERROR with `gotreesitter` vs 13ms -> 0 errors with the official parser
- `extHostTypeConverters.ts`

This was also the root cause of the prod 100 GB OOM when indexing `microsoft/vscode` (see [project_gotreesitter_oom] memory / earlier #76/#78 work): the C parser frees per-tree and stays bounded.

How
Grammars (31 languages, no regression):
- 25 are consumed as upstream Go modules: their `bindings/go` compiles `parser.c` via cgo from the module cache, so the C stays out of this repo.
- 6 are vendored in-tree under `internal/chunker/tsgrammars/` because no usable Go binding exists: `markdown`, `objc`, `scss`, `solidity` (no `bindings/go`), `sql` (no committed generated `parser.c`), `r` (binding omits its external scanner → link failure). See `tsgrammars/vendor.sh` for the pins.

Chunker: new ts API (`Kind()`, `ParseWithOptions` progress-callback deadline replacing the deprecated cancellation flag, `tree`/`parser` `Close()`); registry wraps the C `TSLanguage` via `ts.NewLanguage`; languages without a grammar degrade to sliding-window instead of erroring. SQL node map updated to the DerekStride grammar (`create_table`/`create_function`). C-enum regression test un-skipped (now passes).

Build: Docker users notice nothing. Both Dockerfiles already use the musl-based `golang:1.25-alpine` builder, so we add `gcc musl-dev` and link statically against musl (`-linkmode external -extldflags -static`, tags `osusergo netgo`). The binary stays loader-free:
- CPU image stays on `distroless/static-debian12`. Verified: `file` reports `statically linked`, container boots, `/health` = `{"status":"ok"}`, all 31 languages register.
- CUDA image stays on `distroless/cc-debian13` (the static binary runs there unchanged).

Testing
- `go test -race ./...`: all packages green, 0 FAIL. All 31 languages validated by `TestRegistry_NodeNamesMatchAST`.
- `editorOptions.ts`, which used to fall to sliding-window with zero symbols, now yields real symbols (`cix def IEditorOptions` → `[type]`, `EditorBooleanOption`/`ApplyUpdateResult` → `[class]`).
- Reindexed `github.com/microsoft/vscode@main` via `POST /api/v1/projects/{hash}/reindex`, the exact prod path that OOM'd. RSS stayed bounded ~2.1 GB (was 100 GB) throughout, productively chunking+embedding, cleanly force-stoppable. (Stopped before completion: full index is hours on a local embedder.)

Notes / follow-ups
- CUDA image build (`make scout-cuda`): not buildable in this environment.
- `SQLITE_BUSY` job-claim contention when a `cix watch` daemon runs concurrently with an index job: unrelated to this change, worth a separate look.

🤖 Generated with Claude Code