
poc: WASM/wazero tree-sitter backend (speed + stability vs cgo PR #80) #81

Draft
dvcdsys wants to merge 1 commit into develop from feat/chunker-wasm-treesitter

dvcdsys commented Jun 7, 2026

Draft / PoC for comparison — not for merge. Alternative to the cgo backend in #80, to decide direction.

Official tree-sitter C runtime + TypeScript grammar → standalone wasm32-wasi module (zig cc), driven from Go via wazero. No cgo, no JS, no third-party parser — only the wazero host (poc/wasm-treesitter/wasmts.go) is ours.

Speed — same 852-file vscode TS corpus, full-tree walk

| backend | wall time | files/s | ERROR trees | editorOptions.ts |
| --- | --- | --- | --- | --- |
| gotreesitter (pure-Go) | 13.83s | 62 | 13 | 8.77s → ERROR |
| WASM (wazero) | ~2.5s | ~330 | 0 | 49ms |
| cgo (native, #80) | 1.26s | 675 | 0 | 17ms |
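
The throughput column and the headline ratios can be rechecked from the wall times alone. The snippet below is only a sanity-check sketch: the constants are copied from the table, the node count and calls-per-node figures come from the commit message, and the per-call figure assumes the entire WASM-vs-cgo gap is boundary overhead, so it is an upper bound rather than a measurement.

```go
package main

import "fmt"

// Figures copied from the benchmark table. The WASM wall time is reported
// only approximately (~2.5s), so results derived from it are approximate too.
const (
	corpusFiles = 852.0
	goTSWall    = 13.83 // seconds, gotreesitter (pure-Go)
	wasmWall    = 2.5   // seconds, WASM (wazero)
	cgoWall     = 1.26  // seconds, cgo (native)
)

// filesPerSec converts a whole-corpus wall time into throughput.
func filesPerSec(wallSeconds float64) float64 {
	return corpusFiles / wallSeconds
}

// boundaryNanosPerCall spreads the WASM-vs-cgo wall-time gap evenly over all
// host<->guest calls. Assumption: the whole gap is boundary overhead.
func boundaryNanosPerCall(nodes, callsPerNode float64) float64 {
	return (wasmWall - cgoWall) / (nodes * callsPerNode) * 1e9
}

func main() {
	fmt.Printf("gotreesitter: %.0f files/s\n", filesPerSec(goTSWall))
	fmt.Printf("WASM:         %.0f files/s\n", filesPerSec(wasmWall))
	fmt.Printf("cgo:          %.0f files/s\n", filesPerSec(cgoWall))
	fmt.Printf("WASM vs cgo:          %.1fx slower\n", wasmWall/cgoWall)
	fmt.Printf("WASM vs gotreesitter: %.1fx faster\n", goTSWall/wasmWall)
	fmt.Printf("boundary cost: <= %.0f ns per call\n", boundaryNanosPerCall(2.68e6, 3))
}
```

With the rounded ~2.5s wall time the WASM row comes out slightly above the table's ~330 files/s, which is consistent with the wall time having been rounded down. The per-call bound works out to roughly 150 ns.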

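WASM is ~2× slower than cgo and ~5× faster than gotreesitter, and it parses correctly (0 ERROR trees). The overhead is the per-node host↔guest call boundary (~3 calls per node × 2.68M nodes), not memory; a batched subtree export could mitigate most of it.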
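
To make the batching idea concrete, here is a minimal host-side sketch of what decoding a serialized subtree might look like. Everything in it is an assumption for illustration: the PoC does not define this export, and the 16-byte record layout (symbol, start byte, end byte, child count, all little-endian `uint32`, nodes in pre-order), the `Node` type, and the helper names are hypothetical. The point is only that one guest call could hand back a flat buffer that the host then walks without crossing the boundary again.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// recordSize is the size of one node record in the hypothetical layout:
// symbol, startByte, endByte, childCount, each a little-endian uint32.
const recordSize = 16

// Node is a hypothetical host-side view of a syntax node.
type Node struct {
	Symbol    uint32
	StartByte uint32
	EndByte   uint32
	Children  []*Node
}

// decodeSubtree rebuilds a tree from a pre-order buffer and returns the
// unread remainder. It is recursive, so very deep trees would need an
// explicit stack instead.
func decodeSubtree(buf []byte) (*Node, []byte, error) {
	if len(buf) < recordSize {
		return nil, nil, errors.New("truncated node record")
	}
	n := &Node{
		Symbol:    binary.LittleEndian.Uint32(buf[0:4]),
		StartByte: binary.LittleEndian.Uint32(buf[4:8]),
		EndByte:   binary.LittleEndian.Uint32(buf[8:12]),
	}
	childCount := binary.LittleEndian.Uint32(buf[12:16])
	rest := buf[recordSize:]
	for i := uint32(0); i < childCount; i++ {
		child, r, err := decodeSubtree(rest)
		if err != nil {
			return nil, nil, err
		}
		n.Children = append(n.Children, child)
		rest = r
	}
	return n, rest, nil
}

// appendRecord writes one record; it exists only to build demo input here.
func appendRecord(buf []byte, symbol, start, end, children uint32) []byte {
	for _, v := range []uint32{symbol, start, end, children} {
		buf = binary.LittleEndian.AppendUint32(buf, v)
	}
	return buf
}

// countNodes returns the number of nodes in the decoded tree.
func countNodes(n *Node) int {
	total := 1
	for _, c := range n.Children {
		total += countNodes(c)
	}
	return total
}

func main() {
	// A root spanning bytes 0..10 with two leaf children.
	var buf []byte
	buf = appendRecord(buf, 1, 0, 10, 2)
	buf = appendRecord(buf, 2, 0, 4, 0)
	buf = appendRecord(buf, 3, 5, 10, 0)

	root, rest, err := decodeSubtree(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes:", countNodes(root), "unread bytes:", len(rest))
}
```

Little-endian matches WebAssembly linear memory, so a guest written in C could emit these records with plain stores. Under this scheme the boundary is crossed once per subtree rather than ~3 times per node.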

Stability

tree-sitter is robust on adversarial input under both backends. WASM additionally contains guest faults: a resource limit or trap in the guest becomes a recoverable Go error and the host stays alive, where cgo would SIGSEGV the whole process. This is insurance against unknown C bugs, not a fix for an observed crash.

Decision framing

~2× parse cost (largely invisible end-to-end — embeddings dominate) in exchange for CGO_ENABLED=0 builds, crash-isolation, and a likely smaller binary. Cost: engineering effort to build/bundle all 31 grammars + flesh out the node API. Full write-up in poc/wasm-treesitter/README.md.

🤖 Generated with Claude Code

Alternative to feat/chunker-cgo-treesitter: the official tree-sitter C runtime
+ TypeScript grammar compiled to a standalone wasm32-wasi reactor module
(build.sh, via zig cc) and driven from Go through wazero — no cgo, no JS, no
third-party parser. Only the wazero host (wasmts.go) is bespoke; the parser is
unmodified upstream C. wasm_store.c is gated by TREE_SITTER_FEATURE_WASM (we
don't define it), so the stock amalgamation compiles to wasi with no stubs.

Measured on the same 852-file vscode TypeScript corpus (full-tree walk):

  backend                     wall    files/s  ERROR trees  editorOptions.ts
  gotreesitter (pure-Go)     13.83s     62        13        8.77s -> ERROR
  WASM (wazero, pure-Go)     ~2.5s     ~330        0         49ms
  cgo (native)                1.26s    675         0         17ms

- WASM ~2x slower than cgo, ~5x faster than gotreesitter, correct (0 errors).
- Overhead is the per-node host<->guest call boundary (~3 calls/node x 2.68M
  nodes), not memory — slot-pooling barely moved it. A batched "serialize
  subtree" export would close most of the gap (future work).
- Stability: tree-sitter is robust on adversarial input under both backends;
  WASM additionally CONTAINS faults (resource/guest trap -> recoverable Go
  error, host alive) where cgo would SIGSEGV the whole process. Insurance vs
  unknown C bugs, not a fix for an observed crash.

Trade-off vs cgo: ~2x parse cost (largely invisible end-to-end since embeddings
dominate) in exchange for CGO_ENABLED=0 builds, crash-isolation, and a likely
smaller binary; cost is the engineering effort to build/bundle all 31 grammars
and flesh out the node API. README.md has the full comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>