Problem
CodeLens currently supports ~10 languages, each with its own hand-written parser, so every new language requires a new parser file. This doesn't scale, and agents working on Go, Rust (beyond current support), Java, C++, etc. get no structural analysis.
Proposed Approach: Universal Grammar Loader
tree-sitter has grammars for 150+ languages available as npm packages. Instead of writing per-language parsers, write one generic extraction layer that works on any tree-sitter grammar:
# Maps tree-sitter node type names -> CodeLens concept
UNIVERSAL_NODE_TYPES = {
    'function_definition': 'Function',
    'function_declaration': 'Function',
    'method_definition': 'Method',
    'class_declaration': 'Class',
    'class_definition': 'Class',
    'import_statement': 'Import',
    'call_expression': 'CallSite',
    # ... etc
}
Many tree-sitter grammars use similar node type names, so a universal mapper of ~100 lines should cover roughly 80% of languages. Language-specific overrides handle the remaining 20%.
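A minimal sketch of how the universal table and the per-language overrides could combine. It assumes only that a parsed node exposes `type` and `children`, as py-tree-sitter nodes do. The names `LANGUAGE_OVERRIDES`, `mapping_for`, `extract_concepts`, and `FakeNode` are hypothetical, and the Ruby entries are illustrative and should be checked against that grammar's node types. `FakeNode` stands in for a real parse tree so the example runs without any grammar installed.

```python
from dataclasses import dataclass, field

# Subset of the mapping table above: tree-sitter node type -> CodeLens concept.
UNIVERSAL_NODE_TYPES = {
    "function_definition": "Function",
    "function_declaration": "Function",
    "method_definition": "Method",
    "class_declaration": "Class",
    "class_definition": "Class",
    "import_statement": "Import",
    "call_expression": "CallSite",
}

# Hypothetical per-language overrides (illustrative values, not verified
# against the grammar's node-types.json).
LANGUAGE_OVERRIDES = {
    "ruby": {"method": "Method", "class": "Class", "call": "CallSite"},
}


def mapping_for(lang):
    """Return the universal table with the language's overrides layered on top."""
    return {**UNIVERSAL_NODE_TYPES, **LANGUAGE_OVERRIDES.get(lang, {})}


def extract_concepts(root, mapping):
    """Walk the tree depth-first and return (concept, node_type) pairs in document order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        concept = mapping.get(node.type)
        if concept is not None:
            found.append((concept, node.type))
        # Push children reversed so they are popped left to right.
        stack.extend(reversed(node.children))
    return found


@dataclass
class FakeNode:
    """Stand-in for a tree-sitter node; keeps the demo dependency-free."""
    type: str
    children: list = field(default_factory=list)


tree = FakeNode("module", [
    FakeNode("import_statement"),
    FakeNode("class_definition", [
        FakeNode("block", [
            FakeNode("function_definition", [FakeNode("call_expression")]),
        ]),
    ]),
])

for concept, node_type in extract_concepts(tree, mapping_for("python")):
    print(f"{concept:<10} {node_type}")
```

Keeping overrides as a plain dict merge means a language file only lists the names that differ, and an unmapped node type is simply skipped rather than treated as an error.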
Implementation Steps
- Write universal_parser.py that loads any tree-sitter-{lang} grammar dynamically
- Define the node type mapping table
- Add language -> grammar package mapping (e.g. go -> tree-sitter-go)
- Install grammars on-demand via npm/pip in setup.sh
- Language-specific override files for languages where naming differs (Ruby, Haskell, etc.)
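The language -> grammar package mapping and the dynamic loading step could look roughly like the sketch below. It assumes the pip-distributed grammars, which are importable as `tree_sitter_<lang>` and typically expose a `language()` function; some packages name that function differently, which is why the registry stores it per language. `GRAMMAR_PACKAGES`, `GrammarUnavailable`, and `load_grammar` are hypothetical names. Wrapping the returned handle in py-tree-sitter's `Language` is left out on purpose, because the constructor differs between releases.

```python
import importlib

# Hypothetical registry: language -> (pip package, import name, factory function).
GRAMMAR_PACKAGES = {
    "go": ("tree-sitter-go", "tree_sitter_go", "language"),
    "rust": ("tree-sitter-rust", "tree_sitter_rust", "language"),
    "java": ("tree-sitter-java", "tree_sitter_java", "language"),
}


class GrammarUnavailable(LookupError):
    """Raised when a language is unregistered or its grammar package is missing."""


def load_grammar(lang, importer=importlib.import_module):
    """Import the grammar module for `lang` and return its raw language handle.

    `importer` is injectable so the lookup logic can be exercised without
    any grammar package installed.
    """
    try:
        package, module_name, factory = GRAMMAR_PACKAGES[lang]
    except KeyError:
        raise GrammarUnavailable(f"no grammar registered for {lang!r}") from None
    try:
        module = importer(module_name)
    except ImportError as exc:
        # setup.sh (or an on-demand installer) would run this command.
        raise GrammarUnavailable(
            f"{package} is not installed; try: pip install {package}"
        ) from exc
    return getattr(module, factory)()


if __name__ == "__main__":
    try:
        handle = load_grammar("go")
        print("loaded grammar handle:", type(handle).__name__)
    except GrammarUnavailable as err:
        print(err)
```

Failing with a message that names the missing package keeps on-demand installation simple: the caller can surface the command to the user or run it and retry.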
Custom Kernel Idea (long-term)
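For maximum performance, compile all grammars into a single shared library (.so file), for example with py-tree-sitter's Language.build_library(). Note that this helper exists only in older py-tree-sitter releases (it was removed in 0.22), so a current implementation would need its own build step. A single library avoids loading each grammar separately in every process and could be distributed as a wheel.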
This is essentially what codebase-memory-mcp does in C — vendoring all 158 grammars into one binary. The Python equivalent is a compiled .so with all languages baked in.