
perf(parser): replace regex tokenizer with state-machine + keyword trie #12

Merged
imrim12 merged 1 commit into main from claude/state-machine-parser-cVot3 on Apr 27, 2026

imrim12 (Owner) commented on Apr 27, 2026

Introduces TokenizerFSM, a hand-rolled state machine that tokenizes via:

  • a character-keyed trie for English/Vietnamese (multi-word) keywords with
    per-entry boundary rules (\b for ASCII, custom for VI, none for else/return/
    try/as/from/const/async — matching the original regex spec exactly);
  • bounded backtracking only when distinguishing multi-word identifiers from
    multi-word keywords (peek next word, rewind on no match);
  • direct char-code dispatch for whitespace/comments/numbers/operators,
    eliminating the per-token linear scan over ~80 regex specs and the
    String#slice + concat in the regex tokenizer's hot path.
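
The trie lookup and bounded rewind described above can be sketched roughly as follows. This is an illustration only, not the PR's `TokenizerFSM`: the names `TrieNode`, `insertKeyword`, and `matchKeyword`, the token types, and the sample multi-word keyword `end if` are assumptions, and the boundary rule here only inspects the character after the match.

```typescript
// Illustrative sketch only; names and sample keywords are hypothetical,
// not taken from the PR's source.
type Boundary = 'word' | 'none';

interface Keyword { type: string; boundary: Boundary; }

interface TrieNode {
  children: Map<string, TrieNode>;
  keyword?: Keyword; // set only on nodes where a complete keyword ends
}

function insertKeyword(root: TrieNode, text: string, keyword: Keyword): void {
  let node = root;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    let next = node.children.get(ch);
    if (next === undefined) {
      next = { children: new Map() };
      node.children.set(ch, next);
    }
    node = next;
  }
  node.keyword = keyword;
}

// ASCII word-character test by char code (NaN past end of input yields false).
function isWordCode(code: number): boolean {
  return (code >= 48 && code <= 57) || (code >= 65 && code <= 90) ||
         (code >= 97 && code <= 122) || code === 95;
}

// Walk the trie from `pos`, remembering the longest keyword whose boundary
// rule holds. If a longer multi-word candidate fails, the remembered shorter
// match is the rewind target; the walk is bounded by the longest keyword.
function matchKeyword(
  root: TrieNode, input: string, pos: number,
): { type: string; end: number } | null {
  let node = root;
  let best: { type: string; end: number } | null = null;
  for (let i = pos; i < input.length; i++) {
    const next = node.children.get(input.charAt(i));
    if (next === undefined) break;
    node = next;
    const kw = node.keyword;
    if (kw !== undefined) {
      const end = i + 1;
      if (kw.boundary === 'none' || !isWordCode(input.charCodeAt(end))) {
        best = { type: kw.type, end };
      }
    }
  }
  return best;
}

const root: TrieNode = { children: new Map() };
insertKeyword(root, 'end', { type: 'END', boundary: 'word' });
insertKeyword(root, 'end if', { type: 'END_IF', boundary: 'word' });
insertKeyword(root, 'else', { type: 'ELSE', boundary: 'none' });
```

In this sketch, `matchKeyword(root, 'end ifx', 0)` falls back to the shorter `END` match ending at offset 3, because the longer `end if` candidate is followed by a word character and fails its boundary check.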

Wired through Parser via a tokenizerKind: 'fsm' | 'regex' option (default
'fsm'). The legacy Tokenizer is kept side-by-side for parity testing and
rollback.
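
A minimal sketch of how such a switch and a parity check might look. Only the option values `'fsm' | 'regex'` and the `'fsm'` default come from the description above; `TokenSource`, `createTokenizer`, `sameTokens`, and both stub tokenizers are hypothetical stand-ins that merely split on whitespace.

```typescript
// Illustrative sketch only; the stubs below are not the PR's tokenizers.
type TokenizerKind = 'fsm' | 'regex';

interface Token { type: string; value: string; }

interface TokenSource { tokenize(input: string): Token[]; }

// Stand-in for the legacy tokenizer: regex-driven word splitting.
class RegexTokenizerStub implements TokenSource {
  tokenize(input: string): Token[] {
    return (input.match(/\S+/g) ?? []).map((value) => ({ type: 'WORD', value }));
  }
}

// Stand-in for the state machine: char-code dispatch, no regex.
// Treats only space, tab, LF, and CR as whitespace (narrower than \S).
class FsmTokenizerStub implements TokenSource {
  tokenize(input: string): Token[] {
    const tokens: Token[] = [];
    let start = -1;
    for (let i = 0; i <= input.length; i++) {
      // Pretend there is a trailing space so the last word is flushed.
      const code = i < input.length ? input.charCodeAt(i) : 32;
      const isSpace = code === 32 || code === 9 || code === 10 || code === 13;
      if (!isSpace && start < 0) start = i;
      if (isSpace && start >= 0) {
        tokens.push({ type: 'WORD', value: input.slice(start, i) });
        start = -1;
      }
    }
    return tokens;
  }
}

// Default mirrors the description: 'fsm' unless the caller opts back into 'regex'.
function createTokenizer(kind: TokenizerKind = 'fsm'): TokenSource {
  return kind === 'fsm' ? new FsmTokenizerStub() : new RegexTokenizerStub();
}

// Parity check: both token streams must agree element by element.
function sameTokens(a: Token[], b: Token[]): boolean {
  if (a.length !== b.length) return false;
  return a.every((t, i) => t.type === b[i]?.type && t.value === b[i]?.value);
}
```

Keeping both implementations behind one interface is what makes a parity test cheap: run the same fixture through `createTokenizer('fsm')` and `createTokenizer('regex')` and compare with `sameTokens`.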

Tests: 402/402 pass (regex baseline was 347; +55 new tests for fixtures,
parity, snapshots, and the FSM tokenizer itself).

Bench (vitest bench, side-by-side):
  tiny           1.8k hz → 364k hz   (~206x)
  medium          48 hz → 26k hz     (~546x)
  keywordHeavy    33 hz → 20k hz     (~609x)
  stringHeavy    143 hz → 69k hz     (~484x)
  large (14k)    1.1 hz → 3.3k hz    (~2969x; eliminates super-linear regex behavior)

Adds a benchmark harness (vitest bench, fixtures + tokenizer + parser benches),
captured baseline JSON, and token-stream snapshots so future tokenizer changes
are guarded against silent drift.
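
One way a token-stream snapshot can guard against drift, sketched under assumptions: the token shape, `serializeTokens`, and `firstDrift` are hypothetical, and the PR's actual snapshot format is not shown here. The idea is to serialize each token to one stable line and report the first line that differs from the stored snapshot.

```typescript
// Illustrative sketch only; the real snapshot format may differ.
interface Token { type: string; value: string; start: number; }

// One token per line: offset, type, JSON-quoted value (so newlines in
// values cannot break the line-based layout).
function serializeTokens(tokens: Token[]): string {
  return tokens
    .map((t) => `${t.start}\t${t.type}\t${JSON.stringify(t.value)}`)
    .join('\n');
}

// Returns the 0-based index of the first differing line, or null if the
// stored snapshot and the fresh output are identical.
function firstDrift(expected: string, actual: string): number | null {
  const a = expected.split('\n');
  const b = actual.split('\n');
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return null;
}
```

A test would then compare `serializeTokens(tokenizer.tokenize(fixture))` against the stored snapshot text and fail with the index from `firstDrift` when the two diverge.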

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o

imrim12 merged commit cf86a4e into main on Apr 27, 2026
1 of 9 checks passed
imrim12 deleted the claude/state-machine-parser-cVot3 branch on May 1, 2026, 07:45