perf(parser): replace regex tokenizer with state-machine + keyword trie by imrim12 · Pull Request #12 · imrim12/vietscript

imrim12 · 2026-04-27T19:28:39Z

Introduces TokenizerFSM, a hand-rolled state machine that tokenizes via:

- a character-keyed trie for English/Vietnamese (multi-word) keywords with
  per-entry boundary rules (\b for ASCII, custom for VI, none for else/return/
  try/as/from/const/async, matching the original regex spec exactly);
- bounded backtracking only when distinguishing multi-word identifiers from
  multi-word keywords (peek next word, rewind on no match);
- direct char-code dispatch for whitespace/comments/numbers/operators,
  eliminating the per-token linear scan over ~80 regex specs and the
  String#slice + concat in the regex tokenizer's hot path.
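
A minimal sketch of the trie lookup described above, not the code in this PR. All names (`TrieNode`, `KeywordEntry`, `matchKeyword`) are hypothetical, the `'unicode'` boundary is only an assumed stand-in for the custom Vietnamese rule, and `khai báo` is an illustrative multi-word keyword rather than a confirmed entry in the keyword table. The matcher reads the source by index and never moves a shared cursor, so "rewind on no match" amounts to returning `null`.

```typescript
// Hypothetical sketch: names and keyword entries are illustrative only.
type Boundary = 'ascii' | 'unicode' | 'none';

interface KeywordEntry {
  type: string;       // token type emitted for this keyword
  boundary: Boundary; // rule checked on the character right after the keyword
}

interface TrieNode {
  children: Map<string, TrieNode>;
  entry?: KeywordEntry;
}

interface KeywordMatch {
  type: string;
  end: number; // index one past the last matched character
}

function makeNode(): TrieNode {
  return { children: new Map() };
}

function insertKeyword(root: TrieNode, word: string, entry: KeywordEntry): void {
  let node = root;
  // Index by UTF-16 code unit so trie keys line up with source.charAt(pos).
  for (let i = 0; i < word.length; i++) {
    const ch = word.charAt(i);
    let next = node.children.get(ch);
    if (!next) {
      next = makeNode();
      node.children.set(ch, next);
    }
    node = next;
  }
  node.entry = entry;
}

function boundaryOk(rule: Boundary, source: string, end: number): boolean {
  if (rule === 'none' || end >= source.length) return true;
  const next = source.charAt(end);
  if (rule === 'ascii') return !/[A-Za-z0-9_]/.test(next);
  // Assumed rule for Vietnamese: the keyword must not run into a letter, mark, or digit.
  return !/[\p{L}\p{M}\p{N}_]/u.test(next);
}

// Walks the trie from `start` and keeps the longest keyword whose boundary rule holds.
function matchKeyword(root: TrieNode, source: string, start: number): KeywordMatch | null {
  let node = root;
  let best: KeywordMatch | null = null;
  for (let pos = start; pos < source.length; pos++) {
    const next = node.children.get(source.charAt(pos));
    if (!next) break;
    node = next;
    if (node.entry && boundaryOk(node.entry.boundary, source, pos + 1)) {
      best = { type: node.entry.type, end: pos + 1 };
    }
  }
  return best;
}

const root = makeNode();
insertKeyword(root, 'if', { type: 'If', boundary: 'ascii' });
insertKeyword(root, 'return', { type: 'Return', boundary: 'none' });
insertKeyword(root, 'khai báo', { type: 'Var', boundary: 'unicode' });
```

With these entries, `matchKeyword(root, 'iffy', 0)` yields `null` because the ASCII boundary fails, while `matchKeyword(root, 'returnx', 0)` still matches `return` because that entry carries no boundary rule. A caller would fall back to identifier scanning whenever the result is `null`.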

Wired through Parser via a tokenizerKind: 'fsm' | 'regex' option (default
'fsm'). The legacy Tokenizer is kept side-by-side for parity testing and
rollback.
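
To illustrate how a `tokenizerKind` switch and a parity check could fit together, here is a rough sketch. The two tokenizers are toy stand-ins, and `TokenSource`, `createTokenizer`, and `firstDivergence` are hypothetical names, not the project's API. The only detail taken from this PR is the `'fsm' | 'regex'` option with `'fsm'` as the default.

```typescript
// Hypothetical sketch: toy tokenizers standing in for TokenizerFSM and the legacy Tokenizer.
type TokenizerKind = 'fsm' | 'regex';

interface Token {
  type: 'Number' | 'Identifier';
  value: string;
}

interface TokenSource {
  tokenize(source: string): Token[];
}

// Legacy-style stand-in: finds tokens with a regular expression.
const legacy: TokenSource = {
  tokenize(source) {
    const tokens: Token[] = [];
    for (const m of source.matchAll(/\d+|[A-Za-z_]+/g)) {
      tokens.push({ type: /^\d/.test(m[0]) ? 'Number' : 'Identifier', value: m[0] });
    }
    return tokens;
  },
};

const isDigit = (c: number): boolean => c >= 48 && c <= 57;
const isAlpha = (c: number): boolean =>
  (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95;

// FSM-style stand-in: dispatches on the char code at the cursor and slices once per token.
const fsm: TokenSource = {
  tokenize(source) {
    const tokens: Token[] = [];
    let pos = 0;
    while (pos < source.length) {
      const code = source.charCodeAt(pos);
      const inRun = isDigit(code) ? isDigit : isAlpha(code) ? isAlpha : null;
      if (!inRun) {
        pos++; // skip whitespace and anything this toy grammar does not cover
        continue;
      }
      const start = pos;
      while (pos < source.length && inRun(source.charCodeAt(pos))) pos++;
      tokens.push({
        type: inRun === isDigit ? 'Number' : 'Identifier',
        value: source.slice(start, pos),
      });
    }
    return tokens;
  },
};

function createTokenizer(kind: TokenizerKind = 'fsm'): TokenSource {
  return kind === 'fsm' ? fsm : legacy;
}

// Returns the index of the first differing token, or -1 when both streams agree.
function firstDivergence(source: string): number {
  const a = createTokenizer('fsm').tokenize(source);
  const b = createTokenizer('regex').tokenize(source);
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i];
    const y = b[i];
    if (!x || !y || x.type !== y.type || x.value !== y.value) return i;
  }
  return -1;
}
```

A parity test would run `firstDivergence` over each fixture and expect `-1`. Keeping both implementations behind one option also makes rollback a one-line change.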

Tests: 402/402 pass (regex baseline was 347; +55 new tests for fixtures,
parity, snapshots, and the FSM tokenizer itself).

Bench (vitest bench, side-by-side):
  tiny          1.8k hz → 364k hz  (~206x)
  medium        48 hz   → 26k hz   (~546x)
  keywordHeavy  33 hz   → 20k hz   (~609x)
  stringHeavy   143 hz  → 69k hz   (~484x)
  large (14k)   1.1 hz  → 3.3k hz  (~2969x; eliminates super-linear regex behavior)

Adds a benchmark harness (vitest bench, fixtures + tokenizer + parser benches),
captured baseline JSON, and token-stream snapshots so future tokenizer changes
are guarded against silent drift.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o


imrim12 merged commit cf86a4e into main Apr 27, 2026
1 of 9 checks passed

imrim12 deleted the claude/state-machine-parser-cVot3 branch May 1, 2026 07:45
