docs: document state-machine tokenizer architecture #13
Merged
- New page `docs/architecture/tokenizer.md` covering FSM design (char-code dispatch, keyword trie with per-entry boundary rules, bounded backtracking for multi-word identifier vs keyword, operator longest-match), parity testing model, and bench harness
- Wire it into the vitepress sidebar under a new "Kiến trúc" section
- README: add tokenizer perf row to feature table, link to the new doc, update test count (249 → 402), add `pnpm bench` to dev commands
- getting-started.md: mention `pnpm bench` / `pnpm bench:baseline` and link to the architecture page
- roadmap.md: add an "off-roadmap update" note pointing to the new doc
- CHANGELOG.md: log the perf migration with bench numbers

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o
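To make the design terms in the commit above concrete, here is a minimal sketch of a trie keyed by char code that performs a longest match and checks a per-entry boundary rule. It is not the project's code: `createNode`, `insert`, `matchLongest`, the sample keywords, and the exact meaning given to `WORD`, `IDENT`, and `NONE` are all assumptions made for illustration. In particular, treating every non-ASCII code unit as an identifier character is a deliberately crude stand-in for a Vietnamese-aware check.

```typescript
// Illustrative sketch only: names and rule semantics are assumptions.
type Boundary = "WORD" | "IDENT" | "NONE";

interface TrieNode {
  children: Map<number, TrieNode>; // keyed by UTF-16 char code
  entry?: { text: string; boundary: Boundary };
}

function createNode(): TrieNode {
  return { children: new Map() };
}

function insert(root: TrieNode, text: string, boundary: Boundary): void {
  let node = root;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    let next = node.children.get(code);
    if (!next) {
      next = createNode();
      node.children.set(code, next);
    }
    node = next;
  }
  node.entry = { text, boundary };
}

// ASCII word character, the set that a non-Unicode regex \w covers.
function isAsciiWordChar(code: number): boolean {
  return (
    (code >= 48 && code <= 57) || // 0-9
    (code >= 65 && code <= 90) || // A-Z
    (code >= 97 && code <= 122) || // a-z
    code === 95 // _
  );
}

// Assumption: any non-ASCII code unit may continue an identifier.
function isIdentChar(code: number): boolean {
  return isAsciiWordChar(code) || code > 127;
}

// True when the character at `end` does not violate the entry's rule.
function boundaryOk(input: string, end: number, boundary: Boundary): boolean {
  if (boundary === "NONE" || end >= input.length) return true;
  const code = input.charCodeAt(end);
  return boundary === "WORD" ? !isAsciiWordChar(code) : !isIdentChar(code);
}

// Walk the trie from `start`, remembering the longest entry whose
// boundary rule holds; returns null when nothing matches.
function matchLongest(root: TrieNode, input: string, start: number): string | null {
  let node: TrieNode | undefined = root;
  let best: string | null = null;
  for (let i = start; i < input.length; i++) {
    node = node.children.get(input.charCodeAt(i));
    if (!node) break;
    if (node.entry && boundaryOk(input, i + 1, node.entry.boundary)) {
      best = node.entry.text;
    }
  }
  return best;
}

// Usage with hypothetical entries.
const root = createNode();
for (const op of ["=", "==", "==="]) insert(root, op, "NONE");
const kw = "nếu"; // sample keyword, chosen for illustration
insert(root, kw, "IDENT");

console.log(matchLongest(root, "a === b", 2)); // "==="
console.log(matchLongest(root, kw + " x", 0)); // the keyword
console.log(matchLongest(root, kw + "x", 0)); // null
```

Under these assumptions, a `WORD` rule would accept a keyword followed by an accented letter such as `à`, because an ASCII-only `\b` sees a boundary there, while `IDENT` would reject it. That difference is one plausible reason to store the rule per entry rather than globally.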
Now that the state-machine tokenizer is the only one, remove the dual implementation and present the codebase as if it were the first version.

- Delete `tokenizer-fsm.ts` and inline its content into `tokenizer.ts`, renaming the class from `TokenizerFSM` to `Tokenizer`.
- Remove `ITokenizer`, `TokenizerKind`, `ParserOptions`, and the `createTokenizer` factory from `parser.ts`. The `Parser` constructor takes no options again and always uses `new Tokenizer(this)`.
- Delete `packages/parser/src/constants/specs.ts` (regex spec table) and the now-unused `Spec` type from `@vietscript/shared`.
- Drop `tokenizer-fsm.test.ts` (parity-vs-regex tests no longer apply); remaining tokenizer behavior is covered by tokenizer-edge, vietnamese-keywords, identifier-match-keyword, plus the snapshot drift / fixture parity smoke tests.
- Simplify benches: `tokenizer.bench.ts` now benches a single `Tokenizer` across the 5 fixtures; `parser.bench.ts` drops the "regex baseline" name.
- Drop stale `comparison.json`; `baseline.json` regenerated from the single tokenizer.

Docs:

- Rewrite `docs/architecture/tokenizer.md` to describe the current tokenizer as-is (no "switched from regex" framing, no comparison tables).
- README: replace the perf-comparison row with a neutral description of the tokenizer.
- roadmap.md / getting-started.md: drop comparison wording, link to the architecture page.
- CHANGELOG: collapse the migration entry into a neutral "Tokenizer" section describing the current design.

Tests: 357/357 pass. Lint + typecheck clean across 7 packages.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o
Summary
Follow-up to the already-merged perf PR (FSM tokenizer replacing the regex tokenizer). Until now the tokenizer architecture existed only in code and commit messages; this PR brings it into the docs.
- `docs/architecture/tokenizer.md`: [...] `WORD`/`IDENT`/`NONE` faithfully emulate `\b`, the Vietnamese lookahead, or a keyword with no boundary), bounded backtracking for multi-word identifier vs keyword, operator longest-match trie.
- [...] `packages/parser/bench/comparison.json`: 200×–3000× speedup, eliminating super-linear behavior.
- [...] `new Parser({ tokenizer: 'regex' })` for rollback debugging.
- [...] `specs.ts` and the `KEYWORDS` array in `tokenizer-fsm.ts`).
- [...] `pnpm bench` to dev commands.
- `getting-started.md`: mention `pnpm bench` / `pnpm bench:baseline` and link to the architecture page.
- `roadmap.md`: add a "Cập nhật ngoài lịch trình" (off-roadmap update) note pointing to the new doc.
- `CHANGELOG.md`: log the perf migration with full bench numbers.

Test plan
- `pnpm test`: 402/402 pass
- `pnpm docs:build`: verify that no NEW dead links are introduced (the 4 dead links already present on `main` were reproduced and are unrelated to this PR)
- [...] (`../roadmap.md`, `../compatibility.md`) resolve correctly

Notes
- [...] `CONTRIBUTING`, `packages/plugins/{vite,webpack}`, `basics/index`) intentionally left untouched: the scope here is small, so a separate PR will handle them.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o
Generated by Claude Code