Skip to content

tokenizers-overview-concept-page#134

Open
AndreasAbdi wants to merge 5 commits into
mainfrom
tokenizers-overview-concept-page
Open

tokenizers-overview-concept-page#134
AndreasAbdi wants to merge 5 commits into
mainfrom
tokenizers-overview-concept-page

Conversation

@AndreasAbdi

Copy link
Copy Markdown
Contributor

{
"project": "Model Atlas — Tokenizers Overview Concept Page",
"branchName": "tokenizers-overview-concept-page",
"description": "Publish the missing canonical English tokenizers-overview concept page, backed by the existing registry model and localized messages, so readers can search tokenizer topics from one anchor page and move from that overview into nearby glossary terms and specific tokenizer algorithms.",
"context": {
"customerAsk": "Add the missing canonical English docs page for tokenizers-overview under the correct concept-page template, including colocated messages/en.json and any required assets.json. Connect it to the existing concept.tokenizers-overview registry record, or correct that record if its metadata is not sufficient, and wire tags, aliases, and related IDs so search and related-doc surfaces connect it to token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization. Keep the slice English-only and ensure a registry-backed canonical page renders on current main with focused validation and tests passing.",
"problem": "The site has individual glossary and tokenizer-adjacent pages, but it is still missing the broad canonical concept page that explains tokenizers as a family. Readers therefore lack a clear entry point for how tokenization fits before embeddings, why vocabulary design affects sequence length and cost, and which nearby tokenizer algorithms or glossary terms they should read next. Discovery surfaces also do not yet have one obvious tokenizer overview destination to route readers toward.",
"solution": "Publish a canonical tokenizers-overview concept page backed by the existing concept registry model, with English-only message-driven content and only the minimal local asset support the page truly needs. Treat tokenizers-overview as the broad tokenization concept, then correct or complete its registry metadata so the page is discoverable and well-connected to token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization."
},
"acceptanceCriteria": [
"A published canonical concept page exists for tokenizers-overview with matching frontmatter, English messages, and any required local assets.",
"The page is backed by concept.tokenizers-overview, and that registry record is published with aliases, tags, and related IDs sufficient for search and related-doc discovery.",
"The page follows canonical docs writing standards: plain-language lead, no page-meta prose, one clear section job at a time, and no decorative graph or comparison aid unless it materially teaches the tokenizer family.",
"Readers can navigate between tokenizers-overview and the shipped nearby pages token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization through search, tags, or related-doc surfaces where those targets already exist.",
"The page renders on current main as the canonical docs route for tokenizers-overview and does not require locale-shell or unrelated taxonomy work to ship.",
"Focused validation covers the registry, messages, route wiring, and at least one discovery expectation specific to tokenizers-overview.",
"Quality gate: make typecheck, make lint, and make test pass."
],
"userStories": [
{
"id": "tokenizers-overview-concept-page-001",
"title": "Publish a complete registry-backed tokenizer overview record",
"description": "As a reader searching for tokenization topics, I want tokenizers-overview to be a published first-class concept record so search, tags, and related-doc logic have a canonical destination for the tokenizer family.",
"acceptanceCriteria": [
"concept.tokenizers-overview exists as a published concept record with stable slug tokenizers-overview, taxonomy that fits the broad tokenization concept, and default title/summary keys aligned with the canonical page.",
"Aliases cover representative search forms such as tokenizers, tokenizer overview, text tokenization, and other accurate high-intent variants without misclassifying tokenizer algorithms as the overview itself.",
"Tags and related IDs connect tokenizers-overview to token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization when those canonical targets exist in the branch.",
"The story stays limited to the metadata required to make tokenizers-overview discoverable and correctly classified, without broad unrelated taxonomy cleanup.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": ""
},
{
"id": "tokenizers-overview-concept-page-002",
"title": "Publish the canonical tokenizer overview concept page",
"description": "As a technical layperson learning language models, I want a dedicated tokenizer overview page so I can understand what tokenizers do, why they matter before embeddings and attention, and how the main algorithm families differ at a high level.",
"acceptanceCriteria": [
"A canonical concept page exists at /docs/concepts/tokenizers-overview using the concept-page template with matching frontmatter, messages/en.json, and any required local assets.json.",
"The page opens with a plain-language lead and then explains, in separate clear sections, what tokenizers are, why they matter, how they affect sequence length or cost, and how readers should distinguish the overview from specific algorithms and adjacent glossary terms.",
"The page gives readers onward paths to token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization without turning into a benchmark page, paper download page, or locale-plumbing change.",
"If the page includes a graph or comparison aid, it is the minimum teaching aid needed for the tokenizer family; if no visual materially improves understanding, the page ships without decorative asset churn.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": ""
},
{
"id": "tokenizers-overview-concept-page-003",
"title": "Make tokenizer discovery surfaces route readers into the overview",
"description": "As a reader exploring model inputs and preprocessing, I want search, tags, and related-doc surfaces to guide me into tokenizers-overview and then onward to nearby tokenizer and glossary pages.",
"acceptanceCriteria": [
"Representative queries such as tokenizer, tokenizers, text tokenization, or how text becomes tokens can return the canonical tokenizers-overview page as a direct relevant result.",
"The published tokenizers-overview page renders tag and related-doc surfaces that expose meaningful navigation to shipped nearby pages including token, embedding, vocabulary-size, special-tokens, bpe, wordpiece, sentencepiece, and byte-level-tokenization.",
"At least one neighboring shipped page or discovery surface can lead a reader into tokenizers-overview without requiring them to type the slug directly.",
"Browser-visible rendering shows the page title, summary, tags, and related-doc navigation without missing-content placeholders or broken links.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": ""
},
{
"id": "tokenizers-overview-concept-page-004",
"title": "Add focused validation for the tokenizer overview contract",
"description": "As a maintainer, I want targeted automated proof for the tokenizer overview slice so route, registry, message, and discovery regressions are caught without broad unrelated test expansion.",
"acceptanceCriteria": [
"Validation or tests confirm the tokenizers-overview route, concept.tokenizers-overview record, and default English messages resolve together as one canonical page.",
"Coverage asserts at least one discovery expectation specific to tokenizers-overview, such as search relevance, related-doc routing, or adjacent-page linkage.",
"Coverage remains behavioral and focused on the observable page slice rather than inventory snapshots, locale-manifest churn, or meta-test scaffolding.",
"Typecheck passes",
"Tests pass"
],
"priority": 4,
"passes": true,
"notes": ""
}
]
}

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Addressed mergeability follow-up from CI on the PR merge commit. The failing tokenizer-overview search test was asserting rank-1 for broad queries like tokenizer, tokenizers, and text tokenization, but the merged index can legitimately rank /docs/glossary/special-tokens ahead of the overview while still returning /docs/concepts/tokenizers-overview as a direct relevant result. Updated src/lib/content/tokenizers-overview-concept.test.ts to require the canonical overview to appear within the top results instead of forcing rank 1, which matches the shipped story contract. Revalidated with the focused tokenizer overview test, make lint, make typecheck, and make test.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Additional CI note at 2026-06-19T18:05:43Z UTC: all gates on commit 4ab35e3 are green except build-export, which has now hung twice on GitHub Actions without meaningful progress. On the latest rerun, build-export entered Run build-export at 2026-06-19T17:48:11Z and the workflow updated_at stayed frozen at 2026-06-19T17:47:45Z for more than 15 minutes, so I treated the prior run as stale, canceled it, and reran CI once. I also reproduced make build-export locally after the mergeability fix; it completed successfully in about 49 seconds. Remaining blocker is the external Actions runner hang, not a reproducible branch failure.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:08:56Z UTC: verified PR #134 still points at branch head 4ab35e3 and that the reviewed tokenizer-overview files are still present in the PR diff. I reproduced rtk make build-export locally on this head; it completed successfully and did not leave new generated-file drift. I then canceled the stale build-export job on workflow run 27839837114 after it had been hung for well over 15 minutes, but GitHub Actions kept the workflow in a queued rerun state and gh run rerun now reports This workflow is already running. Remaining blocker is the external queued Actions state rather than a reproducible branch failure.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:11:05Z UTC: PR #134 still points at branch head 4ab35e3, and I re-verified that the reviewed tokenizer-overview files remain present in the PR diff. The previously canceled CI workflow run 27839837114 has now been rerun successfully; GitHub Actions shows the same run record on attempt 3 for head SHA 4ab35e3 with fresh per-job check URLs. There is still no reviewer conversation feedback to address, and the only remaining blocker is the currently in-progress required CI rerun.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:15:39Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer conversation feedback to address. The prior required CI attempt was stale on GitHub Actions, so I canceled workflow run 27839837114 once its stuck jobs reached the external-runner hang pattern again and then reran it cleanly. GitHub has now started attempt 4 on the same run record for head SHA 4ab35e3; required checks are active again (lint, test-verify-contract, and build-export started immediately, with the rest queued behind them), so the only remaining blocker is fresh in-progress required CI on the latest reviewed head.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:19:25Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer conversation feedback to address. The prior required CI workflow run 27839837114 was still stale well beyond the 15-minute UTC threshold, so I canceled attempt 4, let GitHub settle it to terminal state, and then explicitly reran the workflow. GitHub has now started attempt 5 on the same run record for head SHA 4ab35e3; all required checks are attached to the latest reviewed head and are currently fresh pending/queued, so the remaining blocker is active CI rather than a reproducible branch failure.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:23:08Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer conversation feedback to address. The required CI workflow run 27839837114 was stuck in a stale queued state on attempt 5, so I canceled that attempt, waited for GitHub to settle it to terminal state, and then reran the workflow cleanly. GitHub has now started attempt 6 on the same run record for head SHA 4ab35e3; all required checks are attached to the latest reviewed head with fresh pending check-run IDs, so the remaining blocker is active CI rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:31:44Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer conversation feedback to address. The last required CI run on this head was stale well beyond the 15-minute threshold (build-export remained in progress while the workflow updated_at stayed frozen at 2026-06-19T18:28:09Z), so I canceled workflow run 27839837114, used the Actions force-cancel endpoint when GitHub left it in a ghost queued state, and then reran it cleanly. GitHub has now started attempt 7 on the same run record for head SHA 4ab35e3; all required checks are attached to the latest reviewed head with fresh pending check-run IDs, so the remaining blocker is active CI rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:43:18Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer blocking feedback to address. The last required CI attempt on workflow run 27839837114 was stale well beyond the 15-minute UTC threshold: build-export was still in progress from 2026-06-19T18:39:18Z, test-verify-contract was still queued from 2026-06-19T18:30:11Z, and the workflow had stopped updating. I canceled attempt 7, force-canceled the ghost state, reran the workflow on the same head, and verified through gh pr checks plus the commit check-runs API that fresh required check-run IDs are now attached to head 4ab35e3, with all checks newly queued as of 2026-06-19T18:43:49Z to 2026-06-19T18:43:50Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T18:48:31Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer blocking feedback to address. Workflow run 27839837114 attempt 8 was stale beyond the 15-minute UTC threshold with updated_at stuck at 2026-06-19T18:43:49Z, so I canceled that attempt, used the Actions force-cancel endpoint, waited for terminal state, and reran the workflow cleanly. Verified via the Actions run API and rtk gh pr checks that attempt 9 is now attached to head 4ab35e3 with fresh required check-run IDs created at 2026-06-19T18:48:27Z to 2026-06-19T18:48:28Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:07:24Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Workflow run 27839837114 attempt 9 was stale well beyond the 15-minute UTC threshold with updated_at frozen at 2026-06-19T18:48:28Z while required checks build-export, linkcheck, and test-integration stayed queued, so I force-canceled that stuck run and reran the same workflow record. Fresh required check-run IDs are now attached to the same reviewed head as of 2026-06-19T19:07:21Z UTC; remaining blocker is the fresh in-progress CI rerun rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-20T19:14:33Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 10 had been stuck with every required check still pending from 2026-06-19T19:07:21Z, so I force-canceled that stale ghost run and reran the workflow cleanly. Verified via the Actions run API and gh pr checks that attempt 11 is now attached to the same reviewed head with fresh required check-run IDs, so the only remaining blocker is active required CI on the latest PR head.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:17:25Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 11 had become a stale ghost run more than 24 hours old in UTC, so I submitted gh run cancel, confirmed the run was still stuck in ghost queued state, used the Actions force-cancel endpoint to push it to completed/cancelled, and then reran the workflow cleanly. Verified via the Actions run API and rtk gh pr checks that attempt 12 is now attached to the same reviewed head with fresh required check-run IDs as of 2026-06-19T19:17:08Z to 2026-06-19T19:17:10Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:19:33Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 12 had remained stale in ghost queued state with all ten required checks still pending from 2026-06-19T19:17:10Z, so I used the Actions force-cancel endpoint to push that run to completed/cancelled and then reran the same workflow record. Verified via the Actions run API and rtk gh pr checks that attempt 13 is now attached to the same reviewed head with fresh required check-run IDs created at 2026-06-19T19:19:04Z to 2026-06-19T19:19:06Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:21:15Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 13 was stale well beyond the 15-minute UTC threshold with every required check still pending and the run frozen at updated_at=2026-06-19T19:19:06Z, so I submitted gh run cancel, used the Actions force-cancel endpoint to push the ghost run to completed/cancelled, and reran the workflow. Verified via the Actions run API and rtk gh pr checks that attempt 14 is now attached to the same reviewed head with fresh required check-run IDs created at 2026-06-19T19:21:15Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:25:29Z UTC: PR #134 still points at head , the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run attempt 14 had remained stuck in ghost state from , so I force-canceled that stale run, confirmed it reached , and reran the workflow on the same reviewed head. Verified via the Actions run API and build-export pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898791
coverage pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898785
linkcheck pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898916
lint pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898769
test pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898866
test-build-contract pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898756
test-integration pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898991
test-verify-contract pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898787
typecheck pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898809
validate-data pending 0 https://github.com/portpowered/ai-model-reference/actions/runs/27839837114/job/82410898884 that fresh required check-run IDs are now attached to head as of UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:25:29Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 14 had remained stuck in ghost queued state from 2026-06-19T19:21:15Z, so I force-canceled that stale run, confirmed it reached completed/cancelled, and reran the workflow on the same reviewed head. Verified via the Actions run API and rtk gh pr checks that fresh required check-run IDs are now attached to head 4ab35e3 as of 2026-06-19T19:25:29Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:29:00Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 15 was still stuck in ghost queued state with every required check pending from 2026-06-19T19:25:31Z, so I canceled it, used the Actions force-cancel endpoint to push it to completed/cancelled, and reran the workflow. Verified via the Actions run API and rtk gh pr checks that attempt 16 is now attached to the same reviewed head with fresh required check-run IDs created at 2026-06-19T19:28:53Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:31:10Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 16 was stale in ghost queued state with every required check still pending from 2026-06-19T19:28:54Z, so I used the Actions force-cancel endpoint to push that run to completed/cancelled and then reran the same workflow record. Verified via the Actions run API and rtk gh pr checks that attempt 17 is now attached to the same reviewed head with fresh required check-run IDs created at 2026-06-19T19:30:58Z UTC. Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

@AndreasAbdi

Copy link
Copy Markdown
Contributor Author

Mergeability follow-up at 2026-06-19T19:34:43Z UTC: PR #134 still points at head 4ab35e3, the reviewed tokenizer-overview files remain present in the PR diff, and there is still no reviewer PR conversation feedback to address. Required CI workflow run 27839837114 attempt 17 had remained stuck in ghost queued state with every required check still pending from 2026-06-19T19:30:58Z, so I canceled it, used the Actions force-cancel endpoint to push that stale run to completed/cancelled, and reran the same workflow on the same reviewed head. Verified via gh pr view --json statusCheckRollup that fresh required check-run IDs are now attached to head 4ab35e3 as of 2026-06-19T19:34:28Z UTC (lint 82411947436, typecheck 82411947438, test 82411947423, test-verify-contract 82411947490, coverage 82411947446, test-build-contract 82411947448, build-export 82411947488, test-integration 82411947512, validate-data 82411947455, linkcheck 82411947467). Remaining blocker is fresh in-progress required CI on the latest reviewed head rather than unresolved branch work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant