[discussion only — not for merge] Spec-literal jgm/djot#393 heading IDs (preserves non-ASCII)#182
[discussion only — not for merge] Spec-literal jgm/djot#393 heading IDs (preserves non-ASCII)#182dereuromark wants to merge 1 commit into
Conversation
Follow the djot spec rule verbatim: replace each maximal run of non-alphanumeric ASCII characters with a single dash, trim leading and trailing dashes, and preserve every non-ASCII code point (letters, digits and punctuation alike). This drops the previous underscore exception and the Unicode-letter filter, so the only remaining behavior beyond the spec text is the two CSS-validity adjustments (leading-digit h- prefix, empty result falling back to a generated s-N identifier) that the spec leaves unspecified anyway. normalizeId() is now a single ASCII byte-class replacement, so all UTF-8 multibyte sequences are preserved without a separate Unicode pass. Also fix a pre-existing correctness bug that the wider fallback trigger exposed: the s-N fallback returned without recording itself, so it could collide with an explicit id or a heading whose text normalizes to the same value. It now skips taken s-N values and registers the one it uses. Heading-reference tests updated to the new, self-consistent anchors (href now asserted equal to the generated section id). Docs rewritten with the prose-vs-implementation divergence called out.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #182 +/- ##
=========================================
Coverage 91.10% 91.10%
+ Complexity 3321 3320 -1
=========================================
Files 99 99
Lines 8497 8498 +1
=========================================
+ Hits 7741 7742 +1
Misses 756 756 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR updates djot-php’s auto-generated heading ID normalization to follow the settled djot spec wording from jgm/djot#393 (replace each maximal run of non-alphanumeric ASCII with -, trim leading/trailing -, preserve non-ASCII verbatim), while also fixing a pre-existing s-N fallback deduplication collision and updating tests/docs to match the new behavior.
Changes:
- Adjust
HeadingIdTracker::normalizeId()to implement the #393 ASCII-run replacement rule and return''for “no usable content” so the caller can apply ans-Nfallback. - Fix fallback ID generation so
s-Nvalues are deduplicated against both previously-generated IDs and explicitly tracked IDs. - Update renderer/extension tests and the enhancements documentation to reflect the new normalization and anchor behavior (including smart punctuation cases).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/Renderer/HeadingIdTracker.php |
Implements #393 ASCII-run replacement normalization and fixes s-N fallback collision tracking. |
tests/TestCase/Renderer/HeadingIdTrackerTest.php |
Updates/expands normalization + fallback collision test coverage to match the new rules. |
tests/TestCase/Extension/HeadingReferenceExtensionTest.php |
Updates anchor expectations to assert href matches the generated id verbatim with smart punctuation. |
docs/reference/enhancements.md |
Rewrites the “CSS-Safe Heading IDs” section to document the #393-aligned behavior and the intentional divergences. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * Two deliberate, documented deviations keep the result a valid CSS | ||
| * identifier for `querySelector()` / HTMX consumers: | ||
| * - a leading ASCII digit gets an `h-` prefix (a CSS identifier cannot | ||
| * start with a digit); | ||
| * - an empty result is returned as `''` so the caller can fall back to | ||
| * a generated `s-N` identifier (matching djot.js), rather than a | ||
| * literal sentinel. |
What changed
HeadingIdTracker::normalizeId()now follows the #393 rule verbatim:-(this now includes_,',",:,;— the previous_exception is gone).h-prefix (a CSS identifier cannot start with a digit);s-Nidentifier (matching djot.js).Implementation is a single ASCII byte-class replacement (
[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+), so all UTF-8 multibyte sequences are untouched and non-ASCII is preserved without a\p{…}/upass.Prose vs. implementation (why this is a draft)
#393 only reworded the spec; the djot.js reference implementation still uses a fixed denylist that preserves
_ ' " : ;and does not collapse-runs. Following the prose therefore differs from current djot.js output on those ASCII characters. Detail and a proposed upstream clarification are in jgm/djot#391.Practical consequence with smart punctuation (on by default):
# Bob's Guide→id="Bob’s-Guide"(U+2019 is non-ASCII, preserved by both the prose and djot.js). The heading-reference extension stays correct because the[[…]]target runs through the same normalization, so thehrefmatches the sectionidverbatim. TwoHeadingReferenceExtensiontests were updated to assert the new, self-consistent anchors (and now also assertid ==hrefto pin that invariant).Fallback dedup fix
While here, fixed a pre-existing correctness bug the wider fallback trigger exposed: the
s-Nfallback returned without recording itself inusedIds, so it could collide with an explicitid="s-1"or a heading whose text normalizes tos-1. The fallback now skips takens-Nvalues and registers the one it uses.Tests / docs
docs/reference/enhancements.mdrewritten: normalization rules, examples, and a prose-vs-impl alignment table with the divergence called out in a note.Verification
phpunit: 2157 tests, 5359 assertions — greenphpstan: no errorsphpcs: cleancodex review: clean (after the fallback-dedup fix)