[discussion only — not for merge] Spec-literal jgm/djot#393 heading IDs (preserves non-ASCII) #182

Closed
dereuromark wants to merge 1 commit into master from align-heading-ids-djot-393

@dereuromark

Draft / for discussion — this fully aligns auto heading-ID generation with the jgm/djot#393 spec wording. It intentionally diverges from the current djot.js implementation on a few ASCII characters (see "Prose vs. implementation" below). Opening as draft so we can decide whether to ship this, gate it behind an option, or wait for upstream to reconcile prose vs. reference.

What changed

HeadingIdTracker::normalizeId() now follows the #393 rule verbatim:

replacing each maximal run of non-alphanumeric ASCII characters with -, removing any leading or trailing -

  • Every non-alphanumeric ASCII run → single - (this now includes _, ', ", :, ; — the previous _ exception is gone).
  • Every non-ASCII code point is preserved verbatim — letters, digits and punctuation/symbols (valid CSS identifier code points).
  • The two remaining deviations are the CSS-validity adjustments the spec leaves unspecified anyway:
    1. a leading ASCII digit gets an h- prefix (a CSS identifier cannot start with a digit);
    2. an empty result falls back to a generated s-N identifier (matching djot.js).

Implementation is a single ASCII byte-class replacement ([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+), so all UTF-8 multibyte sequences are untouched and non-ASCII is preserved without a \p{…}/u pass.
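
The rules above can be sketched in a few lines of PHP. This is a minimal illustration rather than the PR's actual code: the function name `normalizeIdSketch()` is hypothetical, and only the character class is taken from the description. The pattern deliberately has no `/u` modifier, so it operates on bytes and never touches UTF-8 multibyte sequences.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for HeadingIdTracker::normalizeId(); an
 * illustration of the described rules, not the code from this PR.
 */
function normalizeIdSketch(string $text): string
{
    // Replace each maximal run of non-alphanumeric ASCII bytes with one dash.
    // Without the /u modifier, bytes >= 0x80 (UTF-8 multibyte) never match.
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? '';

    // Remove any leading or trailing dash.
    $id = trim($id, '-');

    // Empty result: the caller is expected to substitute an s-N fallback.
    if ($id === '') {
        return '';
    }

    // CSS-validity adjustment: an identifier cannot start with a digit.
    if (preg_match('/^[0-9]/', $id) === 1) {
        $id = 'h-' . $id;
    }

    return $id;
}

echo normalizeIdSketch('Hello, World!'), "\n";          // Hello-World
echo normalizeIdSketch('snake_case: a "test"'), "\n";   // snake-case-a-test
echo normalizeIdSketch('2024 Roadmap'), "\n";           // h-2024-Roadmap
echo normalizeIdSketch('Überblick über Größen'), "\n";  // Überblick-über-Größen
var_dump(normalizeIdSketch('!!!'));                     // string(0) ""
```

The second example shows the dropped `_` exception: under the prose rule, `_`, `:` and `"` all fold into dashes.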

Prose vs. implementation (why this is a draft)

#393 only reworded the spec; the djot.js reference implementation still uses a fixed denylist that preserves _ ' " : ; and does not collapse - runs. Following the prose therefore differs from current djot.js output on those ASCII characters. Detail and a proposed upstream clarification are in jgm/djot#391.

Practical consequence with smart punctuation (on by default): # Bob's Guide → id="Bob’s-Guide" (U+2019 is non-ASCII, preserved by both the prose and djot.js). The heading-reference extension stays correct because the [[…]] target runs through the same normalization, so the href matches the section id verbatim. Two HeadingReferenceExtension tests were updated to assert the new, self-consistent anchors (and now also assert id == href to pin that invariant).
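
A small sketch of that invariant, under stated assumptions: `asciiRunsToDash()` is a hypothetical helper covering only the ASCII-run rule, and both the heading text and the reference target are assumed to have already passed through smart punctuation (so the apostrophe is U+2019). It is not the extension's real code path.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: the ASCII-run rule only (no h- prefix, no s-N fallback).
function asciiRunsToDash(string $text): string
{
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? '';

    return trim($id, '-');
}

// Assumed smart-punctuation output: the ASCII apostrophe has become U+2019.
$headingText = "Bob\u{2019}s Guide";
// Assumed reference target after the same smart-punctuation pass.
$referenceTarget = "Bob\u{2019}s Guide";

$sectionId = asciiRunsToDash($headingText);
$href = '#' . asciiRunsToDash($referenceTarget);

echo $sectionId, "\n";                      // Bob’s-Guide
echo asciiRunsToDash("Bob's Guide"), "\n";  // Bob-s-Guide (ASCII apostrophe folds into a dash)
var_dump($href === '#' . $sectionId);       // bool(true)
```

Because both sides use one function, the anchor and its target cannot drift apart, whichever apostrophe form reaches the normalizer.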

Fallback dedup fix

While here, fixed a pre-existing correctness bug the wider fallback trigger exposed: the s-N fallback returned without recording itself in usedIds, so it could collide with an explicit id="s-1" or a heading whose text normalizes to s-1. The fallback now skips taken s-N values and registers the one it uses.
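
The fix can be pictured with the following sketch. The class and method names are hypothetical (only `usedIds` appears in the description), and the real tracker does more than this; the point is just that the fallback loop skips taken values and records the one it hands out.

```php
<?php
declare(strict_types=1);

/** Hypothetical, simplified tracker; not the PR's HeadingIdTracker. */
final class FallbackTrackerSketch
{
    /** @var array<string, true> IDs already in use (explicit or generated). */
    private array $usedIds = [];

    private int $counter = 0;

    /** Record an explicit or text-derived ID, e.g. id="s-1". */
    public function register(string $id): void
    {
        $this->usedIds[$id] = true;
    }

    /** Return the next free s-N value and register it. */
    public function fallback(): string
    {
        do {
            $this->counter++;
            $candidate = 's-' . $this->counter;
        } while (isset($this->usedIds[$candidate]));

        // The step the old code skipped: remember the fallback we returned.
        $this->usedIds[$candidate] = true;

        return $candidate;
    }
}

$tracker = new FallbackTrackerSketch();
$tracker->register('s-1');        // explicit id="s-1" seen earlier in the document
echo $tracker->fallback(), "\n";  // s-2
echo $tracker->fallback(), "\n";  // s-3
```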

Tests / docs

  • TDD throughout: red → green for the alignment change, the smart-quote anchors, and the fallback-collision fix.
  • docs/reference/enhancements.md rewritten: normalization rules, examples, and a prose-vs-impl alignment table with the divergence called out in a note.

Verification

  • phpunit: 2157 tests, 5359 assertions — green
  • phpstan: no errors
  • phpcs: clean
  • codex review: clean (after the fallback-dedup fix)

Follow the djot spec rule verbatim: replace each maximal run of
non-alphanumeric ASCII characters with a single dash, trim leading
and trailing dashes, and preserve every non-ASCII code point
(letters, digits and punctuation alike).

This drops the previous underscore exception and the Unicode-letter
filter, so the only remaining behavior beyond the spec text is the
two CSS-validity adjustments (leading-digit h- prefix, empty result
falling back to a generated s-N identifier) that the spec leaves
unspecified anyway.

normalizeId() is now a single ASCII byte-class replacement, so all
UTF-8 multibyte sequences are preserved without a separate Unicode
pass.

Also fix a pre-existing correctness bug that the wider fallback
trigger exposed: the s-N fallback returned without recording itself,
so it could collide with an explicit id or a heading whose text
normalizes to the same value. It now skips taken s-N values and
registers the one it uses.

Heading-reference tests updated to the new, self-consistent anchors
(href now asserted equal to the generated section id). Docs rewritten
with the prose-vs-implementation divergence called out.
@dereuromark dereuromark added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels May 19, 2026
@dereuromark dereuromark requested a review from Copilot May 19, 2026 17:25
@codecov

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.10%. Comparing base (6b90dfe) to head (b83f1cb).

Additional details and impacted files
@@            Coverage Diff            @@
##             master     #182   +/-   ##
=========================================
  Coverage     91.10%   91.10%           
+ Complexity     3321     3320    -1     
=========================================
  Files            99       99           
  Lines          8497     8498    +1     
=========================================
+ Hits           7741     7742    +1     
  Misses          756      756           

Copilot AI left a comment

Pull request overview

This PR updates djot-php’s auto-generated heading ID normalization to follow the settled djot spec wording from jgm/djot#393 (replace each maximal run of non-alphanumeric ASCII with -, trim leading/trailing -, preserve non-ASCII verbatim), while also fixing a pre-existing s-N fallback deduplication collision and updating tests/docs to match the new behavior.

Changes:

  • Adjust HeadingIdTracker::normalizeId() to implement the #393 ASCII-run replacement rule and return '' for “no usable content” so the caller can apply an s-N fallback.
  • Fix fallback ID generation so s-N values are deduplicated against both previously-generated IDs and explicitly tracked IDs.
  • Update renderer/extension tests and the enhancements documentation to reflect the new normalization and anchor behavior (including smart punctuation cases).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/Renderer/HeadingIdTracker.php | Implements #393 ASCII-run replacement normalization and fixes s-N fallback collision tracking. |
| tests/TestCase/Renderer/HeadingIdTrackerTest.php | Updates/expands normalization + fallback collision test coverage to match the new rules. |
| tests/TestCase/Extension/HeadingReferenceExtensionTest.php | Updates anchor expectations to assert href matches the generated id verbatim with smart punctuation. |
| docs/reference/enhancements.md | Rewrites the “CSS-Safe Heading IDs” section to document the #393-aligned behavior and the intentional divergences. |

Comment on lines +99 to +105
* Two deliberate, documented deviations keep the result a valid CSS
* identifier for `querySelector()` / HTMX consumers:
* - a leading ASCII digit gets an `h-` prefix (a CSS identifier cannot
* start with a digit);
* - an empty result is returned as `''` so the caller can fall back to
* a generated `s-N` identifier (matching djot.js), rather than a
* literal sentinel.
@dereuromark dereuromark changed the title from "Align auto heading IDs with the settled djot spec wording (jgm/djot#393)" to "[discussion only — not for merge] Spec-literal jgm/djot#393 heading IDs (preserves non-ASCII)" May 19, 2026
@dereuromark dereuromark deleted the align-heading-ids-djot-393 branch May 20, 2026 02:06