From b83f1cbf8cf6fffa52d43006a9974216ae0c0a73 Mon Sep 17 00:00:00 2001
From: mscherer
Date: Tue, 19 May 2026 19:24:27 +0200
Subject: [PATCH] Align auto-generated heading IDs with the settled djot spec wording

Follow the djot spec rule verbatim: replace each maximal run of
non-alphanumeric ASCII characters with a single dash, trim leading and
trailing dashes, and preserve every non-ASCII code point (letters, digits
and punctuation alike). This drops the previous underscore exception and
the Unicode-letter filter, so the only remaining behavior beyond the spec
text is the two CSS-validity adjustments (leading-digit h- prefix, empty
result falling back to a generated s-N identifier) that the spec leaves
unspecified anyway.

normalizeId() is now a single ASCII byte-class replacement, so all UTF-8
multibyte sequences are preserved without a separate Unicode pass.

Also fix a pre-existing correctness bug that the wider fallback trigger
exposed: the s-N fallback returned without recording itself, so it could
collide with an explicit id or a heading whose text normalizes to the same
value. It now skips taken s-N values and registers the one it uses.

Heading-reference tests updated to the new, self-consistent anchors (href
now asserted equal to the generated section id). Docs rewritten with the
prose-vs-implementation divergence called out.
---
 docs/reference/enhancements.md                | 54 +++++------
 src/Renderer/HeadingIdTracker.php             | 68 ++++++++------
 .../HeadingReferenceExtensionTest.php         | 16 ++--
 .../Renderer/HeadingIdTrackerTest.php         | 90 ++++++++++++++++---
 4 files changed, 159 insertions(+), 69 deletions(-)

diff --git a/docs/reference/enhancements.md b/docs/reference/enhancements.md
index 09d36829..00f404b4 100644
--- a/docs/reference/enhancements.md
+++ b/docs/reference/enhancements.md
@@ -156,22 +156,20 @@ content is a symbol falls back to a generated `s-N` ID.
 
 ## CSS-Safe Heading IDs
 
-**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [jgm/djot#391](https://github.com/jgm/djot/issues/391)
+**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [jgm/djot#391](https://github.com/jgm/djot/issues/391), [jgm/djot#393](https://github.com/jgm/djot/pull/393)
 
 **Status:** Implemented in djot-php
 
-Auto-generated heading IDs are normalized to be valid CSS selectors, ensuring compatibility with `querySelector()`, HTMX scroll restoration, and CSS attribute selectors.
+Auto-generated heading IDs follow the djot spec rule settled in [jgm/djot#393](https://github.com/jgm/djot/pull/393), with two small deviations that keep every ID a valid CSS selector — ensuring compatibility with `querySelector()`, HTMX scroll restoration, and CSS attribute selectors.
 
 ### Normalization Rules
 
-1. **Strip `#` characters** — Prevents invalid selectors
-2. **Trim whitespace** — Clean leading/trailing spaces
-3. **Whitespace to dashes** — Spaces become single `-`
-4. **Invalid characters to dashes** — Only Unicode letters (`\p{L}`), numbers (`\p{N}`), hyphens, and underscores are preserved
-5. **Collapse consecutive dashes** — `foo--bar` becomes `foo-bar`
-6. **Trim leading/trailing dashes** — `-foo-` becomes `foo`
-7. **Prefix digits** — IDs starting with a number get `h-` prefix (CSS requirement)
-8. **Fallback** — Empty results become `heading`
+1. **Exclude non-textual content** — Symbols (`:name:`) and footnote references do not contribute to the ID
+2. **Replace non-alphanumeric ASCII** — Each maximal run of non-alphanumeric ASCII characters (anything except `A–Z`, `a–z`, `0–9`) becomes a single `-`
+3. **Preserve non-ASCII** — Unicode letters, digits, punctuation, and symbols are kept verbatim (they are valid CSS identifier code points)
+4. **Trim leading/trailing dashes** — `-foo-` becomes `foo`
+5. **Prefix digits** *(CSS deviation)* — A result starting with an ASCII digit gets an `h-` prefix (a CSS identifier cannot start with a digit)
+6. **Empty fallback** *(CSS deviation)* — A result with no usable content falls back to a generated `s-N` identifier (matching djot.js)
 
 ### Examples
 
@@ -179,12 +177,14 @@ Auto-generated heading IDs are normalized to be valid CSS selectors, ensuring co
 |---------|--------------|
 | `# Hello World` | `Hello-World` |
 | `# Hello World!` | `Hello-World` |
+| `# foo_bar` | `foo-bar` |
 | `# 日本語の見出し` | `日本語の見出し` |
 | `# Привет мир` | `Привет-мир` |
+| `# Spec — Notes` | `Spec-—-Notes` |
 | `# E=mc^2` | `E-mc-2` |
 | `# 123 Numbers First` | `h-123-Numbers-First` |
 | `# $this->method()` | `this-method` |
-| `# ###` | `heading` |
+| `# ###` | `s-1` |
 
 ### Unicode Preservation
 
@@ -231,23 +231,27 @@ Explicit IDs are used as-is without normalization.
 
 ### Spec Alignment
 
-The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 only changes the spec **prose** — the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is actually broader than djot.js itself: it would also strip `_`, which djot.js keeps.
+The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*.
 
-djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.
+djot-php follows that **prose verbatim**: every non-alphanumeric ASCII run (including `_`, `'`, `"`, `:`, `;`) becomes `-`, and every non-ASCII code point — letters, digits, and punctuation/symbols alike — is preserved. The only two deviations are the CSS-validity adjustments (leading-digit `h-` prefix, empty → `s-N`), which the spec leaves unspecified anyway.
 
-| Aspect | djot.js reference impl | #393 spec prose | djot-php |
-|--------|------------------------|-----------------|----------|
-| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
-| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
-| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
-| Underscore (`foo_bar`) | keep → `foo_bar` | strip → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
-| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
-| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
-| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
-| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
-| Symbols / footnote refs | excluded | excluded | excluded |
+| Aspect | #393 spec prose | djot-php |
+|--------|-----------------|----------|
+| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` |
+| Underscore (`foo_bar`) | replace → `foo-bar` | replace → `foo-bar` |
+| Apostrophe (`That's all`) | replace → `That-s-all` | replace → `That-s-all` |
+| Double quote / `;` / `:` | replace → `-` | replace → `-` |
+| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` |
+| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
+| Non-ASCII punctuation / smart quotes (`Bob’s Guide`) | preserve → `Bob’s-Guide` | preserve → `Bob’s-Guide` |
+| Leading digit (`2024 recap`) | `2024-recap` | **`h-2024-recap`** *(CSS deviation)* |
+| Empty result (`!!!`, `###`) | unspecified | **`s-N` fallback** *(CSS deviation)* |
+| Symbols / footnote refs | excluded | excluded |
 
-The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
+> [!NOTE]
+> #393 only changed the spec **prose**, not the djot.js reference implementation, which still uses a fixed punctuation denylist that *preserves* `_ ' " : ;` and does not collapse `-` runs. So djot-php (following the prose) intentionally differs from current djot.js output on those ASCII characters. This prose-vs-implementation gap is tracked upstream in [jgm/djot#391](https://github.com/jgm/djot/issues/391); when smart punctuation is enabled, note that both the prose *and* djot.js preserve the resulting non-ASCII quotes, so `# Bob's Guide` → `id="Bob’s-Guide"` (the heading-reference extension stays consistent because the link target runs through the same normalization).
+
+The two CSS-validity deviations are deliberate: a leading digit makes `querySelector('#2col')` a syntax error, and an empty/sentinel id is worse than a stable generated `s-N`. Everything else is verbatim #393.
 
 ---
 
diff --git a/src/Renderer/HeadingIdTracker.php b/src/Renderer/HeadingIdTracker.php
index 78bdd017..9cec90b4 100644
--- a/src/Renderer/HeadingIdTracker.php
+++ b/src/Renderer/HeadingIdTracker.php
@@ -88,37 +88,43 @@ public function trackId(string $id): void
     }
 
     /**
-     * Normalize text into a valid CSS identifier string
+     * Normalize heading text into an identifier (jgm/djot#393)
      *
-     * 1. Strip # characters entirely
-     * 2. Trim whitespace
-     * 3. Replace whitespace sequences (including Unicode spaces) with single dashes
-     * 4. Replace any remaining characters that are invalid in CSS identifiers
-     *    (anything other than Unicode letters/numbers, hyphens, and underscores)
-     *    with dashes
-     * 5. Collapse consecutive dashes and trim leading/trailing dashes
-     * 6. Prefix with 'h-' if the result starts with a digit, ensuring a valid
-     *    CSS ident start (digits are not allowed as the first character)
+     * Each maximal run of non-alphanumeric ASCII characters is replaced with
+     * a single `-`, and leading/trailing `-` are trimmed. Non-ASCII
+     * characters (Unicode letters, digits, punctuation, symbols) are
+     * preserved verbatim — they fall outside the spec's ASCII replacement
+     * set and are valid CSS identifier code points.
      *
-     * Producing a valid CSS identifier ensures that consumers such as HTMX,
-     * which call `querySelector` with the section ID for scroll-restoration,
-     * do not throw a SyntaxError when headings contain inline code or special
-     * characters (e.g. `$this->t($key, $params = [], $fallback = '')`).
+     * Two deliberate, documented deviations keep the result a valid CSS
+     * identifier for `querySelector()` / HTMX consumers:
+     *   - a leading ASCII digit gets an `h-` prefix (a CSS identifier cannot
+     *     start with a digit);
+     *   - an empty result is returned as `''` so the caller can fall back to
+     *     a generated `s-N` identifier (matching djot.js), rather than a
+     *     literal sentinel.
+     *
+     * @return string The identifier, or '' when the text has no usable content.
      */
     public function normalizeId(string $text): string
     {
-        $id = str_replace('#', '', $text);
-        $id = trim($id);
-        $id = preg_replace('/\s+/u', '-', $id) ?? $id;
-        $id = preg_replace('/[^\p{L}\p{N}_-]+/u', '-', $id) ?? $id;
-        $id = preg_replace('/-{2,}/', '-', $id) ?? $id;
+        // 0x30-0x39 = 0-9, 0x41-0x5A = A-Z, 0x61-0x7A = a-z. Every other
+        // byte in 0x00-0x7F is non-alphanumeric ASCII; bytes >= 0x80 (all
+        // UTF-8 multibyte sequences) are left untouched so non-ASCII text is
+        // preserved. No /u flag: the class only ever matches single ASCII
+        // bytes, never a continuation byte of a multibyte character.
+        $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? $text;
         $id = trim($id, '-');
 
-        if ($id !== '' && preg_match('/^\p{N}/u', $id)) {
+        if ($id === '') {
+            return '';
+        }
+
+        if (preg_match('/^[0-9]/', $id) === 1) {
             $id = 'h-' . $id;
         }
 
-        return $id !== '' ? $id : 'heading';
+        return $id;
     }
 
     /**
@@ -216,15 +222,23 @@ protected function generateId(Heading $node): string
         // excluding non-textual elements such as symbols and footnote
         // references (jgm/djot#393).
         $idText = $this->extractPlainText($node, forId: true);
+        $baseId = $this->normalizeId($idText);
 
-        if ($idText === '') {
-            // Generate fallback ID
-            $this->sectionCounter++;
+        if ($baseId === '') {
+            // No usable content (empty heading, or text that is entirely
+            // ASCII punctuation) — fall back to a generated `s-N`
+            // identifier, matching djot.js. Skip any `s-N` already taken by
+            // an explicit id or a heading whose text normalizes to it, so
+            // the fallback never produces a duplicate.
+            do {
+                $this->sectionCounter++;
+                $baseId = 's-' . $this->sectionCounter;
+            } while (isset($this->usedIds[$baseId]));
 
-            return 's-' . $this->sectionCounter;
-        }
+            $this->usedIds[$baseId] = 0;
 
-        $baseId = $this->normalizeId($idText);
+            return $baseId;
+        }
 
         // Track and deduplicate
         if (!isset($this->usedIds[$baseId])) {
diff --git a/tests/TestCase/Extension/HeadingReferenceExtensionTest.php b/tests/TestCase/Extension/HeadingReferenceExtensionTest.php
index 15cfcc6a..92031197 100644
--- a/tests/TestCase/Extension/HeadingReferenceExtensionTest.php
+++ b/tests/TestCase/Extension/HeadingReferenceExtensionTest.php
@@ -172,16 +172,19 @@ public function testHeadingWithSmartQuotesMatchesStraightQuoteReference(): void
         $converter = new DjotConverter();
         $converter->addExtension(new HeadingReferenceExtension());
 
-        // The parser converts straight quotes to smart quotes in heading text,
-        // but reference targets keep straight quotes. The extension normalizes
-        // quotes for matching so this should resolve correctly.
+        // The parser converts straight quotes to smart quotes in heading text.
+        // Per jgm/djot#393 the resulting non-ASCII quote characters are
+        // preserved in the identifier; the reference still resolves because
+        // both the heading ID and the link target run through the same
+        // normalization, so the href must equal the section id verbatim.
         $html = $converter->convert(<<<'DJOT'
         See [[Say "Hello"]].
 
         # Say "Hello"
         DJOT);
 
-        $this->assertStringContainsString('href="#Say-Hello"', $html);
+        $this->assertStringContainsString('id="Say-“Hello”"', $html);
+        $this->assertStringContainsString('href="#Say-“Hello”"', $html);
         $this->assertStringNotContainsString('[[Say "Hello"]]', $html);
     }
 
@@ -210,7 +213,10 @@ public function testHeadingWithApostropheResolvesCorrectly(): void
         # Bob's Guide
         DJOT);
 
-        $this->assertStringContainsString('href="#Bob-s-Guide"', $html);
+        // Smart-punctuation turns the apostrophe into U+2019, which jgm/djot#393
+        // preserves (non-ASCII). The href must match the generated section id.
+        $this->assertStringContainsString('id="Bob’s-Guide"', $html);
+        $this->assertStringContainsString('href="#Bob’s-Guide"', $html);
         $this->assertStringNotContainsString('data-heading-ref=', $html);
         $this->assertStringNotContainsString('[[Bob\'s Guide]]', $html);
     }
diff --git a/tests/TestCase/Renderer/HeadingIdTrackerTest.php b/tests/TestCase/Renderer/HeadingIdTrackerTest.php
index 8ab68f13..4d6497ca 100644
--- a/tests/TestCase/Renderer/HeadingIdTrackerTest.php
+++ b/tests/TestCase/Renderer/HeadingIdTrackerTest.php
@@ -160,18 +160,20 @@ public function testNormalizeId(): void
         $this->assertSame('this-t-key-params-fallback', $this->tracker->normalizeId("\$this->t(\$key, \$params = [], \$fallback = '')"));
         $this->assertSame('My-title', $this->tracker->normalizeId('My --- title'));
         $this->assertSame('日本語の見出し', $this->tracker->normalizeId('日本語の見出し'));
-        $this->assertSame('heading', $this->tracker->normalizeId('###'));
+        $this->assertSame('', $this->tracker->normalizeId('###'));
         $this->assertSame('h-123-Things', $this->tracker->normalizeId('123 Things'));
         $this->assertSame('h-1-Introduction', $this->tracker->normalizeId('1. Introduction'));
     }
 
     /**
-     * Pins behaviour discussed in jgm/djot#391 (spec wording on auto-ID generation).
+     * Pins the auto-ID rule settled in jgm/djot#393: each maximal run of
+     * non-alphanumeric ASCII characters is replaced with `-`, leading/trailing
+     * `-` are trimmed, and non-ASCII characters are preserved verbatim.
      *
-     * djot-php sides with djot.js / djoths on remove-vs-replace (mid-word punctuation
-     * becomes `-`), and deliberately deviates on apostrophes / quotes / `;` / `:` by
-     * also replacing them, so generated IDs are valid CSS identifiers and safe to use
-     * with `querySelector()`.
+     * djot-php follows this prose, including dropping the previous `_`
+     * exception. The only deliberate deviations are the two CSS-validity
+     * adjustments (leading-digit `h-` prefix, empty result → `s-N` fallback),
+     * which the heading-level tests cover.
      */
     public function testNormalizeIdSpecAlignmentEdgeCases(): void
     {
@@ -179,10 +181,15 @@ public function testNormalizeIdSpecAlignmentEdgeCases(): void
         $this->assertSame('Emphasis-strong', $this->tracker->normalizeId('Emphasis/strong'));
         $this->assertSame('That-s-all', $this->tracker->normalizeId("That's all"));
         $this->assertSame('foo-bar', $this->tracker->normalizeId('foo...bar'));
+        $this->assertSame('foo-bar-baz', $this->tracker->normalizeId('foo_bar baz'));
         $this->assertSame('Uber-uns', $this->tracker->normalizeId('Uber uns'));
         $this->assertSame('Über-uns', $this->tracker->normalizeId('Über uns'));
+        // Non-ASCII punctuation/symbols are not "non-alphanumeric ASCII", so
+        // they are preserved (and are valid CSS identifier code points).
+        $this->assertSame('A–B', $this->tracker->normalizeId('A–B'));
+        $this->assertSame('café—bar', $this->tracker->normalizeId('café—bar'));
         $this->assertSame('h-2024-recap', $this->tracker->normalizeId('2024 recap'));
-        $this->assertSame('heading', $this->tracker->normalizeId('!!!'));
+        $this->assertSame('', $this->tracker->normalizeId('!!!'));
     }
 
     public function testGetPlainText(): void
@@ -348,17 +355,76 @@ public function testSymbolsRetainedInPlainText(): void
     }
 
     /**
-     * djot.js keeps `_` (it is not in its punctuation denylist) and it is a
-     * valid CSS identifier character, so djot-php keeps it too. This pins the
-     * deliberate divergence from the looser #393 spec prose.
+     * Per the jgm/djot#393 wording, `_` is a non-alphanumeric ASCII character
+     * and is replaced with `-` like any other punctuation (the previous `_`
+     * exception is gone).
      */
-    public function testUnderscoreRetainedInId(): void
+    public function testUnderscoreReplacedInId(): void
     {
         $heading = new Heading(2);
         $heading->appendChild(new Text('foo_bar baz'));
 
         $id = $this->tracker->getIdForHeading($heading);
 
-        $this->assertSame('foo_bar-baz', $id);
+        $this->assertSame('foo-bar-baz', $id);
+    }
+
+    /**
+     * A heading whose text normalizes to nothing (all ASCII punctuation)
+     * falls back to a generated `s-N` identifier, matching djot.js — not the
+     * literal `heading` sentinel djot-php used previously.
+     */
+    public function testAllPunctuationHeadingGetsFallbackId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('!!!'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('s-1', $id);
+    }
+
+    /**
+     * The generated `s-N` fallback must not collide with a real heading whose
+     * text normalizes to the same value (e.g. `# s 1` → `s-1`).
+     */
+    public function testFallbackIdDoesNotCollideWithNormalHeading(): void
+    {
+        $punct = new Heading(2);
+        $punct->appendChild(new Text('!!!'));
+
+        $sOne = new Heading(2);
+        $sOne->appendChild(new Text('s 1'));
+
+        $firstId = $this->tracker->getIdForHeading($punct);
+        $secondId = $this->tracker->getIdForHeading($sOne);
+
+        $this->assertSame('s-1', $firstId);
+        $this->assertNotSame($firstId, $secondId);
+    }
+
+    /**
+     * The fallback must also avoid explicitly tracked IDs.
+     */
+    public function testFallbackIdAvoidsTrackedExplicitId(): void
+    {
+        $this->tracker->trackId('s-1');
+
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('###'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('s-2', $id);
+    }
+
+    public function testNonAsciiPunctuationHeadingIsPreserved(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Spec — Notes'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('Spec-—-Notes', $id);
     }
 }
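
To try the patched rule outside the test suite, the standalone sketch below mirrors the two pieces of logic this patch touches: the ASCII byte-class replacement and the `s-N` fallback that skips taken IDs. The function names `sketchNormalizeId()` and `sketchFallbackId()` are hypothetical and not part of djot-php's API. The sketch also assumes a plain array of used IDs and leaves out the tracker's deduplication of non-fallback IDs.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the patched normalizeId(): replaces each run of
 * non-alphanumeric ASCII bytes with one dash, trims dashes, and prefixes a
 * leading ASCII digit with "h-". Returns '' when nothing usable remains.
 */
function sketchNormalizeId(string $text): string
{
    // Bytes >= 0x80 (UTF-8 multibyte sequences) never match this class,
    // so non-ASCII text passes through unchanged.
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? $text;
    $id = trim($id, '-');

    if ($id === '') {
        return '';
    }

    return preg_match('/^[0-9]/', $id) === 1 ? 'h-' . $id : $id;
}

/**
 * Hypothetical stand-in for the fallback branch of generateId(): advances
 * the counter until "s-N" is free, then records the ID it returns.
 *
 * @param array<string, true> $usedIds IDs already taken (updated in place)
 */
function sketchFallbackId(array &$usedIds, int &$counter): string
{
    do {
        $counter++;
        $id = 's-' . $counter;
    } while (isset($usedIds[$id]));

    $usedIds[$id] = true;

    return $id;
}

// Usage: "s-1" is pre-registered, as an explicit id would be.
$used = ['s-1' => true];
$counter = 0;

foreach (['foo_bar baz', 'Über uns', '2024 recap', '!!!'] as $text) {
    $id = sketchNormalizeId($text);
    if ($id === '') {
        $id = sketchFallbackId($used, $counter);
    }
    echo $text, ' => ', $id, PHP_EOL;
}
// Prints: foo-bar-baz, Über-uns, h-2024-recap, s-2
```

Because the byte class is ASCII-only, the pattern needs no `/u` flag, which is the same design choice the patch documents in `normalizeId()`.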