diff --git a/docs/reference/enhancements.md b/docs/reference/enhancements.md
index 984640db..09d36829 100644
--- a/docs/reference/enhancements.md
+++ b/docs/reference/enhancements.md
@@ -9,7 +9,7 @@ They are either on the way to get incorporated upstream - or may be incorporated
 
 - [Tab Indentation Support](#tab-indentation-support)
 - [Multiple Footnote References](#multiple-footnote-references)
-- [Section ID Excludes Footnote Markers](#section-id-excludes-footnote-markers)
+- [Section ID Excludes Footnote Markers and Symbols](#section-id-excludes-footnote-markers-and-symbols)
 - [CSS-Safe Heading IDs](#css-safe-heading-ids)
 - [Symbol Parsing in Time Formats](#symbol-parsing-in-time-formats)
 - [Em/En Dash with Unmatched Braces](#em-en-dash-with-unmatched-braces)
@@ -117,13 +117,15 @@ and third1.
 
 ---
 
-## Section ID Excludes Footnote Markers
+## Section ID Excludes Footnote Markers and Symbols
 
-**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349)
+**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349), [jgm/djot#393](https://github.com/jgm/djot/pull/393)
 
 **Status:** Implemented in djot-php
 
-Auto-generated section IDs correctly exclude footnote reference markers:
+Per the djot spec, an auto-generated identifier is formed from the plain text
+content of the heading *"excluding non-textual elements such as footnote
+references and symbols"*. djot-php excludes both:
 
 ```djot
 # Introduction[^1]
@@ -140,6 +142,16 @@ Auto-generated section IDs correctly exclude footnote reference markers:
 
 The ID is `Introduction`, not `Introduction1` or `Introduction[^1]`.
 
+Symbols are likewise dropped from the identifier (but kept in the
+human-readable plain text used for things like TOC labels):
+
+```djot
+# Release notes :tada:
+```
+
+The ID is `Release-notes`, not `Release-notes-tada`. A heading whose only
+content is a symbol falls back to a generated `s-N` ID.
+
 ---
 
 ## CSS-Safe Heading IDs
@@ -219,19 +231,23 @@ Explicit IDs are used as-is without normalization.
 
 ### Spec Alignment
 
-The djot spec's wording on auto-ID generation is being clarified in [jgm/djot#391](https://github.com/jgm/djot/issues/391). djot-php's normalization aligns with the proposed direction in most respects and deliberately deviates in two places — both motivated by producing valid CSS identifiers for `querySelector()` consumers.
+The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 only changes the spec **prose** — the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is actually broader than djot.js itself: it would also strip `_`, which djot.js keeps.
+
+djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.
 
-| Aspect | djot.js / djoths (proposed spec) | djot-php |
-|--------|---------------------------------|----------|
-| Mid-word punctuation (`A+B=C`) | replace with `-` → `A-B-C` | replace with `-` → `A-B-C` |
-| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
-| Consecutive punctuation (`foo...bar`) | collapse to single `-` → `foo-bar` | collapse to single `-` → `foo-bar` |
-| Apostrophe (`That's all`) | preserve → `That's-all` | replace with `-` → `That-s-all` |
-| Double quote / `;` / `:` | preserve | replace with `-` |
-| Leading digit (`2024 recap`) | unspecified | prefix with `h-` → `h-2024-recap` |
-| Empty result (`!!!`) | unspecified | fallback → `heading` |
+| Aspect | djot.js reference impl | #393 spec prose | djot-php |
+|--------|------------------------|-----------------|----------|
+| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
+| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
+| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
+| Underscore (`foo_bar`) | keep → `foo_bar` | strip → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
+| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
+| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
+| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
+| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
+| Symbols / footnote refs | excluded | excluded | excluded |
 
-The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per the spec would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in spec gaps that other implementations handle inconsistently.
+The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
 
 ---
 
diff --git a/src/Renderer/HeadingIdTracker.php b/src/Renderer/HeadingIdTracker.php
index 20c23907..78bdd017 100644
--- a/src/Renderer/HeadingIdTracker.php
+++ b/src/Renderer/HeadingIdTracker.php
@@ -6,6 +6,7 @@
 
 use Djot\Node\Block\Heading;
 use Djot\Node\Inline\Code;
+use Djot\Node\Inline\FootnoteRef;
 use Djot\Node\Inline\HardBreak;
 use Djot\Node\Inline\Math;
 use Djot\Node\Inline\SoftBreak;
@@ -146,11 +147,21 @@ public function getPlainText(Node $node): string
 
     /**
      * Recursively extract plain text from a node tree
+     *
+     * Footnote references (`[^label]`) are always skipped. When $forId is true,
+     * symbols (`:name:`) are skipped as well, since the djot spec excludes both
+     * from auto-generated heading identifiers (see jgm/djot#393). When $forId is
+     * false, symbols are kept as `:name:` so the result stays human-readable
+     * (e.g. for TOC labels).
      */
-    protected function extractPlainText(Node $node): string
+    protected function extractPlainText(Node $node, bool $forId = false): string
     {
         $text = '';
         foreach ($node->getChildren() as $child) {
+            if ($child instanceof FootnoteRef) {
+                continue;
+            }
+
             if ($child instanceof Text) {
                 $text .= $child->getContent();
             } elseif ($child instanceof SoftBreak || $child instanceof HardBreak) {
@@ -158,9 +169,11 @@ protected function extractPlainText(Node $node): string
             } elseif ($child instanceof Code || $child instanceof Math) {
                 $text .= $child->getContent();
             } elseif ($child instanceof Symbol) {
-                $text .= ':' . $child->getName() . ':';
+                if (!$forId) {
+                    $text .= ':' . $child->getName() . ':';
+                }
             } elseif ($child instanceof Node) {
-                $text .= $this->extractPlainText($child);
+                $text .= $this->extractPlainText($child, $forId);
             }
         }
 
@@ -195,17 +208,23 @@ protected function generateId(Heading $node): string
             return $id;
         }
 
-        // Generate from heading text
-        $headingText = $this->getPlainText($node);
+        // Warm the plain-text cache so display consumers (TOC, permalinks)
+        // still see the pre-mutation text including symbols.
+        $this->getPlainText($node);
+
+        // The identifier itself is formed from the plain text content
+        // excluding non-textual elements such as symbols and footnote
+        // references (jgm/djot#393).
+        $idText = $this->extractPlainText($node, forId: true);
 
-        if ($headingText === '') {
+        if ($idText === '') {
             // Generate fallback ID
             $this->sectionCounter++;
 
             return 's-' . $this->sectionCounter;
         }
 
-        $baseId = $this->normalizeId($headingText);
+        $baseId = $this->normalizeId($idText);
 
         // Track and deduplicate
         if (!isset($this->usedIds[$baseId])) {
diff --git a/tests/TestCase/Renderer/HeadingIdTrackerTest.php b/tests/TestCase/Renderer/HeadingIdTrackerTest.php
index 6476323a..8ab68f13 100644
--- a/tests/TestCase/Renderer/HeadingIdTrackerTest.php
+++ b/tests/TestCase/Renderer/HeadingIdTrackerTest.php
@@ -5,9 +5,11 @@
 namespace Djot\Test\TestCase\Renderer;
 
 use Djot\Node\Block\Heading;
+use Djot\Node\Inline\FootnoteRef;
 use Djot\Node\Inline\HardBreak;
 use Djot\Node\Inline\SoftBreak;
 use Djot\Node\Inline\Strong;
+use Djot\Node\Inline\Symbol;
 use Djot\Node\Inline\Text;
 use Djot\Renderer\HeadingIdTracker;
 use PHPUnit\Framework\TestCase;
@@ -278,4 +280,85 @@ public function testGetIdForHeadingAlsoCachesPlainText(): void
         $text = $this->tracker->getPlainText($heading);
         $this->assertSame('Title', $text);
     }
+
+    /**
+     * The djot spec (and jgm/djot#393) says auto-generated identifiers are formed
+     * from the plain text content "excluding non-textual elements such as footnote
+     * references and symbols". A symbol must not leak into the ID.
+     */
+    public function testSymbolsExcludedFromId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction '));
+        $heading->appendChild(new Symbol('smile'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('Introduction', $id);
+    }
+
+    public function testHeadingWithOnlySymbolGetsFallbackId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Symbol('tada'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('s-1', $id);
+    }
+
+    public function testSymbolBetweenWordsDoesNotProduceStrayDashes(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Build'));
+        $heading->appendChild(new Symbol('rocket'));
+        $heading->appendChild(new Text('Status'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('BuildStatus', $id);
+    }
+
+    /**
+     * Footnote references are likewise excluded from the identifier:
+     * `# Introduction[^1]` generates `Introduction`, not `Introduction1`.
+     */
+    public function testFootnoteReferenceExcludedFromId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction'));
+        $heading->appendChild(new FootnoteRef('1'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('Introduction', $id);
+    }
+
+    /**
+     * Symbols are still part of the human-readable plain text (e.g. for TOC
+     * labels); only the *identifier* excludes them. This pins that boundary.
+     */
+    public function testSymbolsRetainedInPlainText(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction '));
+        $heading->appendChild(new Symbol('smile'));
+
+        $this->assertSame('Introduction :smile:', $this->tracker->getPlainText($heading));
+    }
+
+    /**
+     * djot.js keeps `_` (it is not in its punctuation denylist) and it is a
+     * valid CSS identifier character, so djot-php keeps it too. This pins the
+     * deliberate divergence from the looser #393 spec prose.
+     */
+    public function testUnderscoreRetainedInId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('foo_bar baz'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('foo_bar-baz', $id);
+    }
 }