54 changes: 29 additions & 25 deletions docs/reference/enhancements.md
@@ -156,35 +156,35 @@ content is a symbol falls back to a generated `s-N` ID.

## CSS-Safe Heading IDs

**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [jgm/djot#391](https://github.com/jgm/djot/issues/391)
**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [jgm/djot#391](https://github.com/jgm/djot/issues/391), [jgm/djot#393](https://github.com/jgm/djot/pull/393)

**Status:** Implemented in djot-php

Auto-generated heading IDs are normalized to be valid CSS selectors, ensuring compatibility with `querySelector()`, HTMX scroll restoration, and CSS attribute selectors.
Auto-generated heading IDs follow the djot spec rule settled in [jgm/djot#393](https://github.com/jgm/djot/pull/393), with two small deviations that keep every ID a valid CSS selector — ensuring compatibility with `querySelector()`, HTMX scroll restoration, and CSS attribute selectors.

### Normalization Rules

1. **Strip `#` characters** — Prevents invalid selectors
2. **Trim whitespace** — Clean leading/trailing spaces
3. **Whitespace to dashes** — Spaces become single `-`
4. **Invalid characters to dashes** — Only Unicode letters (`\p{L}`), numbers (`\p{N}`), hyphens, and underscores are preserved
5. **Collapse consecutive dashes** — `foo--bar` becomes `foo-bar`
6. **Trim leading/trailing dashes** — `-foo-` becomes `foo`
7. **Prefix digits** — IDs starting with a number get `h-` prefix (CSS requirement)
8. **Fallback** — Empty results become `heading`
1. **Exclude non-textual content** — Symbols (`:name:`) and footnote references do not contribute to the ID
2. **Replace non-alphanumeric ASCII** — Each maximal run of non-alphanumeric ASCII characters (anything except `A–Z`, `a–z`, `0–9`) becomes a single `-`
3. **Preserve non-ASCII** — Unicode letters, digits, punctuation, and symbols are kept verbatim (they are valid CSS identifier code points)
4. **Trim leading/trailing dashes** — `-foo-` becomes `foo`
5. **Prefix digits** *(CSS deviation)* — A result starting with an ASCII digit gets an `h-` prefix (a CSS identifier cannot start with a digit)
6. **Empty fallback** *(CSS deviation)* — A result with no usable content falls back to a generated `s-N` identifier (matching djot.js)
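
The rules above can be sketched as a small standalone function. This is a minimal illustration, assuming the text passed in has already had symbols and footnote references removed (rule 1). The name `sketchNormalizeId` is hypothetical and is not part of the djot-php API. It returns `''` for an empty result so that a caller can choose the fallback (rule 6).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper mirroring rules 2-5 of the list above.
 * Returns '' when no usable content remains.
 */
function sketchNormalizeId(string $text): string
{
    // Rule 2: every maximal run of non-alphanumeric ASCII bytes becomes one '-'.
    // Rule 3: bytes >= 0x80 (UTF-8 multibyte sequences) are never matched,
    // so non-ASCII text passes through unchanged.
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? $text;

    // Rule 4: trim leading and trailing dashes.
    $id = trim($id, '-');

    if ($id === '') {
        // Rule 6 is left to the caller (for example, a generated s-N id).
        return '';
    }

    // Rule 5: a CSS identifier cannot start with an ASCII digit.
    if (preg_match('/^[0-9]/', $id) === 1) {
        $id = 'h-' . $id;
    }

    return $id;
}
```

Under these assumptions, `sketchNormalizeId('E=mc^2')` yields `E-mc-2` and `sketchNormalizeId('123 Numbers First')` yields `h-123-Numbers-First`.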

### Examples

| Heading | Generated ID |
|---------|--------------|
| `# Hello World` | `Hello-World` |
| `# Hello World!` | `Hello-World` |
| `# foo_bar` | `foo-bar` |
| `# 日本語の見出し` | `日本語の見出し` |
| `# Привет мир` | `Привет-мир` |
| `# Spec — Notes` | `Spec-—-Notes` |
| `# E=mc^2` | `E-mc-2` |
| `# 123 Numbers First` | `h-123-Numbers-First` |
| `# $this->method()` | `this-method` |
| `# ###` | `heading` |
| `# ###` | `s-1` |

### Unicode Preservation

@@ -231,23 +231,27 @@ Explicit IDs are used as-is without normalization.

### Spec Alignment

The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 only changes the spec **prose** — the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is actually broader than djot.js itself: it would also strip `_`, which djot.js keeps.
The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*.

djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.
djot-php follows that **prose verbatim**: every non-alphanumeric ASCII run (including `_`, `'`, `"`, `:`, `;`) becomes `-`, and every non-ASCII code point — letters, digits, and punctuation/symbols alike — is preserved. The only two deviations are the CSS-validity adjustments (leading-digit `h-` prefix, empty → `s-N`), which the spec leaves unspecified anyway.

| Aspect | djot.js reference impl | #393 spec prose | djot-php |
|--------|------------------------|-----------------|----------|
| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
| Underscore (`foo_bar`) | keep → `foo_bar` | strip → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
| Symbols / footnote refs | excluded | excluded | excluded |
| Aspect | #393 spec prose | djot-php |
|--------|-----------------|----------|
| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` |
| Underscore (`foo_bar`) | replace → `foo-bar` | replace → `foo-bar` |
| Apostrophe (`That's all`) | replace → `That-s-all` | replace → `That-s-all` |
| Double quote / `;` / `:` | replace → `-` | replace → `-` |
| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` |
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
| Non-ASCII punctuation / smart quotes (`Bob’s Guide`) | preserve → `Bob’s-Guide` | preserve → `Bob’s-Guide` |
| Leading digit (`2024 recap`) | `2024-recap` | **`h-2024-recap`** *(CSS deviation)* |
| Empty result (`!!!`, `###`) | unspecified | **`s-N` fallback** *(CSS deviation)* |
| Symbols / footnote refs | excluded | excluded |

The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
> [!NOTE]
> #393 only changed the spec **prose**, not the djot.js reference implementation, which still uses a fixed punctuation denylist that *preserves* `_ ' " : ;` and does not collapse `-` runs. So djot-php (following the prose) intentionally differs from current djot.js output on those ASCII characters. This prose-vs-implementation gap is tracked upstream in [jgm/djot#391](https://github.com/jgm/djot/issues/391). When smart punctuation is enabled, both the prose *and* djot.js preserve the resulting non-ASCII quotes, so `# Bob's Guide` → `id="Bob’s-Guide"`. The heading-reference extension stays consistent because the link target runs through the same normalization.

The two CSS-validity deviations are deliberate: a leading digit makes `querySelector('#2col')` a syntax error, and an empty/sentinel id is worse than a stable generated `s-N`. Everything else is verbatim #393.
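
The `s-N` fallback only stays stable if it skips identifiers that are already taken, whether by an explicit ID or by a heading whose text normalizes to the same value. A minimal sketch of that loop, using a hypothetical free function and plain variables in place of the tracker's internal state:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the next unused `s-N` id and records it.
 *
 * @param array<string, int> $usedIds Ids already taken (keys); passed by reference.
 * @param int $counter Running section counter; passed by reference.
 */
function sketchNextFallbackId(array &$usedIds, int &$counter): string
{
    // Advance the counter until the candidate id is not already in use.
    do {
        $counter++;
        $candidate = 's-' . $counter;
    } while (isset($usedIds[$candidate]));

    // Record the id so later headings cannot reuse it.
    $usedIds[$candidate] = 0;

    return $candidate;
}
```

With `s-1` already tracked, the first call in this sketch returns `s-2` and the next returns `s-3`.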

---

68 changes: 41 additions & 27 deletions src/Renderer/HeadingIdTracker.php
@@ -88,37 +88,43 @@ public function trackId(string $id): void
}

/**
* Normalize text into a valid CSS identifier string
* Normalize heading text into an identifier (jgm/djot#393)
*
* 1. Strip # characters entirely
* 2. Trim whitespace
* 3. Replace whitespace sequences (including Unicode spaces) with single dashes
* 4. Replace any remaining characters that are invalid in CSS identifiers
* (anything other than Unicode letters/numbers, hyphens, and underscores)
* with dashes
* 5. Collapse consecutive dashes and trim leading/trailing dashes
* 6. Prefix with 'h-' if the result starts with a digit, ensuring a valid
* CSS ident start (digits are not allowed as the first character)
* Each maximal run of non-alphanumeric ASCII characters is replaced with
* a single `-`, and leading/trailing `-` are trimmed. Non-ASCII
* characters (Unicode letters, digits, punctuation, symbols) are
* preserved verbatim — they fall outside the spec's ASCII replacement
* set and are valid CSS identifier code points.
*
* Producing a valid CSS identifier ensures that consumers such as HTMX,
* which call `querySelector` with the section ID for scroll-restoration,
* do not throw a SyntaxError when headings contain inline code or special
* characters (e.g. `$this->t($key, $params = [], $fallback = '')`).
* Two deliberate, documented deviations keep the result a valid CSS
* identifier for `querySelector()` / HTMX consumers:
* - a leading ASCII digit gets an `h-` prefix (a CSS identifier cannot
* start with a digit);
* - an empty result is returned as `''` so the caller can fall back to
* a generated `s-N` identifier (matching djot.js), rather than a
* literal sentinel.
*
* @return string The identifier, or '' when the text has no usable content.
*/
public function normalizeId(string $text): string
{
$id = str_replace('#', '', $text);
$id = trim($id);
$id = preg_replace('/\s+/u', '-', $id) ?? $id;
$id = preg_replace('/[^\p{L}\p{N}_-]+/u', '-', $id) ?? $id;
$id = preg_replace('/-{2,}/', '-', $id) ?? $id;
// 0x30-0x39 = 0-9, 0x41-0x5A = A-Z, 0x61-0x7A = a-z. Every other
// byte in 0x00-0x7F is non-alphanumeric ASCII; bytes >= 0x80 (all
// UTF-8 multibyte sequences) are left untouched so non-ASCII text is
// preserved. No /u flag: the class only ever matches single ASCII
// bytes, never a continuation byte of a multibyte character.
$id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+/', '-', $text) ?? $text;
$id = trim($id, '-');

if ($id !== '' && preg_match('/^\p{N}/u', $id)) {
if ($id === '') {
return '';
}

if (preg_match('/^[0-9]/', $id) === 1) {
$id = 'h-' . $id;
}

return $id !== '' ? $id : 'heading';
return $id;
}

/**
@@ -216,15 +222,23 @@ protected function generateId(Heading $node): string
// excluding non-textual elements such as symbols and footnote
// references (jgm/djot#393).
$idText = $this->extractPlainText($node, forId: true);
$baseId = $this->normalizeId($idText);

if ($idText === '') {
// Generate fallback ID
$this->sectionCounter++;
if ($baseId === '') {
// No usable content (empty heading, or text that is entirely
// ASCII punctuation) — fall back to a generated `s-N`
// identifier, matching djot.js. Skip any `s-N` already taken by
// an explicit id or a heading whose text normalizes to it, so
// the fallback never produces a duplicate.
do {
$this->sectionCounter++;
$baseId = 's-' . $this->sectionCounter;
} while (isset($this->usedIds[$baseId]));

return 's-' . $this->sectionCounter;
}
$this->usedIds[$baseId] = 0;

$baseId = $this->normalizeId($idText);
return $baseId;
}

// Track and deduplicate
if (!isset($this->usedIds[$baseId])) {
16 changes: 11 additions & 5 deletions tests/TestCase/Extension/HeadingReferenceExtensionTest.php
@@ -172,16 +172,19 @@ public function testHeadingWithSmartQuotesMatchesStraightQuoteReference(): void
$converter = new DjotConverter();
$converter->addExtension(new HeadingReferenceExtension());

// The parser converts straight quotes to smart quotes in heading text,
// but reference targets keep straight quotes. The extension normalizes
// quotes for matching so this should resolve correctly.
// The parser converts straight quotes to smart quotes in heading text.
// Per jgm/djot#393 the resulting non-ASCII quote characters are
// preserved in the identifier; the reference still resolves because
// both the heading ID and the link target run through the same
// normalization, so the href must equal the section id verbatim.
$html = $converter->convert(<<<'DJOT'
See [[Say "Hello"]].

# Say "Hello"
DJOT);

$this->assertStringContainsString('href="#Say-Hello"', $html);
$this->assertStringContainsString('id="Say-“Hello”"', $html);
$this->assertStringContainsString('href="#Say-“Hello”"', $html);
$this->assertStringNotContainsString('[[Say "Hello"]]', $html);
}

@@ -210,7 +213,10 @@ public function testHeadingWithApostropheResolvesCorrectly(): void
# Bob's Guide
DJOT);

$this->assertStringContainsString('href="#Bob-s-Guide"', $html);
// Smart-punctuation turns the apostrophe into U+2019, which jgm/djot#393
// preserves (non-ASCII). The href must match the generated section id.
$this->assertStringContainsString('id="Bob’s-Guide"', $html);
$this->assertStringContainsString('href="#Bob’s-Guide"', $html);
$this->assertStringNotContainsString('data-heading-ref=', $html);
$this->assertStringNotContainsString('[[Bob\'s Guide]]', $html);
}
90 changes: 78 additions & 12 deletions tests/TestCase/Renderer/HeadingIdTrackerTest.php
@@ -160,29 +160,36 @@ public function testNormalizeId(): void
$this->assertSame('this-t-key-params-fallback', $this->tracker->normalizeId("\$this->t(\$key, \$params = [], \$fallback = '')"));
$this->assertSame('My-title', $this->tracker->normalizeId('My --- title'));
$this->assertSame('日本語の見出し', $this->tracker->normalizeId('日本語の見出し'));
$this->assertSame('heading', $this->tracker->normalizeId('###'));
$this->assertSame('', $this->tracker->normalizeId('###'));
$this->assertSame('h-123-Things', $this->tracker->normalizeId('123 Things'));
$this->assertSame('h-1-Introduction', $this->tracker->normalizeId('1. Introduction'));
}

/**
* Pins behaviour discussed in jgm/djot#391 (spec wording on auto-ID generation).
* Pins the auto-ID rule settled in jgm/djot#393: each maximal run of
* non-alphanumeric ASCII characters is replaced with `-`, leading/trailing
* `-` are trimmed, and non-ASCII characters are preserved verbatim.
*
* djot-php sides with djot.js / djoths on remove-vs-replace (mid-word punctuation
* becomes `-`), and deliberately deviates on apostrophes / quotes / `;` / `:` by
* also replacing them, so generated IDs are valid CSS identifiers and safe to use
* with `querySelector()`.
* djot-php follows this prose, including dropping the previous `_`
* exception. The only deliberate deviations are the two CSS-validity
* adjustments (leading-digit `h-` prefix, empty result → `s-N` fallback),
* which the heading-level tests cover.
*/
public function testNormalizeIdSpecAlignmentEdgeCases(): void
{
$this->assertSame('A-B-C', $this->tracker->normalizeId('A+B=C'));
$this->assertSame('Emphasis-strong', $this->tracker->normalizeId('Emphasis/strong'));
$this->assertSame('That-s-all', $this->tracker->normalizeId("That's all"));
$this->assertSame('foo-bar', $this->tracker->normalizeId('foo...bar'));
$this->assertSame('foo-bar-baz', $this->tracker->normalizeId('foo_bar baz'));
$this->assertSame('Uber-uns', $this->tracker->normalizeId('Uber uns'));
$this->assertSame('Über-uns', $this->tracker->normalizeId('Über uns'));
// Non-ASCII punctuation/symbols are not "non-alphanumeric ASCII", so
// they are preserved (and are valid CSS identifier code points).
$this->assertSame('A–B', $this->tracker->normalizeId('A–B'));
$this->assertSame('café—bar', $this->tracker->normalizeId('café—bar'));
$this->assertSame('h-2024-recap', $this->tracker->normalizeId('2024 recap'));
$this->assertSame('heading', $this->tracker->normalizeId('!!!'));
$this->assertSame('', $this->tracker->normalizeId('!!!'));
}

public function testGetPlainText(): void
@@ -348,17 +355,76 @@ public function testSymbolsRetainedInPlainText(): void
}

/**
* djot.js keeps `_` (it is not in its punctuation denylist) and it is a
* valid CSS identifier character, so djot-php keeps it too. This pins the
* deliberate divergence from the looser #393 spec prose.
* Per the jgm/djot#393 wording, `_` is a non-alphanumeric ASCII character
* and is replaced with `-` like any other punctuation (the previous `_`
* exception is gone).
*/
public function testUnderscoreRetainedInId(): void
public function testUnderscoreReplacedInId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('foo_bar baz'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('foo_bar-baz', $id);
$this->assertSame('foo-bar-baz', $id);
}

/**
* A heading whose text normalizes to nothing (all ASCII punctuation)
* falls back to a generated `s-N` identifier, matching djot.js — not the
* literal `heading` sentinel djot-php used previously.
*/
public function testAllPunctuationHeadingGetsFallbackId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('!!!'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('s-1', $id);
}

/**
* The generated `s-N` fallback must not collide with a real heading whose
* text normalizes to the same value (e.g. `# s 1` → `s-1`).
*/
public function testFallbackIdDoesNotCollideWithNormalHeading(): void
{
$punct = new Heading(2);
$punct->appendChild(new Text('!!!'));

$sOne = new Heading(2);
$sOne->appendChild(new Text('s 1'));

$firstId = $this->tracker->getIdForHeading($punct);
$secondId = $this->tracker->getIdForHeading($sOne);

$this->assertSame('s-1', $firstId);
$this->assertNotSame($firstId, $secondId);
}

/**
* The fallback must also avoid explicitly tracked IDs.
*/
public function testFallbackIdAvoidsTrackedExplicitId(): void
{
$this->tracker->trackId('s-1');

$heading = new Heading(2);
$heading->appendChild(new Text('###'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('s-2', $id);
}

public function testNonAsciiPunctuationHeadingIsPreserved(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('Spec — Notes'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('Spec-—-Notes', $id);
}
}