Merged
46 changes: 31 additions & 15 deletions docs/reference/enhancements.md
@@ -9,7 +9,7 @@ They are either on the way to get incorporated upstream - or may be incorporated

- [Tab Indentation Support](#tab-indentation-support)
- [Multiple Footnote References](#multiple-footnote-references)
- [Section ID Excludes Footnote Markers](#section-id-excludes-footnote-markers)
- [Section ID Excludes Footnote Markers and Symbols](#section-id-excludes-footnote-markers-and-symbols)
- [CSS-Safe Heading IDs](#css-safe-heading-ids)
- [Symbol Parsing in Time Formats](#symbol-parsing-in-time-formats)
- [Em/En Dash with Unmatched Braces](#em-en-dash-with-unmatched-braces)
@@ -117,13 +117,15 @@ and third<a id="fnref1-3" href="#fn1" role="doc-noteref"><sup>1</sup></a>.</p>

---

## Section ID Excludes Footnote Markers
## Section ID Excludes Footnote Markers and Symbols

**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349)
**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349), [jgm/djot#393](https://github.com/jgm/djot/pull/393)

**Status:** Implemented in djot-php

Auto-generated section IDs correctly exclude footnote reference markers:
Per the djot spec, an auto-generated identifier is formed from the plain text
content of the heading *"excluding non-textual elements such as footnote
references and symbols"*. djot-php excludes both:

```djot
# Introduction[^1]
@@ -140,6 +142,16 @@ Auto-generated section IDs correctly exclude footnote reference markers:

The ID is `Introduction`, not `Introduction1` or `Introduction[^1]`.

Symbols are likewise dropped from the identifier (but kept in the
human-readable plain text used for things like TOC labels):

```djot
# Release notes :tada:
```

The ID is `Release-notes`, not `Release-notes-tada`. A heading whose only
content is a symbol falls back to a generated `s-N` ID.
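
The split between display text and ID text can be pictured with a small, self-contained sketch. It assumes a hypothetical mini-model in which each inline is a plain array with `type` and `value` keys. This is not the djot-php node API, and `sketchHeadingText` is an illustrative name only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical mini-model (not the djot-php node API): each inline is
 * ['type' => 'text'|'symbol'|'footnote_ref', 'value' => string].
 */
function sketchHeadingText(array $inlines, bool $forId): string
{
    $text = '';
    foreach ($inlines as $inline) {
        switch ($inline['type']) {
            case 'text':
                $text .= $inline['value'];
                break;
            case 'symbol':
                // Symbols show up as :name: in display text, never in ID text.
                if (!$forId) {
                    $text .= ':' . $inline['value'] . ':';
                }
                break;
            case 'footnote_ref':
                // Footnote references contribute no text in either mode.
                break;
        }
    }

    return $text;
}

$heading = [
    ['type' => 'text', 'value' => 'Release notes '],
    ['type' => 'symbol', 'value' => 'tada'],
];

$displayText = sketchHeadingText($heading, false); // "Release notes :tada:"
$idText = sketchHeadingText($heading, true);       // "Release notes "
```

The trailing space left in the ID text becomes a trailing `-` during normalization and is then trimmed, which is why the resulting ID is `Release-notes`. A heading holding only a symbol yields an empty ID text, which triggers the `s-N` fallback.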

---

## CSS-Safe Heading IDs
@@ -219,19 +231,23 @@ Explicit IDs are used as-is without normalization.

### Spec Alignment

The djot spec's wording on auto-ID generation is being clarified in [jgm/djot#391](https://github.com/jgm/djot/issues/391). djot-php's normalization aligns with the proposed direction in most respects and deliberately deviates in two places — both motivated by producing valid CSS identifiers for `querySelector()` consumers.
The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 changes only the spec **prose**: the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is actually broader than djot.js itself, because `_` is a non-alphanumeric ASCII character and would therefore also be replaced with `-`, whereas djot.js keeps it.

djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.

| Aspect | djot.js / djoths (proposed spec) | djot-php |
|--------|---------------------------------|----------|
| Mid-word punctuation (`A+B=C`) | replace with `-` → `A-B-C` | replace with `-` → `A-B-C` |
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
| Consecutive punctuation (`foo...bar`) | collapse to single `-` → `foo-bar` | collapse to single `-` → `foo-bar` |
| Apostrophe (`That's all`) | preserve → `That's-all` | replace with `-` → `That-s-all` |
| Double quote / `;` / `:` | preserve | replace with `-` |
| Leading digit (`2024 recap`) | unspecified | prefix with `h-` → `h-2024-recap` |
| Empty result (`!!!`) | unspecified | fallback → `heading` |
| Aspect | djot.js reference impl | #393 spec prose | djot-php |
|--------|------------------------|-----------------|----------|
| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
| Underscore (`foo_bar`) | keep → `foo_bar` | replace → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
| Symbols / footnote refs | excluded | excluded | excluded |

The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per the spec would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in spec gaps that other implementations handle inconsistently.
The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
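
The djot-php column of the table can be reproduced with a short sketch. The real `normalizeId()` is not part of this diff, so the function below is an assumption-laden illustration rather than the library's implementation: the name `sketchNormalizeId` and the exact regular expression are hypothetical, chosen only to match the rows above.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative sketch only; not the actual djot-php normalizeId().
 */
function sketchNormalizeId(string $text): string
{
    // Replace each maximal run of ASCII characters that are neither
    // alphanumeric nor "_" with a single "-". The match is byte-wise, and
    // UTF-8 multibyte sequences only use bytes >= 0x80, so non-ASCII
    // letters such as "Ü" pass through untouched.
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F]+/', '-', $text) ?? '';

    // Drop leading and trailing dashes.
    $id = trim($id, '-');

    // Nothing usable left (e.g. "!!!"): fall back to a fixed ID.
    if ($id === '') {
        return 'heading';
    }

    // CSS identifiers must not start with a digit, so add a prefix.
    if (preg_match('/^[0-9]/', $id) === 1) {
        $id = 'h-' . $id;
    }

    return $id;
}
```

Under these assumptions, `sketchNormalizeId("That's all")` returns `That-s-all` and `sketchNormalizeId('2024 recap')` returns `h-2024-recap`, so a consumer can pass the ID straight to `querySelector('#' + id)` without escaping.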

---

33 changes: 26 additions & 7 deletions src/Renderer/HeadingIdTracker.php
@@ -6,6 +6,7 @@

use Djot\Node\Block\Heading;
use Djot\Node\Inline\Code;
use Djot\Node\Inline\FootnoteRef;
use Djot\Node\Inline\HardBreak;
use Djot\Node\Inline\Math;
use Djot\Node\Inline\SoftBreak;
@@ -146,21 +147,33 @@ public function getPlainText(Node $node): string

/**
* Recursively extract plain text from a node tree
*
* Footnote references (`[^label]`) never contribute text. When $forId is
* true, symbols (`:name:`) are skipped as well, matching the non-textual
* elements the djot spec excludes from auto-generated heading identifiers
* (see jgm/djot#393). Otherwise symbols are rendered as `:name:` so the
* result stays human-readable (e.g. for TOC labels).
*/
protected function extractPlainText(Node $node): string
protected function extractPlainText(Node $node, bool $forId = false): string
{
$text = '';
foreach ($node->getChildren() as $child) {
if ($child instanceof FootnoteRef) {
continue;
}

if ($child instanceof Text) {
$text .= $child->getContent();
} elseif ($child instanceof SoftBreak || $child instanceof HardBreak) {
$text .= ' ';
} elseif ($child instanceof Code || $child instanceof Math) {
$text .= $child->getContent();
} elseif ($child instanceof Symbol) {
$text .= ':' . $child->getName() . ':';
if (!$forId) {
$text .= ':' . $child->getName() . ':';
}
} elseif ($child instanceof Node) {
$text .= $this->extractPlainText($child);
$text .= $this->extractPlainText($child, $forId);
}
}

@@ -195,17 +208,23 @@ protected function generateId(Heading $node): string
return $id;
}

// Generate from heading text
$headingText = $this->getPlainText($node);
// Warm the plain-text cache so display consumers (TOC, permalinks)
// still see the pre-mutation text including symbols.
$this->getPlainText($node);

// The identifier itself is formed from the plain text content
// excluding non-textual elements such as symbols and footnote
// references (jgm/djot#393).
$idText = $this->extractPlainText($node, forId: true);

if ($headingText === '') {
if ($idText === '') {
// Generate fallback ID
$this->sectionCounter++;

return 's-' . $this->sectionCounter;
}

$baseId = $this->normalizeId($headingText);
$baseId = $this->normalizeId($idText);

// Track and deduplicate
if (!isset($this->usedIds[$baseId])) {
83 changes: 83 additions & 0 deletions tests/TestCase/Renderer/HeadingIdTrackerTest.php
@@ -5,9 +5,11 @@
namespace Djot\Test\TestCase\Renderer;

use Djot\Node\Block\Heading;
use Djot\Node\Inline\FootnoteRef;
use Djot\Node\Inline\HardBreak;
use Djot\Node\Inline\SoftBreak;
use Djot\Node\Inline\Strong;
use Djot\Node\Inline\Symbol;
use Djot\Node\Inline\Text;
use Djot\Renderer\HeadingIdTracker;
use PHPUnit\Framework\TestCase;
@@ -278,4 +280,85 @@ public function testGetIdForHeadingAlsoCachesPlainText(): void
$text = $this->tracker->getPlainText($heading);
$this->assertSame('Title', $text);
}

/**
* The djot spec (and jgm/djot#393) says auto-generated identifiers are formed
* from the plain text content "excluding non-textual elements such as footnote
* references and symbols". A symbol must not leak into the ID.
*/
public function testSymbolsExcludedFromId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('Introduction '));
$heading->appendChild(new Symbol('smile'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('Introduction', $id);
}

public function testHeadingWithOnlySymbolGetsFallbackId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Symbol('tada'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('s-1', $id);
}

public function testSymbolBetweenWordsDoesNotProduceStrayDashes(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('Build'));
$heading->appendChild(new Symbol('rocket'));
$heading->appendChild(new Text('Status'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('BuildStatus', $id);
}

/**
* Footnote references are likewise excluded from the identifier:
* `# Introduction[^1]` generates `Introduction`, not `Introduction1`.
*/
public function testFootnoteReferenceExcludedFromId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('Introduction'));
$heading->appendChild(new FootnoteRef('1'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('Introduction', $id);
}

/**
* Symbols are still part of the human-readable plain text (e.g. for TOC
* labels); only the *identifier* excludes them. This pins that boundary.
*/
public function testSymbolsRetainedInPlainText(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('Introduction '));
$heading->appendChild(new Symbol('smile'));

$this->assertSame('Introduction :smile:', $this->tracker->getPlainText($heading));
}

/**
* djot.js keeps `_` (it is not in its punctuation denylist) and it is a
* valid CSS identifier character, so djot-php keeps it too. This pins the
* deliberate divergence from the looser #393 spec prose.
*/
public function testUnderscoreRetainedInId(): void
{
$heading = new Heading(2);
$heading->appendChild(new Text('foo_bar baz'));

$id = $this->tracker->getIdForHeading($heading);

$this->assertSame('foo_bar-baz', $id);
}
}