diff --git a/docs/extensions/index.md b/docs/extensions/index.md index a9786da9..7bacdd19 100644 --- a/docs/extensions/index.md +++ b/docs/extensions/index.md @@ -7,6 +7,7 @@ Extensions provide a clean way to bundle related customizations together. Each e | Extension | Description | |-----------|-------------| | [AdmonitionExtension](#admonitionextension) | Transforms divs into semantic admonition markup with accessibility support | +| [AsciiHeadingIdsExtension](#asciiheadingidsextension) | Folds auto-generated heading ids to ASCII (`Über` → `Uber`) for URL/CSS-fragment portability | | [AutolinkExtension](#autolinkextension) | Auto-links bare URLs and email addresses | | [CodeGroupExtension](#codegroupextension) | Transforms code-group divs into tabbed code block interfaces | | [DefaultAttributesExtension](#defaultattributesextension) | Adds default attributes to elements by type | @@ -47,6 +48,7 @@ Extensions are applied in registration order. Generally, order doesn't matter, b - **AutolinkExtension** should be registered before **ExternalLinksExtension** if you want auto-linked URLs to also get external link attributes - **TableOfContentsExtension** should be registered before **HeadingPermalinksExtension** if you want clean heading text in the TOC (without permalink symbols) +- **AsciiHeadingIdsExtension** has no ordering requirement relative to the heading extensions (TableOfContents, HeadingPermalinks, HeadingReference) — it installs an id transform at registration that is consumed later during parsing/rendering, so the resulting ids are consistent regardless of registration order Extensions are reset per render, so reusing the same `DjotConverter` across multiple `convert()` calls will not carry per-document extension state such as collected TOC entries into the next output. 
@@ -226,6 +228,31 @@ $converter->addExtension(new AdmonitionExtension( )); ``` +## AsciiHeadingIdsExtension + +By default, auto-generated heading ids follow the [jgm/djot#393](https://github.com/jgm/djot/pull/393) rule and **preserve letter case and non-ASCII characters** (`# Über café` → `id="Über-café"`). That is valid HTML5 and resolves in browsers, but the URL fragment is percent-encoded when shared (`#%C3%9Cber-caf%C3%A9`). + +`AsciiHeadingIdsExtension` folds heading ids to ASCII for maximum portability — clean fragments, friendlier to legacy anchor tooling, and trivially safe as bare CSS/JS selectors: + +```php +use Djot\DjotConverter; +use Djot\Extension\AsciiHeadingIdsExtension; + +$converter = new DjotConverter(); +$converter->addExtension(new AsciiHeadingIdsExtension()); + +$converter->convert("# Über café\n"); +//
… (default would be id="Über-café") +``` + +It applies an ASCII transliteration on top of the spec slug and re-slugs the result, so a transform that reintroduces separators (e.g. CJK romanization) still yields a clean id. The transform is wired to **both** the renderer and the parser's `[Heading][]` reference resolution, so section ids and implicit heading-link targets stay in parity. + +`ext-intl` (ICU) is used when available and romanizes scripts the built-in map does not cover (Greek, CJK, Arabic, …); otherwise a baked Unicode→ASCII map is used. Pass `new AsciiHeadingIdsExtension(useIntl: false)` to force the baked map. A heading whose text reduces to nothing under transliteration falls back to a generated `s-N` id. + +Registration order relative to other heading extensions (TableOfContents, HeadingPermalinks, HeadingReference) does not matter: the transform is installed at registration and consumed later, during parsing and rendering. + +See [Heading ID Generation](/reference/enhancements#heading-id-generation) for the default rule and a full comparison table. + ## ExternalLinksExtension Adds `target="_blank"` and `rel="noopener noreferrer"` to external links (http/https URLs). diff --git a/docs/reference/enhancements.md b/docs/reference/enhancements.md index 149dc093..9ff3f0e9 100644 --- a/docs/reference/enhancements.md +++ b/docs/reference/enhancements.md @@ -155,25 +155,22 @@ content is a symbol falls back to a generated `s-N` ID. 
--- -## CSS-Safe Heading IDs +## Heading ID Generation -**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [jgm/djot#391](https://github.com/jgm/djot/issues/391) +**Related:** [php-collective/djot-php#92](https://github.com/php-collective/djot-php/pull/92), [#224](https://github.com/php-collective/djot-php/pull/224), [jgm/djot#391](https://github.com/jgm/djot/issues/391), [jgm/djot#393](https://github.com/jgm/djot/pull/393) **Status:** Implemented in djot-php -Auto-generated heading IDs are normalized to be valid CSS selectors **and ASCII-only**, so they work with `querySelector()` / HTMX scroll restoration *and* survive being copied around as URL fragments (see [Why ASCII](#why-ascii) below). +Auto-generated heading IDs follow the settled [jgm/djot#393](https://github.com/jgm/djot/pull/393) rule and **preserve letter case and non-ASCII characters**. For ASCII-only anchors, opt in with [`AsciiHeadingIdsExtension`](#ascii-heading-ids-opt-in). -### Normalization Rules +### Normalization Rules (default) -1. **Transliterate to ASCII** — `Über`→`Uber`, `café`→`cafe`, `Привет`→`Privet`, smart quotes/dashes→`'"-` (then replaced) -2. **Strip `#` characters** — Prevents invalid selectors -3. **Trim whitespace** -4. **Whitespace to dashes** — Spaces become single `-` -5. **Invalid characters to dashes** — Anything other than letters, numbers, `-`, `_` becomes `-` -6. **Collapse consecutive dashes** — `foo--bar` becomes `foo-bar` -7. **Trim leading/trailing dashes** -8. **Prefix digits** — IDs starting with a digit get an `h-` prefix (CSS requirement) -9. **Fallback** — Empty results become `heading` (or a generated `s-N` for empty headings) +1. **Replace non-alphanumeric ASCII** — each maximal run of non-alphanumeric ASCII (spaces, punctuation, `_`, runs of `-`) becomes a single `-`. +2. **Trim** leading/trailing `-`. **Letter case and all non-ASCII characters (accented Latin, Cyrillic, CJK, smart quotes, …) are preserved.** +3. 
**Prefix `s-` for a leading digit** — a leading digit is a valid HTML id but an invalid *bare* CSS selector (`querySelector('#9-x')` throws), so it is prefixed. Orthogonal to #393, which governs punctuation only. +4. **Fallback** — an empty result (all-punctuation text) becomes a generated `s-N` id. + +Symbols (`:name:`) and footnote references are excluded from the id text (see [Section ID Excludes Footnote Markers and Symbols](#section-id-excludes-footnote-markers-and-symbols)). ### Examples @@ -181,26 +178,36 @@ Auto-generated heading IDs are normalized to be valid CSS selectors **and ASCII- |---------|--------------| | `# Hello World` | `Hello-World` | | `# Hello World!` | `Hello-World` | -| `# Über uns` | `Uber-uns` | -| `# café résumé` | `cafe-resume` | -| `# Привет мир` | `Privet-mir` | -| `# Bob's Guide` (smart quotes) | `Bob-s-Guide` | +| `# Über uns` | `Über-uns` | +| `# café résumé` | `café-résumé` | +| `# Привет мир` | `Привет-мир` | +| `# under_score` | `under-score` | | `# E=mc^2` | `E-mc-2` | -| `# 123 Numbers First` | `h-123-Numbers-First` | +| `# 123 Numbers First` | `s-123-Numbers-First` | | `# $this->method()` | `this-method` | -| `# ###` | `heading` | +| `# ###` | `s-1` | + +### ASCII heading IDs (opt-in) {#ascii-heading-ids-opt-in} + +Add `AsciiHeadingIdsExtension` to fold ids to ASCII (`Über uns` → `Uber-uns`) for maximum URL/CSS-fragment portability: + +```php +use Djot\Extension\AsciiHeadingIdsExtension; + +$converter->addExtension(new AsciiHeadingIdsExtension()); +``` -### Why ASCII {#why-ascii} +It applies an ASCII transliteration on top of the #393 slug (and re-slugs the result), wired to **both** the renderer and the parser's `[Heading][]` reference resolution so section ids and link targets stay in parity. Registration order relative to other heading extensions does not matter. 
-Heading IDs end up as URL fragments (`…/page#Über-uns`) that get copied into chat, email and other documents, where **auto-linkers re-detect the URL heuristically**. Non-ASCII fragments are routinely: +Unicode ids are valid HTML5 and resolve in browsers (the URL fragment is percent-encoded but functional), so ASCII is a portability choice, not a correctness requirement. You may want it because heading IDs end up as URL fragments (`…/page#Über-uns`) copied into chat, email and other documents, where **auto-linkers re-detect the URL heuristically**. Non-ASCII fragments are routinely: - **truncated** — the link is cut at the first non-ASCII byte (`#Über` → `#`), producing a silent dead link; - **percent-encoded inconsistently** — `’`→`%E2%80%99`, bloating and sometimes breaking the link; - **re-normalized differently** by the receiving app (NFC/NFD), so the pasted fragment no longer matches the page's `id`. -Transliterating to ASCII keeps shared deep links robust. It's a deliberate deviation from both the djot.js reference and the [jgm/djot#393](https://github.com/jgm/djot/pull/393) spec prose (both preserve non-ASCII) — see [Spec Alignment](#spec-alignment). +Transliterating to ASCII keeps such shared deep links robust. This is what the extension opts into; the default (no extension) preserves non-ASCII to match the spec — see [Spec Alignment](#spec-alignment). -### Transliteration engine & determinism +### Transliteration engine & determinism (extension) Two engines produce the ASCII form: @@ -223,23 +230,23 @@ Explicit IDs are used as-is without normalization or transliteration. ### Spec Alignment {#spec-alignment} -The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. 
#393 changed only the spec **prose**; the djot.js reference implementation is unchanged. +The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. -djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — additionally replaces `' " ; :` so IDs are valid CSS identifiers, and **transliterates non-ASCII to ASCII** so IDs stay link-safe when shared. The last point is a deliberate deviation from *both* djot.js and the #393 prose, justified by the [Why ASCII](#why-ascii) failure mode. +djot-php's **default now matches #393**: it replaces every maximal run of non-alphanumeric ASCII (including `_`, `'`, `"`, `;`, `:`) with `-`, and preserves letter case and all non-ASCII characters. The only additions are orthogonal to #393's punctuation rule: a leading-digit `s-` prefix (CSS-selector safety) and an `s-N` fallback for empty results. ASCII transliteration — the previous always-on behavior — is now opt-in via `AsciiHeadingIdsExtension`. 
-| Aspect | djot.js reference impl | #393 spec prose | djot-php | -|--------|------------------------|-----------------|----------| +| Aspect | djot.js / #393 | djot-php default | with AsciiHeadingIdsExtension | +|--------|----------------|------------------|-------------------------------| | Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` | -| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` | -| Underscore (`foo_bar`) | keep → `foo_bar` | strip → `foo-bar` | keep → `foo_bar` (CSS-valid, link-safe) | -| Apostrophe / `"` / `;` / `:` | preserve | replace | replace → `-` (CSS-safe) | -| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | **transliterate → `Uber-uns`** (link-safe) | -| Non-ASCII / smart quotes (`Bob’s`) | preserve → `Bob’s` | preserve → `Bob’s` | **transliterate → `Bob-s`** (link-safe) | -| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) | -| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` | +| Consecutive punctuation (`foo...bar`) | `foo-bar` | `foo-bar` | `foo-bar` | +| Underscore (`foo_bar`) | `foo-bar` | `foo-bar` | `foo-bar` | +| Apostrophe / `"` / `;` / `:` | replace → `-` | replace → `-` | replace → `-` | +| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | **fold → `Uber-uns`** | +| Smart quotes (`Bob’s`) | preserve → `Bob’s` | preserve → `Bob’s` | **fold → `Bob-s`** | +| Leading digit (`2024 recap`) | `2024-recap` | prefix → `s-2024-recap` | `s-2024-recap` | +| Empty result (`!!!`) | `s-N` family | `s-N` | `s-N` | | Symbols / footnote refs | excluded | excluded | excluded | -The deviations are deliberate: `' " ; :` are not valid in unescaped CSS identifiers, and non-ASCII fragments break when shared (see [Why ASCII](#why-ascii)). 
The leading-digit and empty-result behaviors fill in gaps the spec and reference handle inconsistently. A note proposing the spec clarify the non-ASCII question is tracked against [jgm/djot#391](https://github.com/jgm/djot/issues/391). +The default is spec-faithful; the leading-digit `s-` prefix and `s-N` fallback fill in cases #393 leaves to the implementation. The ASCII-folding column is only active when the extension is registered. --- diff --git a/src/Extension/AsciiHeadingIdsExtension.php b/src/Extension/AsciiHeadingIdsExtension.php new file mode 100644 index 00000000..944d825e --- /dev/null +++ b/src/Extension/AsciiHeadingIdsExtension.php @@ -0,0 +1,47 @@ + Uber, café -> cafe, Привет -> + * Privet) for maximum URL/CSS-fragment portability. + * + * By default djot-php generates spec-faithful ids (jgm/djot#393) that preserve + * non-ASCII characters. Adding this extension layers an ASCII transliteration on top + * of that, as a pluggable id transform - it does not fork the core slugger. + * + * The transform is wired to BOTH the renderer's HeadingIdTracker and the parser's + * heading-reference resolution pass, so `
` values and implicit + * `[Heading][]` link targets stay in parity. + */ +class AsciiHeadingIdsExtension implements ExtensionInterface +{ + /** + * @param bool|null $useIntl Force the transliteration engine; null auto-detects + * ext-intl (ICU) and otherwise uses the built-in baked map. + */ + public function __construct(protected ?bool $useIntl = null) + { + } + + public function register(DjotConverter $converter): void + { + $transliterator = new AsciiTransliterator($this->useIntl); + $transform = static fn (string $id): string => $transliterator->transliterate($id); + + // Renderer side (section ids). getHeadingIdTracker() only exists for HTML. + if ($converter->getRenderer() instanceof HtmlRenderer) { + $converter->getHeadingIdTracker()->setIdTransformer($transform); + } + + // Parser side (implicit [Heading][] reference resolution) - keeps the link + // targets identical to the rendered section ids. + $converter->getParser()->setHeadingIdTransformer($transform); + } +} diff --git a/src/Parser/BlockParser.php b/src/Parser/BlockParser.php index f4ba7ae2..2e2040cb 100644 --- a/src/Parser/BlockParser.php +++ b/src/Parser/BlockParser.php @@ -4,6 +4,7 @@ namespace Djot\Parser; +use Closure; use Djot\Exception\ParseException; use Djot\Exception\ParseWarning; use Djot\Node\Block\BlockQuote; @@ -193,6 +194,16 @@ class BlockParser */ protected bool $nestedListsWithoutBlankLine = false; + /** + * Optional heading-id transform, mirrored from the renderer's + * HeadingIdTracker so parser-side id computation (heading reference + * resolution) matches the rendered ids. Set by AsciiHeadingIdsExtension. 
+ * + * @var \Closure|null + * @phpstan-var (\Closure(string): string)|null + */ + protected ?Closure $headingIdTransformer = null; + public function __construct( bool $collectWarnings = false, bool $strictMode = false, @@ -214,6 +225,18 @@ public function __construct( $this->fencedBlockParser = new FencedBlockParser(); } + /** + * Set the optional heading-id transform used by the heading-reference + * resolution pass, so parser-computed ids match the renderer's. Set by + * AsciiHeadingIdsExtension to keep ids and `[Heading][]` links in parity. + * + * @param \Closure(string): string|null $transformer + */ + public function setHeadingIdTransformer(?Closure $transformer): void + { + $this->headingIdTransformer = $transformer; + } + /** * Enable or disable significant newlines mode. * @@ -697,7 +720,7 @@ protected function extractAbbreviations(array $lines): void */ protected function extractHeadingReferences(array $lines): void { - $headingIdTracker = new HeadingIdTracker(); + $headingIdTracker = new HeadingIdTracker($this->headingIdTransformer); $pendingId = null; $count = count($lines); @@ -780,7 +803,7 @@ protected function extractHeadingReferences(array $lines): void */ protected function rewriteHeadingReferences(Document $document): void { - $tracker = new HeadingIdTracker(); + $tracker = new HeadingIdTracker($this->headingIdTransformer); $tracker->reserveExplicitIds($document); /** @var array $newUrlByLabel */ diff --git a/src/Renderer/HeadingIdTracker.php b/src/Renderer/HeadingIdTracker.php index 61bab9fb..5608d4fa 100644 --- a/src/Renderer/HeadingIdTracker.php +++ b/src/Renderer/HeadingIdTracker.php @@ -4,6 +4,7 @@ namespace Djot\Renderer; +use Closure; use Djot\Node\Block\Heading; use Djot\Node\Inline\Code; use Djot\Node\Inline\FootnoteRef; @@ -55,11 +56,25 @@ class HeadingIdTracker */ protected array $resolvedTexts = []; - protected AsciiTransliterator $transliterator; + /** + * @param \Closure(string): string|null $idTransformer Optional transform applied 
+ * to the spec-normalized id (e.g. ASCII transliteration for URL/CSS + * portability). Null (default) leaves the jgm/djot#393 unicode-preserving id + * unchanged. Set via an extension such as AsciiHeadingIdsExtension. + */ + public function __construct(protected ?Closure $idTransformer = null) + { + } - public function __construct(?AsciiTransliterator $transliterator = null) + /** + * Set the optional id transform (see the constructor). Used by extensions to + * adjust generated ids without forking the core spec slugger. + * + * @param \Closure(string): string|null $idTransformer + */ + public function setIdTransformer(?Closure $idTransformer): void { - $this->transliterator = $transliterator ?? new AsciiTransliterator(); + $this->idTransformer = $idTransformer; } /** @@ -115,47 +130,45 @@ public function reserveExplicitIds(Node $node): void } /** - * Normalize text into a valid, link-safe CSS identifier string - * - * 1. Transliterate to ASCII (Über → Uber, café → cafe, Привет → Privet), - * so the ID survives being shared as a URL fragment through - * auto-linkers that truncate or mangle non-ASCII - * 2. Strip # characters entirely - * 3. Trim whitespace - * 4. Replace whitespace sequences with single dashes - * 5. Replace any remaining characters invalid in CSS identifiers - * (anything other than letters, numbers, hyphens, and underscores) - * with dashes - * 6. Collapse consecutive dashes and trim leading/trailing dashes - * 7. Prefix with 'h-' if the result starts with a digit, ensuring a valid - * CSS ident start (digits are not allowed as the first character) + * Normalize heading text into an identifier (jgm/djot#393) * - * Returns '' when nothing usable remains (e.g. all-punctuation text, or a - * script the transliterator cannot reduce to ASCII); the caller then - * falls back to a generated `s-N` id. + * 1. 
Slug the text: replace each maximal run of non-alphanumeric ASCII with a + * single '-' and trim; non-ASCII characters and letter case are preserved. + * 2. If an id transform is set (e.g. ASCII transliteration via + * AsciiHeadingIdsExtension), apply it to the slug and re-slug the result + * (the transform may reintroduce spaces/punctuation, e.g. romanization). + * 3. Prefix with 's-' if the result starts with a digit, so the id is a valid + * bare CSS selector (querySelector('#9-x') would otherwise throw). This is + * orthogonal to #393, which governs punctuation only. * - * Producing a valid CSS identifier ensures that consumers such as HTMX, - * which call `querySelector` with the section ID for scroll-restoration, - * do not throw a SyntaxError when headings contain inline code or special - * characters (e.g. `$this->t($key, $params = [], $fallback = '')`). + * Returns '' when nothing usable remains (all-punctuation text, or a transform + * that reduces the text to nothing); the caller then falls back to a generated + * `s-N` id. */ public function normalizeId(string $text): string { - $id = $this->transliterator->transliterate($text); - $id = str_replace('#', '', $id); - $id = trim($id); - $id = preg_replace('/\s+/u', '-', $id) ?? $id; - $id = preg_replace('/[^\p{L}\p{N}_-]+/u', '-', $id) ?? $id; - $id = preg_replace('/-{2,}/', '-', $id) ?? $id; - $id = trim($id, '-'); + $id = $this->slug($text); + + if ($this->idTransformer !== null) { + $id = $this->slug(($this->idTransformer)($id)); + } if ($id !== '' && preg_match('/^\p{N}/u', $id)) { - $id = 'h-' . $id; + $id = 's-' . $id; } return $id; } + /** + * jgm/djot#393 slug step: replace each maximal run of non-alphanumeric ASCII + * with a single '-' and trim. Non-ASCII characters and letter case are kept. + */ + protected function slug(string $text): string + { + return trim(preg_replace('/[^0-9A-Za-z\x{0080}-\x{10FFFF}]+/u', '-', $text) ?? 
$text, '-'); + } + /** * Get plain text content of a node * diff --git a/tests/TestCase/DjotConverterTest.php b/tests/TestCase/DjotConverterTest.php index e5ab9633..05a34499 100644 --- a/tests/TestCase/DjotConverterTest.php +++ b/tests/TestCase/DjotConverterTest.php @@ -7,6 +7,7 @@ use Djot\DjotConverter; use Djot\Event\RenderEvent; use Djot\Exception\ParseException; +use Djot\Extension\AsciiHeadingIdsExtension; use Djot\Extension\HeadingLevelShiftExtension; use Djot\Extension\TabsExtension; use Djot\Node\Block\Heading; @@ -1919,15 +1920,17 @@ public function testUnicodeInHeading(): void $djot = '# 日本語の見出し'; $result = $this->converter->convert($djot); - // The visible heading text is unchanged; only the ID is made - // ASCII-safe so it survives being shared as a URL fragment. + // jgm/djot#393: non-ASCII heading text is preserved in the id. $this->assertStringContainsString('

日本語の見出し

', $result); - $this->assertStringNotContainsString('id="日本語の見出し"', $result); - $this->assertMatchesRegularExpression('/
/', $result); + $this->assertStringContainsString('id="日本語の見出し"', $result); + // AsciiHeadingIdsExtension folds the id to ASCII for URL/CSS portability. + $asciiConverter = new DjotConverter(); + $asciiConverter->addExtension(new AsciiHeadingIdsExtension()); + $ascii = $asciiConverter->convert($djot); + $this->assertMatchesRegularExpression('/
/', $ascii); if (class_exists(Transliterator::class)) { - // With ext-intl the CJK heading is romanized rather than dropped. - $this->assertStringContainsString('
', $result); + $this->assertStringContainsString('
', $ascii); } } diff --git a/tests/TestCase/Extension/AsciiHeadingIdsExtensionTest.php b/tests/TestCase/Extension/AsciiHeadingIdsExtensionTest.php new file mode 100644 index 00000000..4e794a13 --- /dev/null +++ b/tests/TestCase/Extension/AsciiHeadingIdsExtensionTest.php @@ -0,0 +1,42 @@ +convert("# über café\n"); + + $this->assertStringContainsString('
', $html); + } + + public function testExtensionFoldsHeadingIdToAscii(): void + { + $converter = new DjotConverter(); + $converter->addExtension(new AsciiHeadingIdsExtension()); + + $html = $converter->convert("# über café\n"); + + $this->assertStringContainsString('
', $html); + } + + public function testExtensionKeepsImplicitReferenceInParity(): void + { + // The folded id must also be used by the `[Heading][]` link target, so the + // anchor still resolves (parser/renderer parity). + $converter = new DjotConverter(); + $converter->addExtension(new AsciiHeadingIdsExtension()); + + $html = $converter->convert("# über café\n\nsee [über café][]\n"); + + $this->assertStringContainsString('
', $html); + $this->assertStringContainsString('href="#uber-cafe"', $html); + } +} diff --git a/tests/TestCase/Extension/HeadingReferenceExtensionTest.php b/tests/TestCase/Extension/HeadingReferenceExtensionTest.php index 15cfcc6a..46902dfc 100644 --- a/tests/TestCase/Extension/HeadingReferenceExtensionTest.php +++ b/tests/TestCase/Extension/HeadingReferenceExtensionTest.php @@ -174,14 +174,15 @@ public function testHeadingWithSmartQuotesMatchesStraightQuoteReference(): void // The parser converts straight quotes to smart quotes in heading text, // but reference targets keep straight quotes. The extension normalizes - // quotes for matching so this should resolve correctly. + // quotes for matching so the reference still resolves; the resulting id + // keeps the smart quotes (jgm/djot#393 preserves non-ASCII characters). $html = $converter->convert(<<<'DJOT' See [[Say "Hello"]]. # Say "Hello" DJOT); - $this->assertStringContainsString('href="#Say-Hello"', $html); + $this->assertStringContainsString('href="#Say-“Hello”"', $html); $this->assertStringNotContainsString('[[Say "Hello"]]', $html); } @@ -210,7 +211,7 @@ public function testHeadingWithApostropheResolvesCorrectly(): void # Bob's Guide DJOT); - $this->assertStringContainsString('href="#Bob-s-Guide"', $html); + $this->assertStringContainsString('href="#Bob’s-Guide"', $html); $this->assertStringNotContainsString('data-heading-ref=', $html); $this->assertStringNotContainsString('[[Bob\'s Guide]]', $html); } diff --git a/tests/TestCase/Renderer/HeadingIdTrackerTest.php b/tests/TestCase/Renderer/HeadingIdTrackerTest.php index 22a27ba0..16f32bb9 100644 --- a/tests/TestCase/Renderer/HeadingIdTrackerTest.php +++ b/tests/TestCase/Renderer/HeadingIdTrackerTest.php @@ -175,38 +175,49 @@ public function testNormalizeId(): void $this->assertSame('Multiple-Spaces', $this->tracker->normalizeId('Multiple Spaces')); $this->assertSame('this-t-key-params-fallback', $this->tracker->normalizeId("\$this->t(\$key, \$params = [], 
\$fallback = '')")); $this->assertSame('My-title', $this->tracker->normalizeId('My --- title')); - // Non-ASCII is transliterated to keep shared anchors link-safe; the - // Latin/Cyrillic output is deterministic with or without ext-intl. - $this->assertSame('Privet-mir', $this->tracker->normalizeId('Привет мир')); + // jgm/djot#393: non-ASCII is preserved (case kept), not transliterated. + $this->assertSame('Привет-мир', $this->tracker->normalizeId('Привет мир')); $this->assertSame('', $this->tracker->normalizeId('###')); - $this->assertSame('h-123-Things', $this->tracker->normalizeId('123 Things')); - $this->assertSame('h-1-Introduction', $this->tracker->normalizeId('1. Introduction')); + $this->assertSame('s-123-Things', $this->tracker->normalizeId('123 Things')); + $this->assertSame('s-1-Introduction', $this->tracker->normalizeId('1. Introduction')); } /** - * Pins djot-php's heading-ID behaviour around jgm/djot#391. - * - * djot-php replaces (not removes) mid-word punctuation, additionally - * replaces apostrophes / quotes / `;` / `:` so IDs are valid CSS - * identifiers, and transliterates non-ASCII to ASCII so the IDs survive - * being shared as URL fragments through auto-linkers. All cases below - * are deterministic with or without ext-intl. + * Pins djot-php's heading-ID behaviour to jgm/djot#393: each maximal run of + * non-alphanumeric ASCII is replaced with `-` and leading/trailing `-` are + * trimmed. Case and non-ASCII characters (Cyrillic, accented Latin, smart + * quotes) are preserved; `_` is replaced (no longer an exception). A + * leading-digit result gets the `s-` prefix for CSS-selector safety + * (orthogonal to #393). ASCII-folding is opt-in via AsciiHeadingIdsExtension. 
*/ public function testNormalizeIdSpecAlignmentEdgeCases(): void { $this->assertSame('A-B-C', $this->tracker->normalizeId('A+B=C')); $this->assertSame('Emphasis-strong', $this->tracker->normalizeId('Emphasis/strong')); $this->assertSame('That-s-all', $this->tracker->normalizeId("That's all")); - $this->assertSame('That-s-all', $this->tracker->normalizeId('That’s all')); + $this->assertSame('That’s-all', $this->tracker->normalizeId('That’s all')); $this->assertSame('foo-bar', $this->tracker->normalizeId('foo...bar')); $this->assertSame('Uber-uns', $this->tracker->normalizeId('Uber uns')); - $this->assertSame('Uber-uns', $this->tracker->normalizeId('Über uns')); - $this->assertSame('cafe-resume', $this->tracker->normalizeId('café résumé')); - $this->assertSame('Strasse', $this->tracker->normalizeId('Straße')); - $this->assertSame('h-2024-recap', $this->tracker->normalizeId('2024 recap')); + $this->assertSame('Über-uns', $this->tracker->normalizeId('Über uns')); + $this->assertSame('café-résumé', $this->tracker->normalizeId('café résumé')); + $this->assertSame('Straße', $this->tracker->normalizeId('Straße')); + $this->assertSame('s-2024-recap', $this->tracker->normalizeId('2024 recap')); $this->assertSame('', $this->tracker->normalizeId('!!!')); } + /** + * An id transform (e.g. the one set by AsciiHeadingIdsExtension) is applied to + * the spec id; here it transliterates non-ASCII to ASCII for portability. 
+ */ + public function testAsciiHeadingIdsOptInTransliterates(): void + { + $transliterator = new AsciiTransliterator(); + $ascii = new HeadingIdTracker(static fn (string $id): string => $transliterator->transliterate($id)); + $this->assertSame('Privet-mir', $ascii->normalizeId('Привет мир')); + $this->assertSame('Uber-uns', $ascii->normalizeId('Über uns')); + $this->assertSame('cafe-resume', $ascii->normalizeId('café résumé')); + } + public function testGetPlainText(): void { $heading = new Heading(2); @@ -370,28 +381,29 @@ public function testSymbolsRetainedInPlainText(): void } /** - * djot.js keeps `_` (it is not in its punctuation denylist) and it is a - * valid CSS identifier character, so djot-php keeps it too. This pins the - * deliberate divergence from the looser #393 spec prose. + * jgm/djot#393 removes the per-character exceptions: `_` is non-alphanumeric + * ASCII, so it is replaced with `-` like any other punctuation. */ - public function testUnderscoreRetainedInId(): void + public function testUnderscoreReplacedInId(): void { $heading = new Heading(2); $heading->appendChild(new Text('foo_bar baz')); $id = $this->tracker->getIdForHeading($heading); - $this->assertSame('foo_bar-baz', $id); + $this->assertSame('foo-bar-baz', $id); } /** - * When transliteration removes the entire heading text (a script outside - * the baked map, no ext-intl), the heading must fall back to a stable - * generated `s-N` id — not the legacy `heading` sentinel. + * With an ASCII-folding id transform, when transliteration removes the entire + * heading text (a script outside the baked map, no ext-intl), the heading must + * fall back to a stable generated `s-N` id. (By default the non-ASCII text is + * preserved instead, per #393, so no fallback occurs.) 
*/ public function testHeadingThatTransliteratesToNothingGetsFallbackId(): void { - $tracker = new HeadingIdTracker(new AsciiTransliterator(useIntl: false)); + $transliterator = new AsciiTransliterator(useIntl: false); + $tracker = new HeadingIdTracker(static fn (string $id): string => $transliterator->transliterate($id)); $cjk = new Heading(2); $cjk->appendChild(new Text('日本語の見出し'));