ASCII-safe heading IDs (transliteration; intl + deterministic baked-map fallback) #183

Merged: dereuromark merged 1 commit into master from ascii-safe-heading-ids on May 20, 2026
@dereuromark dereuromark commented May 19, 2026

Draft. Makes auto-generated heading IDs ASCII-safe so they survive being shared as URL fragments. Companion to #182 (the spec-literal opposite, kept as a discussion artifact). Rebased on top of #184 (reservation-of-explicit-ids architecture), so this PR now sits cleanly as the link-safety layer on top of the dedupe-correctness layer.

The problem (status quo)

Heading IDs become URL fragments (https://example.com/page#Über uns). They leave the page and get pasted into chats, email, other markdown, issue trackers and social posts: anywhere a URL is detected heuristically by an auto-linker rather than parsed by a real URL parser. Non-ASCII fragments break in three different ways at that boundary (truncation, inconsistent percent-encoding, and Unicode re-normalization), and most users discover this only after the link is sent:

| Source heading (djot) | Status quo ID (master / djot.js) | What happens when the URL is shared |
|---|---|---|
| # Über uns | id="Über-uns", …#Über-uns | Many autolinkers stop the URL at the first non-ASCII byte → recipient gets …# (silent dead link, looks fine to sender) |
| # café résumé | id="café-résumé", …#café-résumé | Same truncation, plus mail clients that do encode produce …#caf%C3%A9-r%C3%A9sum%C3%A9, which is verbose and sometimes truncated again by length caps |
| # Bob's Guide (smart-punct on) | id="Bob’s-Guide", …#Bob’s-Guide | ’ (U+2019) is percent-encoded inconsistently across receivers; sender's NFC ≠ receiver's NFD on some macOS/iOS roundtrips → fragment doesn't match the page's id → silent scroll fail |
| # Привет мир | id="Привет-мир", …#Привет-мир | Encoded fragment is …#%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82-%D0%BC%D0%B8%D1%80, which works in browsers but trips length caps in chat previews, and <a>-link autodetection in plain-text aggregators frequently misses it entirely |
| # 日本語の見出し | id="日本語の見出し", …#日本語の見出し | Worst case: routinely mangled across messaging apps, treated as non-URL by Slack/Discord plain-text URL detection, and CJK input methods on the receiving end can re-normalize the codepoints |

The CSS angle is the secondary failure mode. #Über-uns is a valid CSS identifier (querySelector works after CSS.escape()), but smart quotes, colons and apostrophes are not, which is why master already replaces ', ", ; and : with -. The dominant failure for a markup language is the share-and-click path described above.

The status quo is therefore getElementById-safe but share-unsafe. Reducing IDs to ASCII is the only fix that doesn't depend on every downstream tool getting URL detection right.
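
To make the encoding row of the table concrete, here is a minimal sketch (not part of the PR) that uses PHP's built-in rawurlencode() to show what a receiver that does percent-encode would produce. The function name fragmentReport is hypothetical and exists only for illustration.

```php
<?php
/**
 * Illustration only: report how a heading id looks once a client
 * percent-encodes it as a URL fragment.
 *
 * @return array{encoded: string, isAscii: bool}
 */
function fragmentReport(string $id): array
{
    return [
        // rawurlencode() encodes every byte outside the unreserved set.
        'encoded' => rawurlencode($id),
        // Byte-wise check: true only if no byte is >= 0x80.
        'isAscii' => preg_match('/^[\x00-\x7F]*$/', $id) === 1,
    ];
}

$before = fragmentReport('café-résumé');
// $before['encoded'] === 'caf%C3%A9-r%C3%A9sum%C3%A9', $before['isAscii'] === false

$after = fragmentReport('cafe-resume');
// $after['encoded'] === 'cafe-resume', $after['isAscii'] === true
```

An ASCII id is its own encoded form, so there is nothing for an auto-linker or mail client to get wrong.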

What this PR does

| Source heading | Result with this PR |
|---|---|
| # Über uns | id="Uber-uns" |
| # café résumé | id="cafe-resume" |
| # Bob's Guide (smart quotes) | id="Bob-s-Guide" |
| # Привет мир | id="Privet-mir" |
| # 日本語の見出し (with ext-intl) | id="Ri-ben-yuno-jian-chushi" (romanized) |
| # 日本語の見出し (no ext-intl) | id="s-N" (stable generated id) |

All ASCII, all URL-safe in every encoding, all stable across copy-paste and autolinkers.

Implementation

  • New Djot\Renderer\AsciiTransliterator:
    • ICU Transliterator (Any-Latin; Latin-ASCII) when ext-intl is present — also romanizes non-Latin scripts (CJK, Cyrillic, …).
    • Baked Unicode→ASCII map (src/Renderer/ascii_translit_map.php) as the deterministic fallback when ext-intl is absent.
  • The map is generated from the same ICU transform (bin/gen-translit-map.php), and a script is baked only if every code point in it transliterates context-free (verified standalone, doubled, and between Latin letters). Latin / Cyrillic / punctuation / smart quotes / dashes → byte-identical with or without ext-intl. Greek (context-sensitive in ICU: αυ → au but υ → y) → excluded wholesale; degrades like CJK.
  • HeadingIdTracker::normalizeId() transliterates first, then runs the existing pipeline (keeps _, h- digit prefix). $baseId === '' (post-transliterate empty, post-normalize empty, or all-ASCII-punctuation) routes to the existing s-N fallback, which (since #184, "Reserve explicit ids before auto heading ids; rewrite refs post-parse") skips reserved s-N ids so it never collides with an explicit {#s-N}.
  • ext-intl added as composer suggest (recommended, not required).
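
The two-tier engine choice described above can be sketched roughly as follows. This is not the PR's AsciiTransliterator: the function names and the tiny SKETCH_TRANSLIT_MAP are hypothetical stand-ins, and the real baked map is generated by bin/gen-translit-map.php. Only Transliterator::create() and transliterate() are real ext-intl calls.

```php
<?php
/**
 * Hypothetical miniature of a baked map (assumption for illustration;
 * the real map lives in src/Renderer/ascii_translit_map.php).
 */
const SKETCH_TRANSLIT_MAP = [
    'Ü' => 'U', 'ü' => 'u', 'é' => 'e',
    'П' => 'P', 'р' => 'r', 'и' => 'i', 'в' => 'v', 'е' => 'e', 'т' => 't',
];

/** Remove every code point outside ASCII; returns '' on invalid UTF-8. */
function dropNonAscii(string $text): string
{
    return preg_replace('/[^\x00-\x7F]/u', '', $text) ?? '';
}

/** Fallback tier: map lookup, then drop whatever the map does not cover. */
function transliterateWithMap(string $text): string
{
    return dropNonAscii(strtr($text, SKETCH_TRANSLIT_MAP));
}

/** Preferred tier: ICU when ext-intl is loaded and usable, else the map. */
function toAscii(string $text): string
{
    if (class_exists('Transliterator')) {
        $icu = Transliterator::create('Any-Latin; Latin-ASCII');
        // create() can return null on a broken ICU; fall through to the map.
        if ($icu !== null) {
            $result = $icu->transliterate($text);
            if ($result !== false) {
                return dropNonAscii($result);
            }
        }
    }

    return transliterateWithMap($text);
}
```

In this sketch, text in a script the map does not cover comes back empty from the fallback tier, which is the case that routes to the s-N id.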

Cross-environment determinism

The whole point of the PR is stable shared links, so the engine choice must not change the rendered id for the same input. The generator's all-or-nothing-per-script gate guarantees that — pinned by AsciiTransliteratorTest::testCoveredRangesAreEngineIndependent, which runs every covered case through both engines and asserts byte-equality. Exotic scripts (CJK, Arabic, Greek) deliberately diverge by tier (intl romanizes; fallback emits s-N); their exact intl output is gated on ext-intl in tests so CI is green with or without it.
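
The all-or-nothing-per-script gate can be pictured with the sketch below. It assumes the three probes named in the description (standalone, doubled, between Latin letters); isContextFree, scriptIsBakeable and the toy transform are hypothetical names, and the toy rule is not ICU's real Greek rule set.

```php
<?php
/**
 * A code point counts as context-free if its transliteration is the same
 * standalone, doubled, and between Latin letters.
 *
 * @param callable(string): string $transform
 */
function isContextFree(callable $transform, string $char): bool
{
    $alone = $transform($char);

    return $transform($char . $char) === $alone . $alone
        && $transform('a' . $char . 'b') === 'a' . $alone . 'b';
}

/**
 * A script is baked only if every one of its code points passes the gate.
 *
 * @param callable(string): string $transform
 * @param list<string> $chars
 */
function scriptIsBakeable(callable $transform, array $chars): bool
{
    foreach ($chars as $char) {
        if (!isContextFree($transform, $char)) {
            return false;
        }
    }

    return true;
}

// Toy transform (assumption): "υ" becomes "u" after a/e/o, otherwise "y".
$toy = static function (string $s): string {
    $s = preg_replace('/(?<=[aeo])υ/u', 'u', $s) ?? $s;

    return strtr($s, ['υ' => 'y', 'ü' => 'u']);
};

$latinOnly = scriptIsBakeable($toy, ['ü']);      // true
$withUpsilon = scriptIsBakeable($toy, ['ü', 'υ']); // false: "aυb" gives "aub", not "ayb"
```

A single context-sensitive code point rejects the whole script, which is why the covered ranges can be asserted byte-equal across both engines.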

Relationship to #184 (now in master)

#184 landed the architectural fix this PR previously documented as an out-of-scope follow-up:

  • Every explicit {#id} (heading or non-heading, anywhere in the document) is reserved up-front via reserveExplicitIds.
  • The s-N fallback dedupes against usedIds (do { } while (isset(...))), so a heading that transliterates-or-normalizes to empty never collides with an explicit {#s-N}.
  • The implicit-reference map and any built reference Link/Image nodes are re-targeted post-parse against the AST-correct deduped ids, so [Heading][] / ![Heading][] keep pointing at the renderer-visible section id even when transliteration changes that id (# Bob's Guide → Bob-s-Guide, ref auto-rewritten to #Bob-s-Guide).

Net effect: link-safety (this PR) and dedupe-correctness (#184) compose cleanly, no overlap, no leftover edges.
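
How reservation and the s-N fallback interact can be sketched as below. SketchIdTracker and its methods are hypothetical and much simpler than HeadingIdTracker; the counter-based numbering is an assumption, and only the do { } while (isset(...)) shape is taken from the description.

```php
<?php
final class SketchIdTracker
{
    /** @var array<string, true> ids already taken, explicit or generated */
    private array $usedIds = [];

    private int $fallbackCounter = 0;

    /** Reserve an explicit {#id} before any auto id is assigned. */
    public function reserve(string $id): void
    {
        $this->usedIds[$id] = true;
    }

    /** Generate the next s-N id that no explicit or earlier id occupies. */
    public function nextFallbackId(): string
    {
        do {
            $id = 's-' . ++$this->fallbackCounter;
        } while (isset($this->usedIds[$id]));

        $this->usedIds[$id] = true;

        return $id;
    }
}

$tracker = new SketchIdTracker();
$tracker->reserve('s-1');              // document contains an explicit {#s-1}
$first = $tracker->nextFallbackId();   // 's-2', because 's-1' is reserved
$second = $tracker->nextFallbackId();  // 's-3'
```

Because explicit ids are reserved first, a heading that transliterates to an empty base id never lands on an id the author wrote by hand.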

Tests / docs

  • TDD throughout. AsciiTransliteratorTest covers engine parity, ASCII invariant, fallback-drop, intl romanization, non-ASCII separator preservation (U+3000/U+3001), and the broken-ICU Transliterator::create()→null path.
  • HeadingIdTrackerTest and DjotConverterTest::testUnicodeInHeading updated to ASCII-safe expectations (engine-robust, intl-gated where exotic-script romanization is involved).
  • docs/reference/enhancements.md "CSS-Safe Heading IDs" section rewritten — adds "Why ASCII", the engine/determinism note, and a revised Spec Alignment table.

Verification

  • phpunit: 2192 tests, 5446 assertions; full suite green (rebased onto master after #184, "Reserve explicit ids before auto heading ids; rewrite refs post-parse").
  • phpstan: no errors.
  • phpcs: clean.
  • codex review: six iterations across the original branch; every real finding fixed with TDD (intl-null map fallback, separator boundaries, context-sensitive Greek, parser/render parity for the empty-fallback case). The remaining finding from the rebased branch (smart-punctuation ["Foo"][] claim) was verified to be a false positive — heading text and link text go through the same inline-parser pass, so the keys match by construction.

@dereuromark added the documentation and enhancement labels May 19, 2026
codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 99.89328% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.98%. Comparing base (7cd1cff) to head (71500a0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Renderer/AsciiTransliterator.php | 96.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #183      +/-   ##
============================================
+ Coverage     91.12%   91.98%   +0.86%     
- Complexity     3354     3364      +10     
============================================
  Files            99      101       +2     
  Lines          8575     9508     +933     
============================================
+ Hits           7814     8746     +932     
- Misses          761      762       +1     

Heading IDs become URL fragments that get copied into chat, email and
other documents, where auto-linkers re-detect the URL heuristically.
Non-ASCII fragments are routinely truncated, percent-encoded
inconsistently, or re-normalized so the pasted fragment no longer
matches the page's id. Reducing IDs to ASCII keeps shared deep links
robust.

Add AsciiTransliterator: ICU Transliterator (Any-Latin; Latin-ASCII)
when ext-intl is available, otherwise a baked Unicode-to-ASCII map.
The map is generated by bin/gen-translit-map.php from the same ICU
transform, and a script is baked only if every code point in it is
context-free (verified standalone, doubled, and between Latin
letters). So Latin, Cyrillic, punctuation, smart quotes and dashes
are byte-identical with or without ext-intl, keeping anchors stable
across environments. Context-sensitive scripts (Greek) and non-Latin
scripts (CJK, Arabic) are romanized with ext-intl and otherwise fall
back to a generated s-N id, exactly like an empty heading.

normalizeId() transliterates first; the existing CSS-safe pipeline is
unchanged. normalizeId() now returns '' (instead of a literal
sentinel) when nothing usable remains, and the caller falls back to a
purely order-based s-N id so the implicit-reference pass and the
renderer stay consistent.

ext-intl added as a composer suggest (recommended for non-Latin
romanization, not required).
@dereuromark dereuromark force-pushed the ascii-safe-heading-ids branch from 4b664d8 to 71500a0 Compare May 20, 2026 00:26
@dereuromark dereuromark merged commit 5017e30 into master May 20, 2026
6 checks passed
@dereuromark dereuromark deleted the ascii-safe-heading-ids branch May 20, 2026 02:42