Specify punctuation handling for auto-generated heading IDs #393

Open
karlb wants to merge 1 commit into jgm:main from karlb:heading-auto-id

karlb commented May 18, 2026

Replace each maximal run of non-alphanumeric ASCII characters with - and trim leading/trailing -. This settles remove-vs-replace as replace and removes the previous exceptions for individual punctuation characters (#391).
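
A minimal Python sketch of the rule as worded above, assuming "alphanumeric" means exactly `0-9`, `A-Z`, and `a-z`. The function name `auto_heading_id` is hypothetical and is not taken from djot.js or the spec.

```python
import re

# Every ASCII code point that is not 0-9, A-Z or a-z (whitespace and '_' included).
# Code points >= 0x80 are outside the class, so they are never touched.
_NON_ALNUM_ASCII_RUN = re.compile(r"[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")


def auto_heading_id(text: str) -> str:
    """Replace each maximal run of non-alphanumeric ASCII with '-', then trim '-'."""
    return _NON_ALNUM_ASCII_RUN.sub("-", text).strip("-")


print(auto_heading_id("Bob's Guide"))  # Bob-s-Guide
print(auto_heading_id("Über uns"))     # Über-uns
print(auto_heading_id("a--b"))         # a-b
```

Note that a heading made only of ASCII punctuation yields the empty string under this sketch, so a separate fallback is still needed.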

@dereuromark

dereuromark commented May 19, 2026

Contributor

Non-ASCII punctuation / smart quotes in IDs — intended?

"non-alphanumeric ASCII" only targets ASCII, so any non-ASCII code point is preserved — including non-ASCII punctuation. With smart punctuation enabled (the default in many setups), straight quotes/apostrophes become U+2019 / U+201C / U+201D before ID generation, so:

| Heading (smart punctuation on) | Resulting ID (prose and djot.js) |
| --- | --- |
| # Bob's Guide | Bob’s-Guide |
| # Say "Hello" | Say-“Hello” |
| # Pros — and cons | Pros-—-and-cons |

These are valid CSS identifiers (code points ≥ U+0080 are fine), and consistent reference resolution still works, but Bob’s-Guide as a URL fragment is unexpected for very common input — and it's almost certainly not what the "non-alphanumeric ASCII" wording was trying to express. The intent of that phrasing seems to be "preserve non-ASCII letters" (international headings like # Über uns, # 日本語の見出し), not "preserve smart quotes and dashes".

Would it be worth distinguishing non-ASCII letters/digits (preserved) from non-ASCII punctuation and symbols (replaced like their ASCII counterparts)? That keeps # Über uns → Über-uns while giving # Bob's Guide → Bob-s-Guide, which matches what most users expect from a heading anchor.
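
To make the proposed distinction concrete, here is a rough Python sketch that keeps letters and digits of any script and treats everything else as a separator. It is only an illustration: `unicode_aware_id` is a hypothetical name, and it leans on Python's built-in `str.isalnum`, which consults the Unicode character database.

```python
def unicode_aware_id(text: str) -> str:
    """Keep letters/digits of any script; collapse every other run to a single '-'."""
    out = []
    pending_sep = False
    for ch in text:
        if ch.isalnum():  # Unicode-aware: true for letters and digits of any script
            if pending_sep and out:
                out.append("-")  # emit one '-' per run, never at the start
            pending_sep = False
            out.append(ch)
        else:
            pending_sep = True   # trailing runs are never flushed, so they are trimmed
    return "".join(out)


print(unicode_aware_id("Bob’s Guide"))      # Bob-s-Guide
print(unicode_aware_id("Über uns"))         # Über-uns
print(unicode_aware_id("Pros — and cons"))  # Pros-and-cons
```

One caveat of this approach: combining marks are not alphanumeric, so a heading stored in decomposed (NFD) form would be split at each accent.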

An important consideration: links copied and shared via websites, tools, or the like will usually be "auto-linked".
But with non-trivial anchor links, this can easily produce a broken link (cut off partway through).

When an anchor leaves the page (copied into Slack/email/a chat/another markdown doc), it gets auto-linked by a heuristic, not a parser.
Non-ASCII in the fragment breaks this three ways:

  1. Truncation. Many autolinkers stop the URL at the first "non-URL-looking" byte. …/page#Bob’s-Guide → the link often ends at #Bob (the ’
    is the cut point). Silent broken link, looks fine to the sender.
  2. Percent-encode bloat + cut-off. Tools that do encode turn ’ into %E2%80%99, 日 into %E6%97%A5. A CJK heading becomes a long %XX%XX%XX…
    blob; some tools cap autolink length or break on the %. #Bob’s-Guide → #Bob%E2%80%99s-Guide.
  3. Normalization mismatch. The sender's ’ (U+2019) or an NFC/NFD difference gets re-normalized differently by the receiving app, so the
    pasted anchor no longer === the page's id. Scroll silently fails.
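
The three failure modes above can be reproduced with the Python standard library. This is a sketch under assumptions: the `naive_autolink` pattern is a made-up heuristic standing in for a chat client's autolinker, and `example.com` is a placeholder URL.

```python
import re
import unicodedata
from urllib.parse import quote

# 1. Truncation: a hypothetical autolinker that only accepts printable ASCII.
naive_autolink = re.compile(r"https?://[!-~]+")
m = naive_autolink.match("https://example.com/page#Bob\u2019s-Guide")
print(m.group(0))                      # https://example.com/page#Bob

# 2. Percent-encoding: each non-ASCII character becomes three %XX groups here.
encoded = quote("Bob\u2019s-Guide")
print(encoded)                         # Bob%E2%80%99s-Guide
print(quote("\u65e5"))                 # %E6%97%A5

# 3. Normalization: the same visible text, two different code point sequences.
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")
print(nfc == nfd, len(nfc), len(nfd))  # False 4 5
```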

As such, the safest option would be something at least close to ASCII.

Minor: empty / all-punctuation headings

The spec doesn't state what happens when the normalized text is empty (# !!!, # ###, an empty heading). djot.js falls back to the s / s-1 family; the prose says nothing. A one-line clarification ("if the result is empty, use s plus a numeric suffix") would let ports match without reverse-engineering the reference.
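
A small Python sketch of what such a fallback could look like. The exact behavior of djot.js is not quoted in this thread, so the details here (base `s`, then `-1`, `-2`, ... on collision) are an assumption based on the "s / s-1 family" description, and `IdAllocator` is a hypothetical name.

```python
class IdAllocator:
    """Hands out unique ids; an empty base falls back to 's' (assumed behavior)."""

    def __init__(self) -> None:
        self._used = set()

    def allocate(self, base: str) -> str:
        if base == "":
            base = "s"          # fallback for empty or all-punctuation headings
        candidate = base
        n = 0
        while candidate in self._used:
            n += 1
            candidate = f"{base}-{n}"  # numeric suffix on collision
        self._used.add(candidate)
        return candidate


alloc = IdAllocator()
print([alloc.allocate(""), alloc.allocate(""), alloc.allocate("Intro")])
# ['s', 's-1', 'Intro']
```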

Downstream context (why we care)

In djot-php, heading IDs are consumed by querySelector(...) (HTMX scroll restoration, anchor lookups, JS-driven TOCs). getElementById tolerates almost anything, but querySelector('#…') does not without CSS.escape(), which in practice nobody does. So we already deliberately deviate: we replace the ' " ; : set, prefix a leading digit with h-, and use an s-N empty fallback — all documented as intentional. We're not asking the spec to adopt our deviations; we're asking for the prose and reference to agree, and for (2) to be pinned down, so the ports can converge instead of each carrying a different asterisk.

TLDR
I lean more towards php-collective/djot-php#183.

@karlb

karlb commented May 20, 2026

Author

> With smart punctuation enabled (the default in many setups), straight quotes/apostrophes become U+2019 / U+201C / U+201D before ID generation

I don't think this should be the case. From the spec:

> The identifier is formed by taking the plain text content of the heading

I read that as taking the input from the source before any further processing and it also seems to be what djot.js does.

> Would it be worth distinguishing non-ASCII letters/digits (preserved) from non-ASCII punctuation and symbols (replaced like their ASCII counterparts)?

I would try hard not to interpret unicode unless strictly necessary. AFAICS, Djot avoided doing that successfully so far.

> When an anchor leaves the page (copied into Slack/email/a chat/another markdown doc), it gets auto-linked by a heuristic, not a parser.

When I copy non-ASCII links from my browser, they get percent encoded. Which case do you have in mind that breaks? Copying from the HTML source?

@jgm

jgm commented May 20, 2026

Owner

Yes, see design goal #6 in the README.

@dereuromark

Contributor

> When I copy non-ASCII links from my browser, they get percent encoded

The length increases dramatically, to the point that it's unreadable.

> Yes, see design goal #6 in the README.

Indeed. For any internal stuff I agree. But here, we deal with external (web) URLs and we need to be able to allow copying them and distributing them through a variety of outside systems without breaking too easily.

@karlb

karlb commented May 22, 2026

Author

> > When I copy non-ASCII links from my browser, they get percent encoded
>
> The length increases dramatically, to the point that it's unreadable.

Which alternative do you prefer? Replace all non-ASCII characters, so that CJK languages just get s-1, s-2, etc as heading ids? That would safely pass through all systems, but would lack any information about the link target and could silently link to wrong items when the heading structure changes. I would say showing percent-encoded IDs is undesirable but keeping the characters will give safer and more informative (when decoded) IDs. So I wonder, how many cases are there where:

  • non-ASCII links are correctly used and shown to the user without encoding
  • percent-encoded links are converted back to readable text (e.g. when following a link in the browser the URL is shown decoded)
    because these are the cases where keeping the non-ASCII part of the heading IDs would be beneficial.

How relevant is this for non-HTML output? Any specific formats that have user-visible heading IDs that we should keep in mind?

With design goal 6, I don't see much choice other than keeping all or replacing all non-ASCII. So far, I still prefer the PR as-is (keep non-ASCII).

@dereuromark

dereuromark commented May 22, 2026

Contributor

> Which alternative do you prefer? Replace all non‑ASCII … so that CJK languages just get s-1, s-2?

I don't think that's the only alternative to "keep everything". There's a third option that I think is actually the right one and it's what we ended up landing in php‑djot#183: transliteration. Concretely, run Any-Latin; Latin-ASCII (or its baked, table‑driven equivalent for environments without ICU), then run your existing ASCII rule on top:

| Heading | Spec‑literal id (this PR) | Transliterated id |
| --- | --- | --- |
| # Über uns | Über-uns | Uber-uns |
| # café résumé | café-résumé | cafe-resume |
| # Привет мир | Привет-мир | Privet-mir |
| # 日本語の見出し | 日本語の見出し | ri-ben-yuno-jian-chushi (intl) / s-N (fallback) |
| # Bob's Guide (smart‑punct on) | Bob’s-Guide | Bob-s-Guide |

That keeps the information content of the heading (so refs are still meaningful and stable across reordering, unlike s-N), and it makes the resulting URL fragment survive the share‑and‑click path that an anchor is for. We baked the Latin/Cyrillic/punctuation subset into a static table generated from ICU and pinned engine parity in tests, so that part is byte‑deterministic with or without ext-intl — which is the property you'd want the spec to give you anyway. (Romanizing exotic scripts like CJK is the one tier that still needs ICU at runtime and tracks the ICU version; without it those headings fall back to s-N.)
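
As an illustration of the "table first, ASCII rule second" pipeline described above, here is a Python sketch. The table is a tiny hand-written stand-in, not the ICU-generated table from the djot-php change, and `transliterated_id` is a hypothetical name. Dropping characters that have no table entry is also an assumption.

```python
import re

# Illustrative subset only; a real table would be generated, not typed by hand.
TRANSLIT = {
    "\u00dc": "U",   # Ü
    "\u00e9": "e",   # é
    "\u2019": "'",   # right single quotation mark
    "\u041f": "P",   # Cyrillic П
    "\u0440": "r",   # Cyrillic р
    "\u0438": "i",   # Cyrillic и
    "\u0432": "v",   # Cyrillic в
    "\u0435": "e",   # Cyrillic е
    "\u0442": "t",   # Cyrillic т
    "\u043c": "m",   # Cyrillic м
}

# After folding, anything that is still not an ASCII letter or digit is a separator.
_RUN = re.compile(r"[^0-9A-Za-z]+")


def transliterated_id(text: str) -> str:
    folded = "".join(TRANSLIT.get(ch, ch) for ch in text)
    return _RUN.sub("-", folded).strip("-")


print(transliterated_id("Über uns"))      # Uber-uns
print(transliterated_id("café résumé"))   # cafe-resume
print(transliterated_id("Привет мир"))    # Privet-mir
print(transliterated_id("Bob’s Guide"))   # Bob-s-Guide
```

With this sketch a heading whose characters are all missing from the table comes out empty, which is where an s-N style fallback would take over.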

> How many cases are there where non‑ASCII links are correctly used and shown to the user without encoding / percent‑encoded links are converted back to readable text?

The honest framing: non‑ASCII anchors work in the browser. The browser URL bar percent‑encodes on copy and decodes on display, and getElementById doesn't care. So if your use case is "click anchor on the page I'm currently reading", everything is fine and there's nothing to fix.

But that's not the primary use case of a heading id. The primary use case is the link leaving the page: pasted into Slack/Discord/Teams/WhatsApp, into an email body, into a Jira/Linear/Asana comment, into another markdown doc, into an LLM chat, into a tweet. None of those parse URLs — they autolink heuristically. Failure modes I've actually hit (not theoretical):

  • Truncation at the first non‑ASCII byte (…#Bob for …#Bob’s-Guide): silent dead link, looks correct to the sender.
  • Percent‑encoded blob length caps in chat previews (CJK headings become 30‑char %E…%E… runs and get cut).
  • NFC/NFD round‑trips between macOS sender and Linux/Windows receiver where ’ (U+2019) on one side ≠ ' on the other side, so the fragment no longer equals any id on the page. Scroll silently fails.
  • AI ingestion: an LLM asked to "open" a pasted URL will frequently chop the non‑ASCII tail off the fragment when it normalizes the string, again silently.

So the set of cases where the non‑ASCII id is beneficial is narrow (browser address‑bar round‑trip on the same page). The set where it's harmful is "every time the link is shared", which is most of the lifetime of a deep link.

> How relevant is this for non‑HTML output?

Quite. A few I've run into:

  • LaTeX: \label/\ref keys are conventionally ASCII; non‑ASCII compiles only with inputenc+hyperref configured exactly right, and named PDF destinations are ASCII by convention.
  • EPUB / OPF: id is an XML NCName — which technically permits many Unicode letters, so it's a tolerance problem, not a validity one: most readers cope, several don't, and EPUBcheck warns on some.
  • DocBook / Pandoc roundtrips: cross‑refs get mangled or warned about.
  • Sitemaps / RSS / Atom: URLs must be IRI‑normalized; ids that don't percent‑encode cleanly produce validator warnings.

It's the same underlying issue as the share case: anything that isn't a browser tends to treat the fragment as ASCII or warn.

> With design goal 6, I don't see much choice other than keeping all or replacing all non‑ASCII.

I'd push back gently on the framing. The current rule already implicitly interprets Unicode — "non‑alphanumeric ASCII" means "leave the entire non‑ASCII Unicode space alone", which is itself a classification decision; it's just the most permissive one. So the spec already takes a position on Unicode, it just takes the easiest one.

Two ways forward that don't violate goal 6:

  1. Implementation‑defined post‑step. Mandate the ASCII rule (this PR), and say explicitly that implementations may transliterate or strip non‑ASCII before applying it. That keeps the spec Unicode‑free while letting ports converge on portable output. Right now every port is going to deviate silently; making the deviation legible would help interop more than nailing it down would.
  2. Static, published table. If you ever do want to specify it, ICU's Any-Latin; Latin-ASCII is a fixed published transform — you can bake it into a table and ship it as a normative file the same way Unicode itself ships data files. The spec text stays "apply this table, then the ASCII rule"; the spec doesn't have to know about Unicode beyond "here is a finite map".

> The spec doesn't state what happens when the normalized text is empty (# !!!, # ###, an empty heading).

Worth resolving on this PR regardless of which direction you take on the non‑ASCII question — it's a real interop gap. One‑liner like "if the resulting id is empty, implementations generate a fallback id such as s followed by the heading's index" matches what djot.js does today and lets ports stop reverse‑engineering it.


Net of all that: I'd still gently lean against merging this PR #393 as written, because by specifying only the ASCII characters it leaves "preserve all non‑ASCII punctuation" as the de‑facto normative behavior — which is the one piece I think is unambiguously not what most users want. If you want the spec to stay Unicode‑agnostic, leaving non‑ASCII handling implementation‑defined (option 1 above) is the lighter‑touch fix and lets transliterating ports like php‑djot stay conformant.

So 👍 from me.

@karlb

karlb commented May 22, 2026

Author

This PR does not change anything about non-ASCII characters compared to the previous version. So we can merge it without any regression in that area, if we agree on the ASCII part. That should help to keep the discussion focused.

@jgm

jgm commented May 22, 2026

Owner

> I'd push back gently on the framing. The current rule already implicitly interprets Unicode — "non‑alphanumeric ASCII" means "leave the entire non‑ASCII Unicode space alone", which is itself a classification decision; it's just the most permissive one. So the spec already takes a position on Unicode, it just takes the easiest one.

To recognize whether a character is in the ASCII set, you just need <= 127, and you don't even need to recognize Unicode code points; assuming UTF-8, you can just check the byte. Recognizing character classes (e.g. telling alphanumerics from symbols) is an entirely different matter. Let alone transliteration! (Even with all the Unicode data, that will only help you with accented characters, not with Russian, Chinese, Greek, or even Turkish.)

Yes, it is the easiest classifying position. That is the design goal -- to be easy -- not to require parsers to have access to huge tables of Unicode data, or libraries that include these.
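
The byte-level point can be sketched in Python as follows. This is an illustration only, with the hypothetical name `auto_heading_id_bytes`: it walks the UTF-8 bytes, copies every byte above 127 through untouched, and never looks at code points or character classes beyond the three ASCII ranges.

```python
def auto_heading_id_bytes(text: str) -> str:
    """Apply the ASCII run rule on raw UTF-8 bytes, with no Unicode data."""
    out = bytearray()
    pending_sep = False
    for b in text.encode("utf-8"):
        is_ascii = b <= 127
        is_alnum = 0x30 <= b <= 0x39 or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A
        if is_ascii and not is_alnum:
            pending_sep = True      # part of a run of ASCII punctuation/space
            continue
        if pending_sep and out:
            out.append(0x2D)        # one '-' per run, none at the start
        pending_sep = False
        out.append(b)               # ASCII alphanumeric or any non-ASCII byte
    return out.decode("utf-8")


print(auto_heading_id_bytes("Über uns"))     # Über-uns
print(auto_heading_id_bytes("Bob’s Guide"))  # Bob’s-Guide
```

A separator is only ever inserted right after an ASCII byte was skipped, so multi-byte UTF-8 sequences are never split and the result decodes cleanly.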

I would say: as far as djot is concerned, identifiers with non-ASCII characters are fine. They are explicitly allowed in HTML. If your target format has stricter requirements, e.g. ASCII-only, or must start with a letter, you can easily handle this with an AST transformation or a transformation in the writer. (That's what pandoc does, for example, with LaTeX.)

I would also say that it is almost always better to use explicit identifiers if you plan to link to them or use them in formatting rules. That way, things won't break if you change the wording of the section title, or add another section ahead of it with the same title.

@jgm

jgm commented May 22, 2026

Owner

Another option would be to remove the description of automatic identifiers from the spec. This would allow implementations to provide such identifiers using whatever scheme makes most sense to them...or not to generate them at all (though it would still be necessary to generate identifiers for any sections that are linked to using the [Section title][] format). In this case, portable djot would need to use explicit identifiers whenever links to headings are used.

@dereuromark

dereuromark commented May 22, 2026

Contributor

I am with @karlb on the middle ground for now. It at least allows a bit less divergence across implementations.
So totally fine with the PR as is.

And true: for anything you link across systems, explicit {#id} is the robust answer regardless.

dereuromark added a commit to php-collective/djot-php that referenced this pull request Jun 6, 2026
…x to s-

Replace the baked-in asciiHeadingIds bool with a pluggable id transform on
HeadingIdTracker (Closure(string): string). The core stays pure jgm/djot#393
(unicode-preserving); ASCII folding now ships as AsciiHeadingIdsExtension, which
sets the transform on both the renderer's tracker and the parser's heading-reference
resolution so section ids and [Heading][] link targets stay in parity.

The transform runs over the spec id and is re-slugged afterwards, so a transform
that reintroduces separators (e.g. CJK romanization "ri ben yu") still yields a
clean id.

Also unify the leading-digit prefix from h- to s- so it matches the empty-heading
s-N fallback (one prefix convention).
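
The commit above is PHP, but the shape of the pluggable transform can be sketched in Python. Everything here is hypothetical (`spec_id`, `ascii_fold`, `heading_id`); the fold uses NFKD decomposition as a stand-in and is not the transliteration that djot-php ships. The empty-heading fallback is left out for brevity.

```python
import re
import unicodedata
from typing import Callable, Optional

_RUN = re.compile(r"[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")


def spec_id(text: str) -> str:
    """The ASCII run rule: non-alphanumeric ASCII runs become '-', then trim."""
    return _RUN.sub("-", text).strip("-")


def ascii_fold(ident: str) -> str:
    """Stand-in transform: strip accents via NFKD, drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", ident)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def heading_id(text: str, transform: Optional[Callable[[str], str]] = None) -> str:
    ident = spec_id(text)
    if transform is not None:
        ident = spec_id(transform(ident))  # re-slug: the transform may add separators
    if "0" <= ident[:1] <= "9":
        ident = "s-" + ident               # leading-digit prefix, as in the commit
    return ident


print(heading_id("Über café"))                         # Über-café
print(heading_id("Über café", ascii_fold))             # Uber-cafe
print(heading_id("日本語", lambda _s: "ri ben yu"))     # ri-ben-yu
print(heading_id("2024 Recap"))                        # s-2024-Recap
```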
dereuromark added a commit to php-collective/djot-php that referenced this pull request Jun 6, 2026
Rewrite the heading-id reference section for the jgm/djot#393 default (letter case
and non-ASCII preserved, s- prefix for leading digits, s-N empty fallback) and
document AsciiHeadingIdsExtension as the opt-in ASCII-folding path, including its
parser/renderer parity and the note that registration order does not matter.
dereuromark added a commit to php-collective/djot-php that referenced this pull request Jun 6, 2026
…ension (#224)

* Align heading-id slugs with jgm/djot#393; add opt-in asciiHeadingIds

Default heading-id generation now follows the settled #393 rule: replace each
maximal run of non-alphanumeric ASCII with a single '-' and trim leading/trailing
'-', preserving letter case and all non-ASCII characters. This drops the previous
always-on ASCII transliteration and the '_' exception, so 'Über café' becomes
'Über-café', 'under_score' becomes 'under-score', 'a--b' becomes 'a-b'. A
leading-digit result keeps the 'h-' prefix for CSS-selector safety (orthogonal to
#393, which governs punctuation only).

ASCII transliteration is now opt-in via a new asciiHeadingIds option on
DjotConverter, threaded to both the renderer and the parser's reference-resolution
pass so heading ids stay in parity. With it enabled, 'Über café' becomes
'uber-cafe'.

The official djot test suite is unchanged (its id cases are simple ASCII).

* Add asciiHeadingIds to DjotConverter constructor docblock (phpcs)

* Make ASCII heading ids a pluggable transform + extension; unify prefix to s-

Replace the baked-in asciiHeadingIds bool with a pluggable id transform on
HeadingIdTracker (Closure(string): string). The core stays pure jgm/djot#393
(unicode-preserving); ASCII folding now ships as AsciiHeadingIdsExtension, which
sets the transform on both the renderer's tracker and the parser's heading-reference
resolution so section ids and [Heading][] link targets stay in parity.

The transform runs over the spec id and is re-slugged afterwards, so a transform
that reintroduces separators (e.g. CJK romanization "ri ben yu") still yields a
clean id.

Also unify the leading-digit prefix from h- to s- so it matches the empty-heading
s-N fallback (one prefix convention).

* Fix @var annotation phpcbf mangled into invalid PHPStan syntax

* docs: heading-id #393 default + AsciiHeadingIdsExtension

Rewrite the heading-id reference section for the jgm/djot#393 default (letter case
and non-ASCII preserved, s- prefix for leading digits, s-N empty fallback) and
document AsciiHeadingIdsExtension as the opt-in ASCII-folding path, including its
parser/renderer parity and the note that registration order does not matter.

* Restore heading-id transformer Closure signature in a phpcbf-safe form

The headingIdTransformer property lost its (string): string signature when
phpcbf mangled the single-line annotation (6ea8855) into invalid syntax and the
workaround dropped to a bare \Closure|null. Restore the signature with the
two-line var + phpstan-var pattern already used by FrontmatterExtension: the
plain var stays \Closure|null (phpcbf leaves it untouched) and the parenthesized
phpstan-var carries the full (\Closure(string): string)|null. Verified phpcbf no
longer mangles it; phpcs and phpstan are clean.
dereuromark added a commit to markup-carve/carve that referenced this pull request Jun 6, 2026
Re-vendor carve-lib from the #393 carve-js build and regenerate the heading-id
corpus examples. Auto ids now preserve letter case and all non-ASCII characters,
replace each maximal run of non-alphanumeric ASCII with a single '-', and use an
's-' prefix for a leading digit (previously lowercase + ASCII-fold + 'section-').

Add a NORMATIVE heading-identifier rule to resources/grammar.ebnf and update the
case-study and feature docs. ASCII folding becomes an opt-in implementation mode
(carve-js 'asciiHeadingIds' parse option; carve-php 'AsciiHeadingIdsExtension').

Part of the heading-id slug resolution (#71).
dereuromark added a commit to markup-carve/carve that referenced this pull request Jun 6, 2026
#73)

#72 adopted jgm/djot#393's case-preserving ids, but case preservation broke
</#id> cross-references: a lowercase ref no longer matched a case-preserved id
(e.g. `# Getting Started` -> `Getting-Started`, so `</#getting-started>` failed).

Switch carve to LOWERCASE (Unicode-aware; non-ASCII characters preserved, only
their case folded) - the GitHub/SSG convention authors expect for anchors, which
makes ids and the common cross-reference case-insensitive with no special lookup.

Re-vendor the lowercase carve-lib, regenerate the heading-id corpus examples,
and update the normative grammar rule + case-study/feature docs. carve now
deliberately diverges from djot-php (case-preserving per #393); both keep the
opt-in ASCII fold (carve-js asciiHeadingIds; carve-php AsciiHeadingIdsExtension).

Refs #71.
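
A rough Python sketch of the lowercase variant described in this commit, assuming lowercasing is applied after the run replacement. `lowercase_id` is a hypothetical name and Python's `str.lower` stands in for whatever case folding carve uses.

```python
import re

_RUN = re.compile(r"[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")


def lowercase_id(text: str) -> str:
    """ASCII run rule, then Unicode-aware lowercasing; non-ASCII is otherwise kept."""
    return _RUN.sub("-", text).strip("-").lower()


# Ids generated for the headings of a page.
page_ids = {lowercase_id(h) for h in ["Getting Started", "Über uns"]}

print(sorted(page_ids))                 # ['getting-started', 'über-uns']
print("getting-started" in page_ids)    # True: a lowercase reference now matches
```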
dereuromark added a commit to markup-carve/carve-js that referenced this pull request Jun 7, 2026
Aligns slugify with carve spec #73: the auto heading-id rule now applies
the jgm/djot#393 run-replacement (each maximal run of non-alphanumeric
ASCII -> single '-', trimmed) and then lowercases (Unicode-aware), keeping
non-ASCII verbatim. Lowercasing makes ids and the common </#id> / [Heading][]
cross-references case-insensitive with no special lookup logic.

ASCII transliteration is no longer the default: it moves behind an opt-in
slugify(text, true), surfaced as the asciiHeadingIds parse option (carveToHtml
/ resolve). Empty-text and digit-leading slugs use the s / s- fallback.

Bumps the spec submodule to the #73 lowercase corpus and updates the slug
unit tests accordingly.
dereuromark added a commit to markup-carve/carve-php that referenced this pull request Jun 7, 2026
Aligns the heading-id slug with carve spec #73. normalizeId now applies
the jgm/djot#393 run rule (each maximal run of non-alphanumeric ASCII ->
single '-', trimmed) and then lowercases (Unicode-aware), keeping non-ASCII
characters verbatim. Empty-text and digit-leading slugs use the s / s-
fallback. This makes ids and </#id> / [Heading][] cross-references
case-insensitive with no special lookup logic.

ASCII transliteration is no longer the default. It moves behind a new
AsciiHeadingIdsExtension, which sets a slug transformer on both the
render-time tracker and the parse-time tracker (so implicit [Heading][]
references resolve to the folded ids).

Also fixes renderHeadingRef: an unresolved </#id> now renders as its
literal source text instead of a dangling self-link, matching the spec and
carve-js. This surfaced once ASCII refs stopped matching non-ASCII ids.

Bumps the spec submodule to the #73 lowercase corpus and promotes the new
81-list-lazy-continuation category (already produced byte-identically).
dereuromark added a commit to markup-carve/carve that referenced this pull request Jun 21, 2026
A docs-wide clarity pass turned up factual drift between the case-study narrative
and the current grammar/corpus:

- case-study/syntax.md (heading ids): the algorithm + examples said ids are
  NFC-normalized and lowercased by default and "deliberately diverge from djot".
  The grammar (PART 9, jgm/djot#393) and corpus are the opposite: ids are
  CASE-PRESERVING with NO Unicode normalization, cross-refs resolve
  case-insensitively, and lowercasing is an opt-in transform. Rewrite the steps,
  the example table (Getting-Started, Café-Crème, Über-uns, s-2024-Recap, …),
  and the framing to match. (comparison.md / divergence-from-djot.md were
  already correct; syntax.md was the lone stale holdout.)
- extensions.md: the footnote note implied inline footnotes are deferred. Inline
  `^[content]` is implemented Tier-1 core; only the sidenote `[>content]` form is
  deferred. Clarify.
- native-features-analysis.md: label inline footnotes Tier-1 core (a carve
  addition), not "extension".
3 participants