Specify punctuation handling for auto-generated heading IDs by karlb · Pull Request #393 · jgm/djot

karlb · 2026-05-18T07:50:22Z

Replace each maximal run of non-alphanumeric ASCII characters with `-` and trim leading/trailing `-`. This settles remove-vs-replace as replace and removes the previous exceptions for individual punctuation characters (jgm#391).
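The rule in this description can be sketched in a few lines. This is a minimal illustration of the wording above only; the function name `heading_id` is hypothetical, and the other steps of ID generation in the spec (such as de-duplication) are left out.

```python
import re

# One or more ASCII characters that are neither letters nor digits.
# Code points above U+007F are outside the class, so they pass through untouched.
_ASCII_NON_ALNUM_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def heading_id(text: str) -> str:
    """Replace each maximal run of non-alphanumeric ASCII with '-', then trim '-'."""
    return _ASCII_NON_ALNUM_RUN.sub("-", text).strip("-")


print(heading_id("Bob's Guide"))   # Bob-s-Guide
print(heading_id("under_score"))   # under-score
print(heading_id("Über café"))     # Über-café
```

Because a whole run collapses to a single `-`, inputs such as `a--b` and `a, b` both give `a-b`.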

dereuromark · 2026-05-19T17:21:25Z

Non-ASCII punctuation / smart quotes in IDs — intended?

"non-alphanumeric ASCII" only targets ASCII, so any non-ASCII code point is preserved — including non-ASCII punctuation. With smart punctuation enabled (the default in many setups), straight quotes/apostrophes become U+2019 / U+201C / U+201D before ID generation, so:

| Heading (smart punctuation on) | Resulting ID (prose and djot.js) |
| --- | --- |
| `# Bob's Guide` | `Bob’s-Guide` |
| `# Say "Hello"` | `Say-“Hello”` |
| `# Pros — and cons` | `Pros-—-and-cons` |

These are valid CSS identifiers (code points ≥ U+0080 are fine), and consistent reference resolution still works, but Bob’s-Guide as a URL fragment is unexpected for very common input — and it's almost certainly not what the "non-alphanumeric ASCII" wording was trying to express. The intent of that phrasing seems to be "preserve non-ASCII letters" (international headings like # Über uns, # 日本語の見出し), not "preserve smart quotes and dashes".

Would it be worth distinguishing non-ASCII letters/digits (preserved) from non-ASCII punctuation and symbols (replaced like their ASCII counterparts)? That keeps # Über uns → Über-uns while giving # Bob's Guide → Bob-s-Guide, which matches what most users expect from a heading anchor.
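For illustration, the distinction proposed here could look like the sketch below. It assumes Python's Unicode-aware `str.isalnum()` as the classifier, so it depends on the Unicode tables built into the runtime; the function name is hypothetical and this is not what the PR specifies.

```python
import re


def heading_id_unicode_aware(text: str) -> str:
    """Keep letters and digits from any script; treat everything else as a separator."""
    # str.isalnum() is Unicode-aware: True for "Ü" and "日", False for "’" and "—".
    chars = [c if c.isalnum() else "-" for c in text]
    # Collapse runs of '-' into one and trim the ends.
    return re.sub(r"-+", "-", "".join(chars)).strip("-")


print(heading_id_unicode_aware("Über uns"))      # Über-uns
print(heading_id_unicode_aware("Bob’s Guide"))   # Bob-s-Guide
```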

An important consideration: links copied and shared via websites, tools, or the like will usually be "auto-linked".
With non-trivial anchors, the result can easily be a broken link (cut off partway through).

When an anchor leaves the page (copied into Slack/email/a chat/another markdown doc), it gets auto-linked by a heuristic, not a parser.
Non-ASCII in the fragment breaks this three ways:

- Truncation. Many autolinkers stop the URL at the first "non-URL-looking" byte. …/page#Bob’s-Guide → the link often ends at #Bob (the ’ is the cut point). Silent broken link, looks fine to the sender.
- Percent-encode bloat + cut-off. Tools that do encode turn ’ into %E2%80%99, 日 into %E6%97%A5. A CJK heading becomes a long %XX%XX%XX… blob; some tools cap autolink length or break on the %. #Bob’s-Guide → #Bob%E2%80%99s-Guide.
- Normalization mismatch. The sender's ’ (U+2019) or an NFC/NFD difference gets re-normalized differently by the receiving app, so the pasted anchor no longer === the page's id. Scroll silently fails.
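The percent-encoding growth described above is easy to reproduce with the standard library. A small sketch (the helper name `encoded_fragment` is made up for this example):

```python
from urllib.parse import quote


def encoded_fragment(heading_id: str) -> str:
    """Percent-encode a heading ID the way a URL-encoding tool would."""
    # quote() encodes each non-ASCII character as the %XX form of its UTF-8 bytes.
    return quote(heading_id, safe="")


print(encoded_fragment("Bob’s-Guide"))          # Bob%E2%80%99s-Guide
print(len("日本語の見出し"))                      # 7
print(len(encoded_fragment("日本語の見出し")))    # 63
```

Each of the seven characters in the CJK heading takes three UTF-8 bytes, and each byte becomes three ASCII characters, hence the ninefold growth.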

As such, the safest option would be something at least close to ASCII.

Minor: empty / all-punctuation headings

The spec doesn't state what happens when the normalized text is empty (# !!!, # ###, an empty heading). djot.js falls back to the s / s-1 family; the prose says nothing. A one-line clarification ("if the result is empty, use s plus a numeric suffix") would let ports match without reverse-engineering the reference.
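A sketch of what such a clarification could pin down. The fallback base `s` and the `-N` suffix follow the "s / s-1 family" mentioned above, but the exact numbering is an assumption here and has not been checked against djot.js; `assign_ids` is a hypothetical helper.

```python
import re

_ASCII_NON_ALNUM_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def assign_ids(headings):
    """Give every heading a unique ID, falling back to 's' when nothing is left."""
    used = set()
    result = []
    for text in headings:
        # Assumed fallback: an empty normalized ID becomes "s".
        base = _ASCII_NON_ALNUM_RUN.sub("-", text).strip("-") or "s"
        candidate, n = base, 0
        # Assumed de-duplication: append -1, -2, ... until the ID is unused.
        while candidate in used:
            n += 1
            candidate = f"{base}-{n}"
        used.add(candidate)
        result.append(candidate)
    return result


print(assign_ids(["!!!", "###", "Intro", "Intro"]))
# ['s', 's-1', 'Intro', 'Intro-1']
```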

Downstream context (why we care)

In djot-php, heading IDs are consumed by querySelector(...) (HTMX scroll restoration, anchor lookups, JS-driven TOCs). getElementById tolerates almost anything, but querySelector('#…') does not without CSS.escape(), which in practice nobody does. So we already deliberately deviate: we replace the ' " ; : set, prefix a leading digit with h-, and use an s-N empty fallback — all documented as intentional. We're not asking the spec to adopt our deviations; we're asking for the prose and reference to agree, and for (2) to be pinned down, so the ports can converge instead of each carrying a different asterisk.
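To make the selector concern concrete, here is a rough check for whether an ID can be written as `#id` in a selector without escaping, plus the leading-digit prefix described above. The pattern is a simplified reading of the CSS identifier rule (it takes the thread's statement that code points ≥ U+0080 are fine at face value), and both function names are hypothetical.

```python
import re

# Simplified CSS identifier: optional leading '-', then a letter, '_' or
# non-ASCII character, followed by letters, digits, '_', '-' or non-ASCII.
_CSS_IDENT = re.compile(
    r"(?:--|-?[A-Za-z_\u0080-\U0010FFFF])[A-Za-z0-9_\-\u0080-\U0010FFFF]*"
)


def usable_in_selector(heading_id: str) -> bool:
    """True if '#' + heading_id should parse as an ID selector without escaping."""
    return _CSS_IDENT.fullmatch(heading_id) is not None


def selector_safe(heading_id: str, prefix: str = "h-") -> str:
    """Prefix IDs that start with an ASCII digit (handles only that one case)."""
    return prefix + heading_id if re.match(r"[0-9]", heading_id) else heading_id


print(usable_in_selector("Bob’s-Guide"))   # True
print(usable_in_selector("2024-Recap"))    # False
print(selector_safe("2024-Recap"))         # h-2024-Recap
```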

TL;DR: I lean more towards the approach in php-collective/djot-php#183.

karlb · 2026-05-20T07:31:28Z

> With smart punctuation enabled (the default in many setups), straight quotes/apostrophes become U+2019 / U+201C / U+201D before ID generation

I don't think this should be the case. From the spec:

> The identifier is formed by taking the plain text content of the heading

I read that as taking the input from the source before any further processing and it also seems to be what djot.js does.

> Would it be worth distinguishing non-ASCII letters/digits (preserved) from non-ASCII punctuation and symbols (replaced like their ASCII counterparts)?

I would try hard not to interpret unicode unless strictly necessary. AFAICS, Djot avoided doing that successfully so far.

> When an anchor leaves the page (copied into Slack/email/a chat/another markdown doc), it gets auto-linked by a heuristic, not a parser.

When I copy non-ASCII links from my browser, they get percent encoded. Which case do you have in mind that breaks? Copying from the HTML source?

jgm · 2026-05-20T09:27:56Z

Yes, see design goal #6 in the README.

dereuromark · 2026-05-20T15:14:26Z

> When I copy non-ASCII links from my browser, they get percent encoded

The length is dramatically increased, to the point that it's unreadable.

> Yes, see design goal #6 in the README.

Indeed. For any internal stuff I agree. But here, we deal with external (web) URLs and we need to be able to allow copying them and distributing them through a variety of outside systems without breaking too easily.

karlb · 2026-05-22T14:29:35Z

> > When I copy non-ASCII links from my browser, they get percent encoded
>
> The length is dramatically increased, to the point that it's unreadable.

Which alternative do you prefer? Replace all non-ASCII characters, so that CJK languages just get s-1, s-2, etc as heading ids? That would safely pass through all systems, but would lack any information about the link target and could silently link to wrong items when the heading structure changes. I would say showing percent-encoded IDs is undesirable but keeping the characters will give safer and more informative (when decoded) IDs. So I wonder, how many cases are there where:

- non-ASCII links are correctly used and shown to the user without encoding
- percent-encoded links are converted back to readable text (e.g. when following a link in the browser the URL is shown decoded)

because these are the cases where keeping the non-ASCII part of the heading IDs would be beneficial.

How relevant is this for non-HTML output? Any specific formats that have user-visible heading IDs that we should keep in mind?

With design goal 6, I don't see much choice other than keeping all or replacing all non-ASCII. So far, I still prefer the PR as-is (keep non-ASCII).

dereuromark · 2026-05-22T14:52:06Z

> Which alternative do you prefer? Replace all non‑ASCII … so that CJK languages just get s-1, s-2?

I don't think that's the only alternative to "keep everything". There's a third option that I think is actually the right one and it's what we ended up landing in php‑djot#183: transliteration. Concretely, run Any-Latin; Latin-ASCII (or its baked, table‑driven equivalent for environments without ICU), then run your existing ASCII rule on top:

| Heading | Spec‑literal id (this PR) | Transliterated id |
| --- | --- | --- |
| `# Über uns` | `Über-uns` | `Uber-uns` |
| `# café résumé` | `café-résumé` | `cafe-resume` |
| `# Привет мир` | `Привет-мир` | `Privet-mir` |
| `# 日本語の見出し` | `日本語の見出し` | `ri-ben-yuno-jian-chushi` (intl) / `s-N` (fallback) |
| `# Bob's Guide` (smart‑punct on) | `Bob’s-Guide` | `Bob-s-Guide` |

That keeps the information content of the heading (so refs are still meaningful and stable across reordering, unlike s-N), and it makes the resulting URL fragment survive the share‑and‑click path that an anchor is for. We baked the Latin/Cyrillic/punctuation subset into a static table generated from ICU and pinned engine parity in tests, so that part is byte‑deterministic with or without ext-intl — which is the property you'd want the spec to give you anyway. (Romanizing exotic scripts like CJK is the one tier that still needs ICU at runtime and tracks the ICU version; without it those headings fall back to s-N.)
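As a rough illustration of the accent-folding tier only, the sketch below uses Python's `unicodedata` to strip combining marks. It is not ICU's `Any-Latin; Latin-ASCII` and not the djot-php table: Cyrillic and CJK are not romanized here and simply drop out, which is where an `s-N` style fallback would be needed. Function names are hypothetical.

```python
import re
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters and drop combining marks (é -> e, Ü -> U)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ascii_heading_id(text: str) -> str:
    """Fold accents, then treat everything that is not an ASCII letter or digit as a separator."""
    return re.sub(r"[^0-9A-Za-z]+", "-", fold_accents(text)).strip("-")


print(ascii_heading_id("Über uns"))       # Uber-uns
print(ascii_heading_id("café résumé"))    # cafe-resume
print(repr(ascii_heading_id("Привет мир")))  # '' (no romanization in this sketch)
```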

> How many cases are there where non‑ASCII links are correctly used and shown to the user without encoding / percent‑encoded links are converted back to readable text?

The honest framing: non‑ASCII anchors work in the browser. The browser URL bar percent‑encodes on copy and decodes on display, and getElementById doesn't care. So if your use case is "click anchor on the page I'm currently reading", everything is fine and there's nothing to fix.

But that's not the primary use case of a heading id. The primary use case is the link leaving the page: pasted into Slack/Discord/Teams/WhatsApp, into an email body, into a Jira/Linear/Asana comment, into another markdown doc, into an LLM chat, into a tweet. None of those parse URLs — they autolink heuristically. Failure modes I've actually hit (not theoretical):

- Truncation at the first non‑ASCII byte (…#Bob for …#Bob’s-Guide): silent dead link, looks correct to the sender.
- Percent‑encoded blob length caps in chat previews (CJK headings become 30‑char %E…%E… runs and get cut).
- NFC/NFD round‑trips between macOS sender and Linux/Windows receiver where ’ (U+2019) on one side ≠ ' on the other side, so the fragment no longer equals any id on the page. Scroll silently fails.
- AI ingestion: an LLM asked to "open" a pasted URL will frequently chop the non‑ASCII tail off the fragment when it normalizes the string, again silently.
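The NFC/NFD part of this is straightforward to demonstrate: two strings that render identically can differ code point by code point, so a plain string comparison against the page's id fails. A minimal sketch using an accented heading as the example (the helper name is hypothetical):

```python
import unicodedata

page_id = unicodedata.normalize("NFC", "Café-Crème")   # precomposed é, è
pasted = unicodedata.normalize("NFD", "Café-Crème")    # e + combining accent


def same_anchor(a: str, b: str) -> bool:
    """Compare two fragments after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(page_id == pasted)             # False: exact match fails
print(len(page_id), len(pasted))     # 10 12
print(same_anchor(page_id, pasted))  # True once both are normalized
```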

So the set of cases where the non‑ASCII id is beneficial is narrow (browser address‑bar round‑trip on the same page). The set where it's harmful is "every time the link is shared", which is most of the lifetime of a deep link.

> How relevant is this for non‑HTML output?

Quite. A few I've run into:

- LaTeX: \label/\ref keys are conventionally ASCII; non‑ASCII compiles only with inputenc+hyperref configured exactly right, and named PDF destinations are ASCII by convention.
- EPUB / OPF: id is an XML NCName, which technically permits many Unicode letters, so it's a tolerance problem, not a validity one: most readers cope, several don't, and EPUBcheck warns on some.
- DocBook / Pandoc roundtrips: cross‑refs get mangled or warned about.
- Sitemaps / RSS / Atom: URLs must be IRI‑normalized; ids that don't percent‑encode cleanly produce validator warnings.

It's the same underlying issue as the share case: anything that isn't a browser tends to treat the fragment as ASCII or warn.

> With design goal 6, I don't see much choice other than keeping all or replacing all non‑ASCII.

I'd push back gently on the framing. The current rule already implicitly interprets Unicode — "non‑alphanumeric ASCII" means "leave the entire non‑ASCII Unicode space alone", which is itself a classification decision; it's just the most permissive one. So the spec already takes a position on Unicode, it just takes the easiest one.

Two ways forward that don't violate goal 6:

1. Implementation‑defined post‑step. Mandate the ASCII rule (this PR), and say explicitly that implementations may transliterate or strip non‑ASCII before applying it. That keeps the spec Unicode‑free while letting ports converge on portable output. Right now every port is going to deviate silently; making the deviation legible would help interop more than nailing it down would.
2. Static, published table. If you ever do want to specify it, ICU's Any-Latin; Latin-ASCII is a fixed published transform; you can bake it into a table and ship it as a normative file the same way Unicode itself ships data files. The spec text stays "apply this table, then the ASCII rule"; the spec doesn't have to know about Unicode beyond "here is a finite map".
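The "apply this table, then the ASCII rule" shape could be sketched as follows. The table here is a tiny hand-written excerpt for illustration, not generated from ICU, and letting unmapped characters pass through unchanged is an assumption of this sketch.

```python
import re

_ASCII_NON_ALNUM_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")

# Illustrative excerpt only; a real table would be a published, generated file.
FOLD_TABLE = {
    "Ü": "U", "ü": "u", "é": "e", "è": "e",
    "’": "'", "“": '"', "”": '"', "—": "-",
}
_TRANSLATION = str.maketrans(FOLD_TABLE)


def heading_id_with_table(text: str) -> str:
    """Step 1: finite character map. Step 2: the ASCII run rule from this PR."""
    folded = text.translate(_TRANSLATION)
    return _ASCII_NON_ALNUM_RUN.sub("-", folded).strip("-")


print(heading_id_with_table("Über uns"))      # Uber-uns
print(heading_id_with_table("Bob’s Guide"))   # Bob-s-Guide
print(heading_id_with_table("日本語"))         # 日本語 (not in the excerpt)
```

The lookup itself needs no knowledge of Unicode character classes, only a finite map from characters to replacement strings.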

> The spec doesn't state what happens when the normalized text is empty (# !!!, # ###, an empty heading).

Worth resolving on this PR regardless of which direction you take on the non‑ASCII question — it's a real interop gap. One‑liner like "if the resulting id is empty, implementations generate a fallback id such as s followed by the heading's index" matches what djot.js does today and lets ports stop reverse‑engineering it.

Net of all that: I'd still gently lean against merging this PR #393 as written, because by specifying only the ASCII characters it leaves "preserve all non‑ASCII punctuation" as the de‑facto normative behavior — which is the one piece I think is unambiguously not what most users want. If you want the spec to stay Unicode‑agnostic, leaving non‑ASCII handling implementation‑defined (option 1 above) is the lighter‑touch fix and lets transliterating ports like php‑djot stay conformant.

So 👍 from me.

karlb · 2026-05-22T16:30:28Z

This PR does not change anything about non-ASCII characters compared to previous version. So we can merge it without any regression in that area, if we agree on the ASCII part. That should help to keep the discussions focused.

jgm · 2026-05-22T21:37:27Z

> I'd push back gently on the framing. The current rule already implicitly interprets Unicode — "non‑alphanumeric ASCII" means "leave the entire non‑ASCII Unicode space alone", which is itself a classification decision; it's just the most permissive one. So the spec already takes a position on Unicode, it just takes the easiest one.

To recognize whether a character is in the ASCII set, you just need <= 127, and you don't even need to recognize Unicode code points; assuming UTF-8, you can just check the byte. Recognizing character classes (e.g. telling alphanumerics from symbols) is an entirely different matter. Let alone transliteration! (Even with all the Unicode data, that will only help you with accented characters, not with Russian, Chinese, Greek, or even Turkish.)
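The byte-level point can be shown directly: in UTF-8, every byte of a multi-byte character is ≥ 0x80, so a rule that only touches bytes ≤ 0x7F never needs to decode anything. A minimal sketch operating on raw bytes (the function name is hypothetical):

```python
import re

# ASCII bytes that are not 0-9, A-Z or a-z. Bytes >= 0x80 never match, so
# multi-byte UTF-8 sequences are copied through without being decoded.
_ASCII_NON_ALNUM_BYTES = re.compile(rb"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def heading_id_bytes(utf8: bytes) -> bytes:
    """Apply the run-replacement rule to UTF-8 bytes without decoding them."""
    return _ASCII_NON_ALNUM_BYTES.sub(b"-", utf8).strip(b"-")


raw = "Über uns: Bob’s Guide".encode("utf-8")
print(heading_id_bytes(raw).decode("utf-8"))   # Über-uns-Bob’s-Guide
```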

Yes, it is the easiest classifying position. That is the design goal -- to be easy -- not to require parsers to have access to huge tables of Unicode data, or libraries that include these.

I would say: as far as djot is concerned, identifiers with non-ASCII characters are fine. They are explicitly allowed in HTML. If your target format has stricter requirements, e.g. ASCII-only, or must start with a letter, you can easily handle this with an AST transformation or a transformation in the writer. (That's what pandoc does, for example, with LaTeX.)
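A sketch of such a writer-side transformation, under stated assumptions: the AST shape (nested dicts with `attrs`, `destination`, and `children` keys) is hypothetical and not djot's actual AST, and the `uXXXX` escaping scheme is just one possible way to get ASCII-only labels. The point is that ids and the links that target them are rewritten together.

```python
def ascii_label(identifier: str) -> str:
    """Make an id ASCII-only and letter-initial (illustrative scheme)."""
    out = "".join(
        c if (c.isascii() and c.isalnum()) or c == "-" else f"u{ord(c):04X}"
        for c in identifier
    )
    return out if out[:1].isalpha() else "s-" + out


def walk(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def make_ids_ascii(doc):
    """Rewrite every id, then every internal link that pointed at an old id."""
    renamed = {}
    for node in walk(doc):
        old = node.get("attrs", {}).get("id")
        if old is not None:
            renamed[old] = node["attrs"]["id"] = ascii_label(old)
    for node in walk(doc):
        dest = node.get("destination", "")
        if dest.startswith("#") and dest[1:] in renamed:
            node["destination"] = "#" + renamed[dest[1:]]
    return doc


doc = {"tag": "doc", "children": [
    {"tag": "section", "attrs": {"id": "Über-uns"}, "children": []},
    {"tag": "link", "destination": "#Über-uns", "children": []},
]}
make_ids_ascii(doc)
print(doc["children"][0]["attrs"]["id"])   # u00DCber-uns
print(doc["children"][1]["destination"])   # #u00DCber-uns
```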

I would also say that it is almost always better to use explicit identifiers if you plan to link to them or use them in formatting rules. That way, things won't break if you change the wording of the section title, or add another section ahead of it with the same title.

jgm · 2026-05-22T22:21:16Z

Another option would be to remove the description of automatic identifiers from the spec. This would allow implementations to provide such identifiers using whatever scheme makes most sense to them...or not to generate them at all (though it would still be necessary to generate identifiers for any sections that are linked to using the [Section title][] format). In this case, portable djot would need to use explicit identifiers whenever links to headings are used.

dereuromark · 2026-05-22T22:27:33Z

I am with @karlb on the middle ground for now. It at least allows a bit less divergence across implementations.
So totally fine with the PR as is.

And true: for anything you link across systems, explicit {#id} is the robust answer regardless.

…ension (#224)

* Align heading-id slugs with jgm/djot#393; add opt-in asciiHeadingIds Default heading-id generation now follows the settled #393 rule: replace each maximal run of non-alphanumeric ASCII with a single '-' and trim leading/trailing '-', preserving letter case and all non-ASCII characters. This drops the previous always-on ASCII transliteration and the '_' exception, so 'Über café' becomes 'Über-café', 'under_score' becomes 'under-score', 'a--b' becomes 'a-b'. A leading-digit result keeps the 'h-' prefix for CSS-selector safety (orthogonal to #393, which governs punctuation only). ASCII transliteration is now opt-in via a new asciiHeadingIds option on DjotConverter, threaded to both the renderer and the parser's reference-resolution pass so heading ids stay in parity. With it enabled, 'Über café' becomes 'uber-cafe'. The official djot test suite is unchanged (its id cases are simple ASCII).
* Add asciiHeadingIds to DjotConverter constructor docblock (phpcs)
* Make ASCII heading ids a pluggable transform + extension; unify prefix to s- Replace the baked-in asciiHeadingIds bool with a pluggable id transform on HeadingIdTracker (Closure(string): string). The core stays pure jgm/djot#393 (unicode-preserving); ASCII folding now ships as AsciiHeadingIdsExtension, which sets the transform on both the renderer's tracker and the parser's heading-reference resolution so section ids and [Heading][] link targets stay in parity. The transform runs over the spec id and is re-slugged afterwards, so a transform that reintroduces separators (e.g. CJK romanization "ri ben yu") still yields a clean id. Also unify the leading-digit prefix from h- to s- so it matches the empty-heading s-N fallback (one prefix convention).
* Fix @var annotation phpcbf mangled into invalid PHPStan syntax
* docs: heading-id #393 default + AsciiHeadingIdsExtension Rewrite the heading-id reference section for the jgm/djot#393 default (letter case and non-ASCII preserved, s- prefix for leading digits, s-N empty fallback) and document AsciiHeadingIdsExtension as the opt-in ASCII-folding path, including its parser/renderer parity and the note that registration order does not matter.
* Restore heading-id transformer Closure signature in a phpcbf-safe form The headingIdTransformer property lost its (string): string signature when phpcbf mangled the single-line annotation (6ea8855) into invalid syntax and the workaround dropped to a bare \Closure|null. Restore the signature with the two-line var + phpstan-var pattern already used by FrontmatterExtension: the plain var stays \Closure|null (phpcbf leaves it untouched) and the parenthesized phpstan-var carries the full (\Closure(string): string)|null. Verified phpcbf no longer mangles it; phpcs and phpstan are clean.
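The "transform runs over the spec id and is re-slugged afterwards" idea from the commit message above can be sketched like this. It is a Python analogue written for illustration, not the djot-php code; the function names and the toy romanizer are hypothetical.

```python
import re
from typing import Callable, Optional

_ASCII_NON_ALNUM_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def spec_id(text: str) -> str:
    """The run-replacement rule: non-alphanumeric ASCII runs become '-', ends trimmed."""
    return _ASCII_NON_ALNUM_RUN.sub("-", text).strip("-")


def heading_id(text: str, transform: Optional[Callable[[str], str]] = None) -> str:
    """Compute the spec id, optionally transform it, then re-slug the result."""
    result = spec_id(text)
    if transform is not None:
        # Re-slugging cleans up separators the transform may have introduced.
        result = spec_id(transform(result))
    return result


def toy_romanizer(s: str) -> str:
    # Stand-in for a real romanization step; handles one hard-coded string.
    return s.replace("日本語", "ri ben yu")


print(heading_id("日本語"))                  # 日本語
print(heading_id("日本語", toy_romanizer))   # ri-ben-yu
```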

Re-vendor carve-lib from the #393 carve-js build and regenerate the heading-id corpus examples. Auto ids now preserve letter case and all non-ASCII characters, replace each maximal run of non-alphanumeric ASCII with a single '-', and use an 's-' prefix for a leading digit (previously lowercase + ASCII-fold + 'section-'). Add a NORMATIVE heading-identifier rule to resources/grammar.ebnf and update the case-study and feature docs. ASCII folding becomes an opt-in implementation mode (carve-js 'asciiHeadingIds' parse option; carve-php 'AsciiHeadingIdsExtension'). Part of the heading-id slug resolution (#71).

#73) #72 adopted jgm/djot#393's case-preserving ids, but case preservation broke </#id> cross-references: a lowercase ref no longer matched a case-preserved id (e.g. `# Getting Started` -> `Getting-Started`, so `</#getting-started>` failed). Switch carve to LOWERCASE (Unicode-aware; non-ASCII characters preserved, only their case folded) - the GitHub/SSG convention authors expect for anchors, which makes ids and the common cross-reference case-insensitive with no special lookup. Re-vendor the lowercase carve-lib, regenerate the heading-id corpus examples, and update the normative grammar rule + case-study/feature docs. carve now deliberately diverges from djot-php (case-preserving per #393); both keep the opt-in ASCII fold (carve-js asciiHeadingIds; carve-php AsciiHeadingIdsExtension). Refs #71.

Aligns slugify with carve spec #73: the auto heading-id rule now applies the jgm/djot#393 run-replacement (each maximal run of non-alphanumeric ASCII -> single '-', trimmed) and then lowercases (Unicode-aware), keeping non-ASCII verbatim. Lowercasing makes ids and the common </#id> / [Heading][] cross-references case-insensitive with no special lookup logic. ASCII transliteration is no longer the default: it moves behind an opt-in slugify(text, true), surfaced as the asciiHeadingIds parse option (carveToHtml / resolve). Empty-text and digit-leading slugs use the s / s- fallback. Bumps the spec submodule to the #73 lowercase corpus and updates the slug unit tests accordingly.

Aligns the heading-id slug with carve spec #73. normalizeId now applies the jgm/djot#393 run rule (each maximal run of non-alphanumeric ASCII -> single '-', trimmed) and then lowercases (Unicode-aware), keeping non-ASCII characters verbatim. Empty-text and digit-leading slugs use the s / s- fallback. This makes ids and </#id> / [Heading][] cross-references case-insensitive with no special lookup logic. ASCII transliteration is no longer the default. It moves behind a new AsciiHeadingIdsExtension, which sets a slug transformer on both the render-time tracker and the parse-time tracker (so implicit [Heading][] references resolve to the folded ids). Also fixes renderHeadingRef: an unresolved </#id> now renders as its literal source text instead of a dangling self-link, matching the spec and carve-js. This surfaced once ASCII refs stopped matching non-ASCII ids. Bumps the spec submodule to the #73 lowercase corpus and promotes the new 81-list-lazy-continuation category (already produced byte-identically).
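Read literally, the carve rule described in the two commit messages above (run replacement, then Unicode-aware lowercasing, with an `s` / `s-` fallback) might look like the sketch below. It is based only on the wording of these messages, not on the carve-js or carve-php sources, and the function name is hypothetical.

```python
import re

_ASCII_NON_ALNUM_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def carve_style_slug(text: str) -> str:
    """Run replacement, then lowercase; non-ASCII characters are kept, only case-folded."""
    slug = _ASCII_NON_ALNUM_RUN.sub("-", text).strip("-").lower()
    if not slug:
        return "s"              # empty-text fallback
    if slug[0] in "0123456789":
        return "s-" + slug      # digit-leading fallback
    return slug


print(carve_style_slug("Getting Started"))   # getting-started
print(carve_style_slug("Über uns"))          # über-uns
print(carve_style_slug("2024 Recap"))        # s-2024-recap
```

With ids lowercased at generation time, a lowercase reference such as `</#getting-started>` matches by plain string equality.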

A docs-wide clarity pass turned up factual drift between the case-study narrative and the current grammar/corpus:

- case-study/syntax.md (heading ids): the algorithm + examples said ids are NFC-normalized and lowercased by default and "deliberately diverge from djot". The grammar (PART 9, jgm/djot#393) and corpus are the opposite: ids are CASE-PRESERVING with NO Unicode normalization, cross-refs resolve case-insensitively, and lowercasing is an opt-in transform. Rewrite the steps, the example table (Getting-Started, Café-Crème, Über-uns, s-2024-Recap, …), and the framing to match. (comparison.md / divergence-from-djot.md were already correct; syntax.md was the lone stale holdout.)
- extensions.md: the footnote note implied inline footnotes are deferred. Inline `^[content]` is implemented Tier-1 core; only the sidenote `[>content]` form is deferred. Clarify.
- native-features-analysis.md: label inline footnotes Tier-1 core (a carve addition), not "extension".

karlb mentioned this pull request May 18, 2026

Heading auto-ID: clarify punctuation handling (remove vs replace) #391

Open

dereuromark mentioned this pull request May 19, 2026

Exclude symbols from auto-generated heading identifiers php-collective/djot-php#181

Merged

This was referenced May 19, 2026

[discussion only — not for merge] Spec-literal jgm/djot#393 heading IDs (preserves non-ASCII) php-collective/djot-php#182

Closed

ASCII-safe heading IDs (transliteration; intl + deterministic baked-map fallback) php-collective/djot-php#183

Merged

dereuromark mentioned this pull request Jun 6, 2026

Align heading-id slugs with jgm/djot#393; pluggable ASCII-folding extension php-collective/djot-php#224

Merged

dereuromark mentioned this pull request Jun 6, 2026

Heading ids: adopt jgm/djot # 393 (case + non-ASCII preserved) markup-carve/carve#72

Merged

dereuromark mentioned this pull request Jun 6, 2026

Heading ids: lowercase (GitHub-style), superseding case-preserving #72 markup-carve/carve#73

Merged

This was referenced Jun 7, 2026

Heading ids: lowercase (GitHub-style) + opt-in ASCII fold markup-carve/carve-js#81

Merged

Heading ids: lowercase (GitHub-style) default + opt-in ASCII fold markup-carve/carve-php#68

Merged

dereuromark mentioned this pull request Jun 21, 2026

docs: fix stale heading-id + footnote facts (clarity audit) markup-carve/carve#193

Merged

karlb wants to merge 1 commit into jgm:main from karlb:heading-auto-id

3 participants
