Specify punctuation handling for auto-generated heading IDs #393
Replace each maximal run of non-alphanumeric ASCII characters with `-` and trim leading/trailing `-`. This settles remove-vs-replace as replace and removes the previous exceptions for individual punctuation characters (jgm#391).
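As a rough illustration of the rule described above, here is a minimal Python sketch. The function name `heading_id` is hypothetical, and the sketch covers only the punctuation rule from this PR, not other steps an implementation may apply (such as de-duplicating repeated IDs).

```python
import re

# Maximal runs of characters that are ASCII but not in [A-Za-z0-9].
# Code points >= U+0080 fall outside this class and pass through untouched.
_NON_ALNUM_ASCII_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def heading_id(text: str) -> str:
    """Hypothetical helper: replace each run with '-' and trim '-' at both ends."""
    return _NON_ALNUM_ASCII_RUN.sub("-", text).strip("-")


print(heading_id("Hello, World!"))  # Hello-World
print(heading_id("under_score"))    # under-score
print(heading_id("a--b"))           # a-b
print(heading_id("Über café"))      # Über-café
print(heading_id("Don\u2019t stop"))  # Don’t-stop (U+2019 is non-ASCII, so it is kept)
```

Note that the check needs only code point ranges, with no Unicode tables; the last line shows the consequence that a typographic apostrophe survives while a straight one would become `-`.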
Non-ASCII punctuation / smart quotes in IDs: intended?
"non-alphanumeric ASCII" only targets ASCII, so any non-ASCII code point is preserved, including non-ASCII punctuation. With smart punctuation enabled (the default in many setups), straight quotes/apostrophes become U+2019 / U+201C / U+201D before ID generation, so:
These are valid CSS identifiers (code points ≥ U+0080 are fine), and consistent reference resolution still works, but
Would it be worth distinguishing non-ASCII letters/digits (preserved) from non-ASCII punctuation and symbols (replaced like their ASCII counterparts)? That keeps
An important consideration: links copied and shared via websites, tools, or the like will usually be "auto-linked". When an anchor leaves the page (copied into Slack/email/a chat/another markdown doc), it gets auto-linked by a heuristic, not a parser. As such, the safest option would be something at least close to ASCII.
Minor: empty / all-punctuation headings
The spec doesn't state what happens when the normalized text is empty (
Downstream context (why we care)
In djot-php, heading IDs are consumed by
TLDR
I don't think this should be the case. From the spec:
I read that as taking the input from the source before any further processing and it also seems to be what djot.js does.
I would try hard not to interpret unicode unless strictly necessary. AFAICS, Djot avoided doing that successfully so far.
When I copy non-ASCII links from my browser, they get percent encoded. Which case do you have in mind that breaks? Copying from the HTML source? |
Yes, see design goal #6 in the README. |
The length is dramatically increased, to the point that it's unreadable.
Indeed. For any internal stuff I agree. But here, we deal with external (web) URLs and we need to be able to allow copying them and distributing them through a variety of outside systems without breaking too easily. |
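To make the length concern concrete, the following sketch percent-encodes a few sample IDs with Python's standard `urllib.parse.quote`. The helper name `encoded_fragment` and the sample IDs are illustrative only, and browsers use a slightly different encode set for fragments, so treat the exact output as an approximation of what a copied link looks like.

```python
from urllib.parse import quote, unquote


def encoded_fragment(heading_id: str) -> str:
    """Percent-encode the UTF-8 bytes of an ID (illustrative helper)."""
    return quote(heading_id, safe="")


for ident in ["Uber-cafe", "Über-café", "日本語"]:
    enc = encoded_fragment(ident)
    assert unquote(enc) == ident  # the encoding round-trips losslessly
    print(f"{ident!r}: {len(ident)} chars -> {enc!r}: {len(enc)} chars")

# 'Uber-cafe': 9 chars -> 'Uber-cafe': 9 chars
# 'Über-café': 9 chars -> '%C3%9Cber-caf%C3%A9': 19 chars
# '日本語': 3 chars -> '%E6%97%A5%E6%9C%AC%E8%AA%9E': 27 chars
```

Each non-ASCII character costs three output characters per UTF-8 byte, so a CJK heading grows roughly ninefold.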
Which alternative do you prefer? Replace all non-ASCII characters, so that CJK languages just get
How relevant is this for non-HTML output? Any specific formats that have user-visible heading IDs that we should keep in mind? With design goal 6, I don't see much choice other than keeping all or replacing all non-ASCII. So far, I still prefer the PR as-is (keep non-ASCII). |
I don't think that's the only alternative to "keep everything". There's a third option that I think is actually the right one and it's what we ended up landing in php‑djot#183: transliteration. Concretely, run
That keeps the information content of the heading (so refs are still meaningful and stable across reordering, unlike
The honest framing: non-ASCII anchors work in the browser. The browser URL bar percent-encodes on copy and decodes on display, and
But that's not the primary use case of a heading id. The primary use case is the link leaving the page: pasted into Slack/Discord/Teams/WhatsApp, into an email body, into a Jira/Linear/Asana comment, into another markdown doc, into an LLM chat, into a tweet. None of those parse URLs; they autolink heuristically.
Failure modes I've actually hit (not theoretical):
So the set of cases where the non‑ASCII id is beneficial is narrow (browser address‑bar round‑trip on the same page). The set where it's harmful is "every time the link is shared", which is most of the lifetime of a deep link.
Quite. A few I've run into:
It's the same underlying issue as the share case: anything that isn't a browser tends to treat the fragment as ASCII or warn.
I'd push back gently on the framing. The current rule already implicitly interprets Unicode — "non‑alphanumeric ASCII" means "leave the entire non‑ASCII Unicode space alone", which is itself a classification decision; it's just the most permissive one. So the spec already takes a position on Unicode, it just takes the easiest one. Two ways forward that don't violate goal 6:
Worth resolving on this PR regardless of which direction you take on the non-ASCII question: it's a real interop gap. One-liner like "if the resulting id is empty, implementations generate a fallback id such as
Net of all that: I'd still gently lean against merging this PR #393 as written, because by specifying only the ASCII characters it leaves "preserve all non-ASCII punctuation" as the de-facto normative behavior, which is the one piece I think is unambiguously not what most users want. If you want the spec to stay Unicode-agnostic, leaving non-ASCII handling implementation-defined (option 1 above) is the lighter-touch fix and lets transliterating ports like php-djot stay conformant.
So 👍 from me.
This PR does not change anything about non-ASCII characters compared to previous version. So we can merge it without any regression in that area, if we agree on the ASCII part. That should help to keep the discussions focused. |
To recognize whether a character is in the ASCII set, you just need
Yes, it is the easiest classifying position. That is the design goal -- to be easy -- not to require parsers to have access to huge tables of Unicode data, or libraries that include these.
I would say: as far as djot is concerned, identifiers with non-ASCII characters are fine. They are explicitly allowed in HTML. If your target format has stricter requirements, e.g. ASCII-only, or must start with a letter, you can easily handle this with an AST transformation or a transformation in the writer. (That's what pandoc does, for example, with LaTeX.)
I would also say that it is almost always better to use explicit identifiers if you plan to link to them or use them in formatting rules. That way, things won't break if you change the wording of the section title, or add another section ahead of it with the same title.
Another option would be to remove the description of automatic identifiers from the spec. This would allow implementations to provide such identifiers using whatever scheme makes most sense to them...or not to generate them at all (though it would still be necessary to generate identifiers for any sections that are linked to using the |
I am with @karlb on the middle ground for now. It at least allows a bit less divergence across implementations. And true: for anything you link across systems, explicit {#id} is the robust answer regardless. |
…ension (#224)
* Align heading-id slugs with jgm/djot#393; add opt-in asciiHeadingIds
  Default heading-id generation now follows the settled #393 rule: replace each maximal run of non-alphanumeric ASCII with a single '-' and trim leading/trailing '-', preserving letter case and all non-ASCII characters. This drops the previous always-on ASCII transliteration and the '_' exception, so 'Über café' becomes 'Über-café', 'under_score' becomes 'under-score', 'a--b' becomes 'a-b'. A leading-digit result keeps the 'h-' prefix for CSS-selector safety (orthogonal to #393, which governs punctuation only). ASCII transliteration is now opt-in via a new asciiHeadingIds option on DjotConverter, threaded to both the renderer and the parser's reference-resolution pass so heading ids stay in parity. With it enabled, 'Über café' becomes 'uber-cafe'. The official djot test suite is unchanged (its id cases are simple ASCII).
* Add asciiHeadingIds to DjotConverter constructor docblock (phpcs)
* Make ASCII heading ids a pluggable transform + extension; unify prefix to s-
  Replace the baked-in asciiHeadingIds bool with a pluggable id transform on HeadingIdTracker (Closure(string): string). The core stays pure jgm/djot#393 (unicode-preserving); ASCII folding now ships as AsciiHeadingIdsExtension, which sets the transform on both the renderer's tracker and the parser's heading-reference resolution so section ids and [Heading][] link targets stay in parity. The transform runs over the spec id and is re-slugged afterwards, so a transform that reintroduces separators (e.g. CJK romanization "ri ben yu") still yields a clean id. Also unify the leading-digit prefix from h- to s- so it matches the empty-heading s-N fallback (one prefix convention).
* Fix @var annotation phpcbf mangled into invalid PHPStan syntax
* docs: heading-id #393 default + AsciiHeadingIdsExtension
  Rewrite the heading-id reference section for the jgm/djot#393 default (letter case and non-ASCII preserved, s- prefix for leading digits, s-N empty fallback) and document AsciiHeadingIdsExtension as the opt-in ASCII-folding path, including its parser/renderer parity and the note that registration order does not matter.
* Restore heading-id transformer Closure signature in a phpcbf-safe form
  The headingIdTransformer property lost its (string): string signature when phpcbf mangled the single-line annotation (6ea8855) into invalid syntax and the workaround dropped to a bare \Closure|null. Restore the signature with the two-line var + phpstan-var pattern already used by FrontmatterExtension: the plain var stays \Closure|null (phpcbf leaves it untouched) and the parenthesized phpstan-var carries the full (\Closure(string): string)|null. Verified phpcbf no longer mangles it; phpcs and phpstan are clean.
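The "transform, then re-slug" idea from the commit messages above can be sketched in Python as follows. This is not the djot-php code: `spec_id`, `heading_id`, and `ascii_fold` are hypothetical names, and `ascii_fold` is an assumed stand-in (NFKD decomposition, dropping whatever is still non-ASCII, lowercasing) for a real transliterator, which would romanize CJK text rather than discard it.

```python
import re
import unicodedata
from typing import Callable, Optional

_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def spec_id(text: str) -> str:
    """The #393 rule: runs of non-alphanumeric ASCII become '-', then trim."""
    return _RUN.sub("-", text).strip("-")


def ascii_fold(ident: str) -> str:
    """Assumed fold: strip diacritics via NFKD, drop remaining non-ASCII, lowercase."""
    decomposed = unicodedata.normalize("NFKD", ident)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def heading_id(text: str, transform: Optional[Callable[[str], str]] = None) -> str:
    ident = spec_id(text)
    if transform is not None:
        # Re-slug the transform's output so separators it introduces are cleaned up.
        ident = spec_id(transform(ident))
    return ident


print(heading_id("Über café"))                            # Über-café
print(heading_id("Über café", ascii_fold))                # uber-cafe
print(heading_id("日本語", lambda _ident: "ri ben yu"))    # ri-ben-yu
```

The last call uses a toy transform that returns a fixed romanization, only to show that the second `spec_id` pass turns the reintroduced spaces into `-`.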
Re-vendor carve-lib from the #393 carve-js build and regenerate the heading-id corpus examples. Auto ids now preserve letter case and all non-ASCII characters, replace each maximal run of non-alphanumeric ASCII with a single '-', and use an 's-' prefix for a leading digit (previously lowercase + ASCII-fold + 'section-'). Add a NORMATIVE heading-identifier rule to resources/grammar.ebnf and update the case-study and feature docs. ASCII folding becomes an opt-in implementation mode (carve-js 'asciiHeadingIds' parse option; carve-php 'AsciiHeadingIdsExtension'). Part of the heading-id slug resolution (#71).
#73) #72 adopted jgm/djot#393's case-preserving ids, but case preservation broke </#id> cross-references: a lowercase ref no longer matched a case-preserved id (e.g. `# Getting Started` -> `Getting-Started`, so `</#getting-started>` failed). Switch carve to LOWERCASE (Unicode-aware; non-ASCII characters preserved, only their case folded) - the GitHub/SSG convention authors expect for anchors, which makes ids and the common cross-reference case-insensitive with no special lookup. Re-vendor the lowercase carve-lib, regenerate the heading-id corpus examples, and update the normative grammar rule + case-study/feature docs. carve now deliberately diverges from djot-php (case-preserving per #393); both keep the opt-in ASCII fold (carve-js asciiHeadingIds; carve-php AsciiHeadingIdsExtension). Refs #71.
Aligns slugify with carve spec #73: the auto heading-id rule now applies the jgm/djot#393 run-replacement (each maximal run of non-alphanumeric ASCII -> single '-', trimmed) and then lowercases (Unicode-aware), keeping non-ASCII verbatim. Lowercasing makes ids and the common </#id> / [Heading][] cross-references case-insensitive with no special lookup logic. ASCII transliteration is no longer the default: it moves behind an opt-in slugify(text, true), surfaced as the asciiHeadingIds parse option (carveToHtml / resolve). Empty-text and digit-leading slugs use the s / s- fallback. Bumps the spec submodule to the #73 lowercase corpus and updates the slug unit tests accordingly.
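A small Python sketch of the lowercase variant described in this commit message, under stated assumptions: the name `carve_like_id` is hypothetical, and the exact fallback forms (`s` for an empty result, an `s-` prefix for a leading ASCII digit) are one reading of "the s / s- fallback", not confirmed behavior of carve-js.

```python
import re

_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def carve_like_id(text: str) -> str:
    # Run-replacement and trim, then Unicode-aware lowercasing (str.lower).
    slug = _RUN.sub("-", text).strip("-").lower()
    if not slug:
        return "s"           # assumed fallback for empty / all-punctuation text
    if "0" <= slug[0] <= "9":
        return "s-" + slug   # assumed prefix so the id does not start with a digit
    return slug


print(carve_like_id("Getting Started"))  # getting-started
print(carve_like_id("Über uns"))         # über-uns
print(carve_like_id("2024 Recap"))       # s-2024-recap
print(carve_like_id("?!"))               # s
```

Because every id is lowercased at generation time, a lowercase reference matches by plain string equality, which is the "no special lookup logic" property the message mentions.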
Aligns the heading-id slug with carve spec #73. normalizeId now applies the jgm/djot#393 run rule (each maximal run of non-alphanumeric ASCII -> single '-', trimmed) and then lowercases (Unicode-aware), keeping non-ASCII characters verbatim. Empty-text and digit-leading slugs use the s / s- fallback. This makes ids and </#id> / [Heading][] cross-references case-insensitive with no special lookup logic. ASCII transliteration is no longer the default. It moves behind a new AsciiHeadingIdsExtension, which sets a slug transformer on both the render-time tracker and the parse-time tracker (so implicit [Heading][] references resolve to the folded ids). Also fixes renderHeadingRef: an unresolved </#id> now renders as its literal source text instead of a dangling self-link, matching the spec and carve-js. This surfaced once ASCII refs stopped matching non-ASCII ids. Bumps the spec submodule to the #73 lowercase corpus and promotes the new 81-list-lazy-continuation category (already produced byte-identically).
A docs-wide clarity pass turned up factual drift between the case-study narrative and the current grammar/corpus: - case-study/syntax.md (heading ids): the algorithm + examples said ids are NFC-normalized and lowercased by default and "deliberately diverge from djot". The grammar (PART 9, jgm/djot#393) and corpus are the opposite: ids are CASE-PRESERVING with NO Unicode normalization, cross-refs resolve case-insensitively, and lowercasing is an opt-in transform. Rewrite the steps, the example table (Getting-Started, Café-Crème, Über-uns, s-2024-Recap, …), and the framing to match. (comparison.md / divergence-from-djot.md were already correct; syntax.md was the lone stale holdout.) - extensions.md: the footnote note implied inline footnotes are deferred. Inline `^[content]` is implemented Tier-1 core; only the sidenote `[>content]` form is deferred. Clarify. - native-features-analysis.md: label inline footnotes Tier-1 core (a carve addition), not "extension".
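The combination described here, case-preserving ids with case-insensitive cross-reference resolution, could look roughly like the Python sketch below. `make_id` and `HeadingIndex` are hypothetical names, using `str.casefold` as the comparison key is an assumption, and handling of duplicate headings is left out.

```python
import re
from typing import Dict, Optional

_RUN = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def make_id(text: str) -> str:
    """Case-preserving id with an 's-' prefix for a leading ASCII digit."""
    slug = _RUN.sub("-", text).strip("-")
    if slug and "0" <= slug[0] <= "9":
        slug = "s-" + slug
    return slug


class HeadingIndex:
    """Maps a case-folded key to the case-preserved id (first heading wins)."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def add(self, heading_text: str) -> str:
        ident = make_id(heading_text)
        self._by_key.setdefault(ident.casefold(), ident)
        return ident

    def resolve(self, ref: str) -> Optional[str]:
        return self._by_key.get(ref.casefold())


index = HeadingIndex()
for title in ["Getting Started", "Café Crème", "Über uns", "2024 Recap"]:
    index.add(title)

print(index.resolve("getting-started"))  # Getting-Started
print(index.resolve("über-uns"))         # Über-uns
print(index.resolve("S-2024-RECAP"))     # s-2024-Recap
print(index.resolve("missing"))          # None
```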