feat: match github's markdown fragment generation#2153

Open
katrinafyi wants to merge 23 commits into lycheeverse:master from rina-forks:fragment-slugify

Conversation

@katrinafyi
Member

@katrinafyi katrinafyi commented Apr 21, 2026

This copies the approach of Flet/github-slugger#56 for deriving fragment identifiers from markdown headings.

Basically, it deletes characters from certain Unicode categories, then lowercases the rest and replaces spaces with -, and finally disambiguates duplicate IDs with numeric suffixes.
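
As a rough illustration of the first two steps (deleting characters, then lowercasing and replacing spaces), here is a minimal sketch. It is not the code from this PR: `base_id` is a hypothetical name, and the filter is a crude stand-in that keeps only alphanumerics, spaces, hyphens, and underscores, whereas the real implementation works from a table of Unicode categories. The numbering step is left out.

```rust
/// Hypothetical helper, not the function added in this PR.
/// Derives a base fragment ID from heading text.
fn base_id(heading: &str) -> String {
    heading
        .chars()
        // ASSUMPTION: simplified stand-in for the real Unicode category filter.
        .filter(|c| c.is_alphanumeric() || matches!(*c, ' ' | '-' | '_'))
        // Lowercase one character at a time (no context-sensitive rules).
        .flat_map(char::to_lowercase)
        // Each space becomes its own hyphen; runs of spaces are not collapsed.
        .map(|c| if c == ' ' { '-' } else { c })
        .collect()
}
```

With this stand-in, `base_id("What's New?")` gives `whats-new`: the apostrophe and question mark are dropped, and the space becomes a hyphen.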

To check the test cases, you can see this gist or create your own markdown file on github.

Fixes #2112. This can be verified with

$ cargo run -- https://raw.githubusercontent.com/adamlui/js-utils/refs/heads/main/minify.js/node.js/docs/zh-cn/README.md  -vvvvv --include-fragments

...

   [200] https://raw.githubusercontent.com/adamlui/js-utils/refs/heads/main/minify.js/node.js/docs/zh-cn/README.md#%EF%B8%8F-mit-%E8%AE%B8%E5%8F%AF%E8%AF%81 (at 28:10)
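
To make the fragment in that log line readable, it can be percent-decoded. The sketch below uses only the standard library; `percent_decode` is a hypothetical helper written for this illustration and is not part of lychee.

```rust
/// Hypothetical helper for illustration only.
/// Decodes %XX escapes and returns None on a malformed escape or invalid UTF-8.
fn percent_decode(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // An escape needs exactly two hex digits after the '%'.
            let hex = bytes.get(i + 1..i + 3)?;
            if !hex.iter().all(|b| b.is_ascii_hexdigit()) {
                return None;
            }
            let hex = std::str::from_utf8(hex).ok()?;
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}
```

Decoding `%EF%B8%8F-mit-%E8%AE%B8%E5%8F%AF%E8%AF%81` yields U+FE0F (variation selector-16) followed by `-mit-许可证`. This suggests that the emoji itself is deleted from the heading while the variation selector that followed it is kept.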

Comment thread lychee-lib/src/extract/markdown.rs Outdated
katrinafyi and others added 7 commits April 21, 2026 23:13
so, rust `to_lowercase` actually does *more* transformations than github
does which leads to differences in certain weird cases.

for example, SpecialCasing.txt says greek `Σ` should lowercase to `σ` in
most cases but `ς` when it ends a word. rust does this but github does
not.
it could be one-copy, but .to_lowercase() is probably much faster on
strings than individual characters
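
The difference described in these commit messages can be reproduced with the standard library alone. The two helper names below are hypothetical and only contrast the two approaches; they say nothing about which one the PR ends up using.

```rust
/// Lowercases the whole string at once. `str::to_lowercase` applies the
/// context-sensitive final-sigma rule from SpecialCasing.txt.
fn lower_whole(s: &str) -> String {
    s.to_lowercase()
}

/// Lowercases one character at a time. `char::to_lowercase` has no context,
/// so a capital sigma always becomes the non-final form.
fn lower_per_char(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}
```

For the input `ΟΔΟΣ`, `lower_whole` returns `οδος` (ending in `ς`), while `lower_per_char` returns `οδοσ` (ending in `σ`). According to the commit message above, github matches the second form.
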
Member

@mre mre left a comment

Just one more fly-by comment. ;)

Comment thread lychee-lib/src/extract/fragments.rs
Comment thread lychee-lib/src/extract/fragments.rs Outdated
Comment on lines +97 to +99
    if this_suffix.is_some() {
        self.next_suffixes.insert(candidate.clone(), ONE);
    }
Member

Can you help me understand this logic?

If base_id is "foo" and it's seen for the second time, candidate becomes "foo-1". You then insert "foo-1" -> 1 into the map. If the very next heading in the document is actually titled "# foo 1", the generator will:

  1. Generate base_id = "foo-1".
  2. See that "foo-1" is already in the map.
  3. Increment it and produce "foo-1-1".

Doesn't this create a "jump" in numbering if headings naturally collide with generated suffixes? Would that be an issue?

Member Author

@katrinafyi katrinafyi Apr 26, 2026

I'm not sure what you mean by jump. Do you mean because the foo-1-1 ID has two hyphens? Or, are you worried about seeing a "foo" next and skipping some numbers?

If it's the second case, the next "foo" will resume the sequence from 2 and no numbers will be skipped. The incrementing numbers are always appended onto the original base_id.

Edit: I should add that avoiding conflicts with generated suffixes is the most complicated part of this code. There are two ways to write it: one with complicated after-generation logic (this code) and one with complicated collision detection. I can try switching to the collision-detection version, which avoids this conditional insert (but introduces conditional queries).

Edit 2: the behaviour of headings that conflict with generated suffixes can also be seen in this test case https://github.com/rina-forks/lychee/blob/41362802c150490473c06155d0e11d2ccc3a2c6e/lychee-lib/src/extract/fragments.rs#L212
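
For readers following this thread, the behaviour discussed above (a duplicate "foo" becomes "foo-1", a later "foo 1" heading becomes "foo-1-1", and the next "foo" resumes at "foo-2") can be sketched as below. This is an illustration modelled on the loop-until-unused approach, not the code in this PR: `UniqueIds`, `counters`, and `next_id` are hypothetical names, and the PR's own bookkeeping (`next_suffixes`) is structured differently.

```rust
use std::collections::HashMap;

/// Hypothetical generator for illustration; not the type used in this PR.
#[derive(Default)]
struct UniqueIds {
    /// Maps every ID handed out so far to the last numeric suffix tried for it.
    counters: HashMap<String, u32>,
}

impl UniqueIds {
    /// Returns `base_id` if it is unused, otherwise `base_id-N` for the
    /// smallest untried N that does not collide with an earlier ID.
    fn next_id(&mut self, base_id: &str) -> String {
        let mut candidate = base_id.to_string();
        while self.counters.contains_key(&candidate) {
            // The loop is only entered if `base_id` itself is already recorded,
            // so this lookup cannot fail.
            let n = self
                .counters
                .get_mut(base_id)
                .expect("base_id is recorded once the loop is entered");
            *n += 1;
            candidate = format!("{}-{}", base_id, n);
        }
        // Record the generated ID too, so later headings cannot reuse it.
        self.counters.insert(candidate.clone(), 0);
        candidate
    }
}
```

Feeding the base IDs `foo`, `foo`, `foo-1`, `foo` through `next_id` in that order produces `foo`, `foo-1`, `foo-1-1`, `foo-2`. The counter is always appended to the original base ID, so no numbers are skipped in the `foo` sequence.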

Member Author

See if 4161267 is easier to follow.

Comment thread lychee-lib/src/extract/fragments.rs
Comment thread test-utils/src/lib.rs Outdated
 macro_rules! load_fixture {
     ($filename:expr) => {{
-        let path = fixtures_path!().join($filename);
+        let path = test_utils::fixtures_path!().join($filename);
Member

This change assumes that the crate using the macro has test_utils in its dependency tree under that specific name. It's not a big deal right now, but if a sub-crate imports the macro without having test_utils as a direct dependency (or renames it), this will fail to compile. Using $crate::fixtures_path!() is usually safer for macros intended to be exported.

I haven't tested it, but this should compile and work:

#[macro_export]
macro_rules! load_fixture {
    ($filename:expr) => {{
        // $crate ensures that we always point to the fixtures_path 
        // defined inside this specific crate.
        let path = $crate::fixtures_path!().join($filename);
        std::fs::read_to_string(path).unwrap()
    }};
}

Same for the other macro below.

Member

@mre mre left a comment

We can merge this. Added a few more comments, but they are quite nitpicky tbh. No blockers.

@katrinafyi katrinafyi changed the title from "fix: match github's markdown fragment generation" to "feat: match github's markdown fragment generation" May 4, 2026
Development

Successfully merging this pull request may close these issues.

Valid fragments w/ URL-encoded emoji false positive

2 participants