feat: match github's markdown fragment generation#2153
feat: match github's markdown fragment generation#2153katrinafyi wants to merge 23 commits intolycheeverse:masterfrom
Conversation
so, rust `to_lowercase` actually does *more* transformations than github does which leads to differences in certain weird cases. for example, SpecialCasing.txt says greek `Σ` should lowercase to `σ` in most cases but `ς` when it ends a word. rust does this but github does not.
it could be one-copy, but .to_lowercase() is probably much faster on strings than individual characters
| if this_suffix.is_some() { | ||
| self.next_suffixes.insert(candidate.clone(), ONE); | ||
| } |
There was a problem hiding this comment.
Can you help me understand this logic?
If base_id is "foo" and it's seen for the second time, candidate becomes "foo-1". You then insert "foo-1" -> 1 into the map. If the very next heading in the document is actually titled "# foo 1", the generator will:
- Generate base_id = "foo-1".
- See that "foo-1" is already in the map.
- Increment it and produce "foo-1-1".
Doesn't this create a "jump" in numbering if headings naturally collide with generated suffixes? Would that be an issue?
There was a problem hiding this comment.
I'm not sure what you mean by jump. Do you mean because the foo-1-1 ID has two hyphens? Or, are you worried about seeing a "foo" next and skipping some numbers?
If it's the second case, the next "foo" will resume the sequence from 2 and no numbers will be skipped. The incrementing numbers are always appended onto the original base_id.
Edit: I should add that avoiding conflicts with generated suffixes is the most complicated part of this code. There's two ways to write this code, one that has complicated after-generarion logic (this code) and one that has complicated collision detection. I can try changing to the complicated collision detection version which avoids this conditional insert (but introduces conditional queries).
Edit 2: the behaviour of headings that conflict with generated suffixes can also be seen in this test case https://github.com/rina-forks/lychee/blob/41362802c150490473c06155d0e11d2ccc3a2c6e/lychee-lib/src/extract/fragments.rs#L212
| macro_rules! load_fixture { | ||
| ($filename:expr) => {{ | ||
| let path = fixtures_path!().join($filename); | ||
| let path = test_utils::fixtures_path!().join($filename); |
There was a problem hiding this comment.
This change assumes that the crate using the macro has test_utils in its dependency tree under that specific name. It's not a big deal right now but if a sub-crate imports the macro but doesn't have test_utils as a direct dependency (or renames it), this will fail to compile. Using $crate::fixtures_path!() is usually safer for macros intended to be exported.
I haven't tested it, but this should compile and work:
#[macro_export]
macro_rules! load_fixture {
($filename:expr) => {{
// $crate ensures that we always point to the fixtures_path
// defined inside this specific crate.
let path = $crate::fixtures_path!().join($filename);
std::fs::read_to_string(path).unwrap()
}};
}Same for the other macro below.
mre
left a comment
There was a problem hiding this comment.
We can merge this. Added a few more comments, but they are quite nitpicky tbh. No blockers.
This copies the approach of Flet/github-slugger#56 for deriving fragment identifiers from markdown headings.
Basically, it deletes characters from certain Unicode categories, then it lowercases and replaces spaces with
-. and finally applies a disambiguation with numbers.To check the test cases, you can see this gist or create your own markdown file on github.
Fixes #2112. This can be verified with