fix(pure-magic): relax try_csv to match libmagic semantics#36
Conversation
The hardcoded CSV detector required exactly 10 records before reporting text/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain ASCII text. Upstream libmagic's is_csv.c treats CSV_LINES as an early-exit cap, not a minimum, and accepts any input with `tf > 1 && nl >= 2` — file(1) itself loosened this in 2023 (PR/463 "CSV can be also only 2 lines"). Drop the 10-record floor: read records until EOF, require >=2 records with consistent column count. Disable csv::Reader's header inference since libmagic counts newlines (not data rows), so a 2-line "a,b\n1,2\n" must qualify. Add five regression tests covering: 2-row positive, 5-row positive, 12-row positive (the previously-passing case), single-field reject, ragged-columns reject. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tnaroska
left a comment
There was a problem hiding this comment.
A few annotations on the change locations.
| let buf = haystack.read_range(0..FILE_BYTES_MAX as u64)?; | ||
| let mut reader = csv::Reader::from_reader(io::Cursor::new(buf)); | ||
| let mut reader = csv::ReaderBuilder::new() | ||
| .has_headers(false) |
There was a problem hiding this comment.
has_headers(false) is the load-bearing bit for two-line CSVs. With the default true, csv::Reader consumes a,b as a header and records() then yields only one data row from a,b\n1,2\n — n stays at 1 and the n < 2 reject fires. libmagic counts newlines, not data rows, so we need every line in the count.
| return Ok(false); | ||
| } | ||
| } else { | ||
| for i in records { |
There was a problem hiding this comment.
Dropped .take(9) so we read until EOF / FILE_BYTES_MAX. Loop exits early on the first ragged row or parse error, so the cost is bounded by the buffer the haystack already provides.
There was a problem hiding this comment.
If you remove .take up to 7MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, as we only care about the 2 first lines. This needs to be adjusted not to impact perfs on edge cases.
|
|
||
| // we need at least 10 lines | ||
| if n != 10 { | ||
| if n < 2 { |
There was a problem hiding this comment.
n < 2 matches upstream is_csv.c: return tf > 1 && nl >= 2. The first.len() <= 1 check just above is the tf > 1 half (≥2 fields per row), and this is the nl >= 2 half (≥2 records).
There was a problem hiding this comment.
I don't see the code equivalent to tf > 1
| } | ||
|
|
||
| // we already parsed first line | ||
| let mut n = 1; |
There was a problem hiding this comment.
This must be adjusted, because you don't parse the header anymore.
|
|
||
| // we need at least 10 lines | ||
| if n != 10 { | ||
| if n < 2 { |
There was a problem hiding this comment.
I don't see the code equivalent to tf > 1
| return Ok(false); | ||
| } | ||
| } else { | ||
| for i in records { |
There was a problem hiding this comment.
If you remove .take up to 7MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, as we only care about the 2 first lines. This needs to be adjusted not to impact perfs on edge cases.
tl;dr: this is to better match the CSV recognition in libmagic. Specific problem I observed with CSV files is that pure-magic didn't recognize small csv files (few lines) and reported
text/plaininstead.Apparently pure-magic required at least 10 lines of csv to identify the file type, whereas the check in libmagic matches csv with a minimum of two lines (1 header, and 1 data).
Summary
try_csv(pure-magic/src/lib.rs) required exactly 10 records before reportingtext/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain text. This PR relaxes the threshold to match upstream libmagic and adds regression tests.Reproducer (before this PR)
After this PR both files report
mime:text/csv.Why the old threshold was wrong
Upstream
file/file/src/is_csv.c::csv_parseaccepts any input wheretf > 1 && nl >= 2(≥2 lines with consistent column count, ≥2 fields per row).CSV_LINES = 10is only an early-exit cap in the upstream parser, not a minimum. The 10-record floor intry_csvproduced false negatives that diverge fromfile(1).Upstream history confirms the policy: in 2023, file(1) explicitly loosened its detector with PR/463 (
b4e621d1, "CSV can be also only 2 lines") — so even the conservative reference treats 10 as too strict.What changed
n != 10→n < 2. Reads records until EOF and accepts any input with ≥2 records and consistent column count, mirroringtf > 1 && nl >= 2inis_csv.c.csv::Reader::from_reader(...)(which treats the first line as a header by default) tocsv::ReaderBuilder::new().has_headers(false).from_reader(...). libmagic counts newlines, not data rows, so a 2-linea,b\n1,2\nmust qualify; without this change,csv::Readerconsumeda,bas a header and only saw one data record, leavingn = 1.#[cfg(test)] mod testsblock, using a tinycsv_magic()helper that reusesfirst_magicwith a never-matching rule so only the hardcoded CSV detector is exercised:test_csv_two_rows_two_cols— minimal 2-row CSV (the previously-failing case).test_csv_short_consistent_rows— 5-row CSV.test_csv_many_rows_still_detected— 12-row CSV (preserves the previously-passing case).test_csv_single_field_rejected—tf > 1boundary; multi-line single-column text must NOT be detected as CSV.test_csv_ragged_columns_rejected— column-count consistency check.Non-goals
file(1)only auto-detects comma; this PR keeps that behavior.Test plan
cargo test -p pure-magic— 139 passed (134 existing + 5 new), 9 doctests pass.cargo clippy -p pure-magic --no-deps— clean.wizaparity check on 2-row, 5-row, 12-row, and ragged CSVs against systemfile(1).🤖 Generated with Claude Code