fix(pure-magic): relax try_csv to match libmagic semantics #36

Draft
tnaroska wants to merge 1 commit into qjerome:main from tnaroska:fix/csv-min-rows

Contributor

@tnaroska tnaroska commented May 23, 2026

tl;dr: this brings CSV recognition closer to libmagic. The specific problem I observed is that pure-magic did not recognize small CSV files (a few lines) and reported text/plain instead.

Apparently pure-magic required at least 10 lines of CSV to identify the file type, whereas the check in libmagic matches CSV with a minimum of two lines (1 header line and 1 data line).

Summary

try_csv (pure-magic/src/lib.rs) required at least 10 records before reporting text/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain text. This PR relaxes the threshold to match upstream libmagic and adds regression tests.

Reproducer (before this PR)

$ cargo build --release -p wiza
$ printf 'a,b\n1,2\n3,4\n' > /tmp/t.csv
$ printf 'a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > /tmp/short.csv
$ ./target/release/wiza /tmp/t.csv /tmp/short.csv
/tmp/t.csv     source:hardcoded strength:0 mime:text/plain magic:ASCII text
/tmp/short.csv source:hardcoded strength:0 mime:text/plain magic:ASCII text

$ file /tmp/t.csv /tmp/short.csv
/tmp/t.csv:     CSV text
/tmp/short.csv: CSV text

After this PR both files report mime:text/csv.

Why the old threshold was wrong

Upstream file/file/src/is_csv.c::csv_parse accepts any input where tf > 1 && nl >= 2 (≥2 lines with consistent column count, ≥2 fields per row). CSV_LINES = 10 is only an early-exit cap in the upstream parser, not a minimum. The 10-record floor in try_csv produced false negatives that diverge from file(1).

Upstream history confirms the policy: in 2023, file(1) explicitly loosened its detector with PR/463 (b4e621d1, "CSV can be also only 2 lines") — so even the conservative reference treats 10 as too strict.
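
For readers who want the rule in executable form, here is a minimal std-only sketch of the acceptance policy described above: count newlines, require a consistent field count of at least two, and treat the line cap as an early exit rather than a minimum. It is not a port of is_csv.c and not the code in this PR. The names looks_like_csv and MAX_LINES are hypothetical, and the quote handling is a simplification.

```rust
/// Early-exit cap on the number of lines inspected. Assumption: the value 10
/// is taken from the PR description of CSV_LINES; it is a cap, not a minimum.
const MAX_LINES: usize = 10;

/// Hypothetical helper: returns true when `data` holds at least two
/// newline-terminated lines that all have the same number (>= 2) of
/// comma-separated fields. Commas and newlines inside double quotes are ignored.
fn looks_like_csv(data: &[u8]) -> bool {
    let mut in_quotes = false;
    let mut fields_in_line = 1usize; // a line without commas has one field
    let mut expected_fields = 0usize; // 0 means "not fixed yet"
    let mut lines = 0usize;

    for &b in data {
        match b {
            // Toggle quoted state; an escaped quote ("") toggles twice.
            b'"' => in_quotes = !in_quotes,
            b',' if !in_quotes => fields_in_line += 1,
            b'\n' if !in_quotes => {
                lines += 1;
                if expected_fields == 0 {
                    // The first line fixes the field count; one field is not CSV.
                    if fields_in_line < 2 {
                        return false;
                    }
                    expected_fields = fields_in_line;
                } else if fields_in_line != expected_fields {
                    // Ragged row: reject immediately.
                    return false;
                }
                if lines == MAX_LINES {
                    // Enough consistent lines seen; stop reading.
                    return true;
                }
                fields_in_line = 1;
            }
            _ => {}
        }
    }
    expected_fields >= 2 && lines >= 2
}
```

Under these assumptions, looks_like_csv(b"a,b\n1,2\n") is accepted, while a single line, single-column text, and ragged rows are rejected.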

What changed

  • Threshold: n != 10 → n < 2. Reads records until EOF and accepts any input with ≥2 records and consistent column count, mirroring tf > 1 && nl >= 2 in is_csv.c.
  • Header inference: switched from csv::Reader::from_reader(...) (which treats the first line as a header by default) to csv::ReaderBuilder::new().has_headers(false).from_reader(...). libmagic counts newlines, not data rows, so a 2-line a,b\n1,2\n must qualify; without this change, csv::Reader consumed a,b as a header and only saw one data record, leaving n = 1.
  • Regression tests in the existing #[cfg(test)] mod tests block, using a tiny csv_magic() helper that reuses first_magic with a never-matching rule so only the hardcoded CSV detector is exercised:
    • test_csv_two_rows_two_cols — minimal 2-row CSV (the previously-failing case).
    • test_csv_short_consistent_rows — 5-row CSV.
    • test_csv_many_rows_still_detected — 12-row CSV (preserves the previously-passing case).
    • test_csv_single_field_rejected: tf > 1 boundary; multi-line single-column text must NOT be detected as CSV.
    • test_csv_ragged_columns_rejected — column-count consistency check.
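
The header-inference point comes down to simple counting, which the following sketch makes explicit. It is a std-only simulation, not the csv crate API: count_records is a hypothetical stand-in that sets the first line aside when has_headers is true, which is the behavior the bullet above attributes to the default reader.

```rust
/// Hypothetical stand-in for a CSV reader's record count (NOT the csv crate):
/// counts non-empty lines and, when `has_headers` is true, sets the first
/// line aside as a header instead of counting it as a record.
fn count_records(input: &str, has_headers: bool) -> usize {
    let lines = input.lines().filter(|l| !l.is_empty()).count();
    if has_headers {
        // The header line is consumed and never reaches the record counter.
        lines.saturating_sub(1)
    } else {
        lines
    }
}

/// The relaxed threshold from this PR, applied to the simulated count.
fn passes_threshold(input: &str, has_headers: bool) -> bool {
    count_records(input, has_headers) >= 2
}
```

With the two-line input "a,b\n1,2\n", the simulated count is 1 when the header is consumed and 2 when it is not, so only the second configuration passes an n >= 2 check.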

Non-goals

  • Semicolon/tab/pipe separators. Upstream file(1) only auto-detects comma; this PR keeps that behavior.

Test plan

  • cargo test -p pure-magic — 139 passed (134 existing + 5 new), 9 doctests pass.
  • cargo clippy -p pure-magic --no-deps — clean.
  • Manual wiza parity check on 2-row, 5-row, 12-row, and ragged CSVs against system file(1).
  • Maintainer review.

🤖 Generated with Claude Code

The hardcoded CSV detector required exactly 10 records before reporting
text/csv, so typical small CSVs (configs, fixtures, short exports) were
silently classified as plain ASCII text. Upstream libmagic's is_csv.c
treats CSV_LINES as an early-exit cap, not a minimum, and accepts any
input with `tf > 1 && nl >= 2` — file(1) itself loosened this in 2023
(PR/463 "CSV can be also only 2 lines").

Drop the 10-record floor: read records until EOF, require >=2 records
with consistent column count. Disable csv::Reader's header inference
since libmagic counts newlines (not data rows), so a 2-line
"a,b\n1,2\n" must qualify.

Add five regression tests covering: 2-row positive, 5-row positive,
12-row positive (the previously-passing case), single-field reject,
ragged-columns reject.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Contributor Author

@tnaroska tnaroska left a comment

A few annotations on the change locations.

Comment thread pure-magic/src/lib.rs
  let buf = haystack.read_range(0..FILE_BYTES_MAX as u64)?;
- let mut reader = csv::Reader::from_reader(io::Cursor::new(buf));
+ let mut reader = csv::ReaderBuilder::new()
+     .has_headers(false)
Contributor Author

has_headers(false) is the load-bearing bit for two-line CSVs. With the default true, csv::Reader consumes a,b as a header and records() then yields only one data row from a,b\n1,2\n, so n stays at 1 and the n < 2 reject fires. libmagic counts newlines, not data rows, so we need every line in the count.

Comment thread pure-magic/src/lib.rs
return Ok(false);
}
} else {
for i in records {
Contributor Author

Dropped .take(9) so we read until EOF / FILE_BYTES_MAX. Loop exits early on the first ragged row or parse error, so the cost is bounded by the buffer the haystack already provides.

Owner

If you remove .take, up to 7 MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, since we only care about the first 2 lines. This needs to be adjusted so it does not hurt performance in edge cases.
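
One possible way to keep the scan bounded is to cap the iterator itself, as in the sketch below. It is std-only and illustrative: bounded_csv_check and its max_lines parameter are hypothetical names, the comma split ignores quoting, and it is not the code under review. It returns how many lines were examined so the cost is visible.

```rust
/// Hypothetical bounded check: inspects at most `max_lines` lines and
/// returns (verdict, lines_examined). Fields are split naively on commas,
/// so quoted commas are NOT handled in this sketch.
fn bounded_csv_check(input: &str, max_lines: usize) -> (bool, usize) {
    let mut expected: Option<usize> = None;
    let mut examined = 0usize;

    // `take` stops the lazy line iterator after `max_lines` items, so the
    // remainder of the buffer is never split or counted.
    for line in input.lines().take(max_lines) {
        examined += 1;
        let fields = line.split(',').count();
        match expected {
            // The first line must have at least two fields.
            None if fields < 2 => return (false, examined),
            None => expected = Some(fields),
            // Any later line with a different field count is ragged.
            Some(n) if n != fields => return (false, examined),
            Some(_) => {}
        }
    }
    (expected.is_some() && examined >= 2, examined)
}
```

With max_lines set to 2, a buffer of 100,000 consistent rows is accepted after examining only two lines; a larger cap trades more work for a stricter consistency check.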

Comment thread pure-magic/src/lib.rs

  // we need at least 10 lines
- if n != 10 {
+ if n < 2 {
Contributor Author

n < 2 matches upstream is_csv.c: return tf > 1 && nl >= 2. The first.len() <= 1 check just above is the tf > 1 half (≥2 fields per row), and this is the nl >= 2 half (≥2 records).

Owner

I don't see the code equivalent to tf > 1

Comment thread pure-magic/src/lib.rs
}

// we already parsed first line
let mut n = 1;
Owner

This must be adjusted, because you don't parse the header anymore.

