fix(pure-magic): relax try_csv to match libmagic semantics by tnaroska · Pull Request #36 · qjerome/magic-rs

tnaroska · 2026-05-23T07:29:07Z

tl;dr: this PR makes CSV recognition match libmagic more closely. The specific problem I observed is that pure-magic didn't recognize small CSV files (a few lines) and reported text/plain instead.

Apparently pure-magic required at least 10 lines of CSV to identify the file type, whereas the check in libmagic matches CSV with a minimum of two lines (1 header and 1 data).

Summary

try_csv (pure-magic/src/lib.rs) required at least 10 lines before reporting text/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain text. This PR relaxes the threshold to match upstream libmagic and adds regression tests.

Reproducer (before this PR)

$ cargo build --release -p wiza
$ printf 'a,b\n1,2\n3,4\n' > /tmp/t.csv
$ printf 'a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > /tmp/short.csv
$ ./target/release/wiza /tmp/t.csv /tmp/short.csv
/tmp/t.csv     source:hardcoded strength:0 mime:text/plain magic:ASCII text
/tmp/short.csv source:hardcoded strength:0 mime:text/plain magic:ASCII text

$ file /tmp/t.csv /tmp/short.csv
/tmp/t.csv:     CSV text
/tmp/short.csv: CSV text

After this PR both files report mime:text/csv.

Why the old threshold was wrong

Upstream file/file/src/is_csv.c::csv_parse accepts any input where tf > 1 && nl >= 2 (≥2 lines with consistent column count, ≥2 fields per row). CSV_LINES = 10 is only an early-exit cap in the upstream parser, not a minimum. The 10-record floor in try_csv produced false negatives that diverge from file(1).
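
The acceptance rule described above can be sketched in plain Rust. This is a simplified illustration of the rule as this PR states it, not a transcription of is_csv.c: `looks_like_csv` and `MAX_LINES` are hypothetical names, quoted fields are ignored, and fields are counted by splitting each line on commas.

```rust
/// Hypothetical cap on inspected lines. It plays the role this PR ascribes
/// to CSV_LINES: an early-exit limit, not a minimum.
const MAX_LINES: usize = 10;

/// Returns true when `data` has at least two lines, each with the same
/// number of comma-separated fields, and that number is at least two.
/// Simplification (assumption): quoted fields are not handled.
fn looks_like_csv(data: &str) -> bool {
    let mut fields_per_line: Option<usize> = None;
    let mut line_count = 0;

    // Stop after MAX_LINES lines; reaching the cap is not required.
    for line in data.lines().take(MAX_LINES) {
        let fields = line.split(',').count();
        match fields_per_line {
            // The first line fixes the expected field count.
            None => fields_per_line = Some(fields),
            // Any later line with a different count is a ragged row.
            Some(expected) if expected != fields => return false,
            Some(_) => {}
        }
        line_count += 1;
    }

    matches!(fields_per_line, Some(f) if f >= 2) && line_count >= 2
}
```

Under this sketch a two-line input such as `"a,b\n1,2\n"` is accepted, while a single line, single-column text, or rows with differing field counts are rejected.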

Upstream history confirms the policy: in 2023, file(1) explicitly loosened its detector with PR/463 (b4e621d1, "CSV can be also only 2 lines") — so even the conservative reference treats 10 as too strict.

What changed

- Threshold: n != 10 → n < 2. Reads records until EOF and accepts any input with ≥2 records and consistent column count, mirroring tf > 1 && nl >= 2 in is_csv.c.
- Header inference: switched from csv::Reader::from_reader(...) (which treats the first line as a header by default) to csv::ReaderBuilder::new().has_headers(false).from_reader(...). libmagic counts newlines, not data rows, so a 2-line a,b\n1,2\n must qualify; without this change, csv::Reader consumed a,b as a header and only saw one data record, leaving n = 1.
- Regression tests in the existing #[cfg(test)] mod tests block, using a tiny csv_magic() helper that reuses first_magic with a never-matching rule so only the hardcoded CSV detector is exercised:
  - test_csv_two_rows_two_cols: minimal 2-row CSV (the previously-failing case).
  - test_csv_short_consistent_rows: 5-row CSV.
  - test_csv_many_rows_still_detected: 12-row CSV (preserves the previously-passing case).
  - test_csv_single_field_rejected: tf > 1 boundary; multi-line single-column text must NOT be detected as CSV.
  - test_csv_ragged_columns_rejected: column-count consistency check.
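
The effect of header inference on the record count can be illustrated with a small standard-library stand-in. This does not use the csv crate; `data_rows` is a hypothetical helper that only mimics "skip the first line when headers are enabled", which is the behavior described in the header-inference item above.

```rust
/// Counts the rows left for the caller after optionally treating the
/// first line as a header. Hypothetical stand-in for a CSV reader's
/// header handling; it performs no real CSV parsing.
fn data_rows(input: &str, has_headers: bool) -> usize {
    // With headers enabled, the first line is consumed before any record is yielded.
    let skip = if has_headers { 1 } else { 0 };
    input.lines().skip(skip).count()
}
```

For the input `"a,b\n1,2\n"`, `data_rows(input, true)` gives 1 and `data_rows(input, false)` gives 2, which is why a two-line file can only reach the two-record threshold when every line is counted.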

Non-goals

Semicolon/tab/pipe separators. Upstream file(1) only auto-detects comma; this PR keeps that behavior.

Test plan

cargo test -p pure-magic — 139 passed (134 existing + 5 new), 9 doctests pass.
cargo clippy -p pure-magic --no-deps — clean.
Manual wiza parity check on 2-row, 5-row, 12-row, and ragged CSVs against system file(1).
Maintainer review.

🤖 Generated with Claude Code

The hardcoded CSV detector required exactly 10 records before reporting text/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain ASCII text.

Upstream libmagic's is_csv.c treats CSV_LINES as an early-exit cap, not a minimum, and accepts any input with `tf > 1 && nl >= 2`; file(1) itself loosened this in 2023 (PR/463 "CSV can be also only 2 lines").

Drop the 10-record floor: read records until EOF, require >=2 records with consistent column count. Disable csv::Reader's header inference since libmagic counts newlines (not data rows), so a 2-line "a,b\n1,2\n" must qualify.

Add five regression tests covering: 2-row positive, 5-row positive, 12-row positive (the previously-passing case), single-field reject, ragged-columns reject.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tnaroska

A few annotations on the change locations.

tnaroska · 2026-05-23T07:31:13Z

        let buf = haystack.read_range(0..FILE_BYTES_MAX as u64)?;
-        let mut reader = csv::Reader::from_reader(io::Cursor::new(buf));
+        let mut reader = csv::ReaderBuilder::new()
+            .has_headers(false)


has_headers(false) is the load-bearing bit for two-line CSVs. With the default true, csv::Reader consumes a,b as a header and records() then yields only one data row from a,b\n1,2\n — n stays at 1 and the n < 2 reject fires. libmagic counts newlines, not data rows, so we need every line in the count.

tnaroska · 2026-05-23T07:31:13Z

-                    return Ok(false);
-                }
-            } else {
+        for i in records {


Dropped .take(9) so we read until EOF / FILE_BYTES_MAX. Loop exits early on the first ragged row or parse error, so the cost is bounded by the buffer the haystack already provides.

tnaroska · 2026-05-23T07:31:13Z


-        // we need at least 10 lines
-        if n != 10 {
+        if n < 2 {


n < 2 matches upstream is_csv.c: return tf > 1 && nl >= 2. The first.len() <= 1 check just above is the tf > 1 half (≥2 fields per row), and this is the nl >= 2 half (≥2 records).

qjerome · 2026-05-26T07:51:17Z

        }

-        // we already parsed first line
        let mut n = 1;


This must be adjusted, because you don't parse the header anymore.

qjerome · 2026-05-26T07:56:50Z


-        // we need at least 10 lines
-        if n != 10 {
+        if n < 2 {


I don't see the code equivalent to tf > 1

qjerome · 2026-05-26T07:57:03Z

-                    return Ok(false);
-                }
-            } else {
+        for i in records {


If you remove .take up to 7MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, as we only care about the 2 first lines. This needs to be adjusted not to impact perfs on edge cases.
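
One possible way to reconcile the two-record minimum with a bounded amount of work is to keep a cap on the number of records inspected and start the counter at zero. The sketch below is an assumption about how that could look, not the code in this PR: `accept_bounded` is a hypothetical function that receives the field count of each record from any iterator, so it can be exercised without the csv crate.

```rust
/// Decides CSV-likeness from per-record field counts while pulling at most
/// `max_records` items from the iterator. Hypothetical helper.
fn accept_bounded<I>(field_counts: I, max_records: usize) -> bool
where
    I: Iterator<Item = usize>,
{
    let mut expected: Option<usize> = None;
    // No record has been consumed beforehand, so the count starts at zero.
    let mut n = 0;

    // `take` stops pulling from the underlying iterator once the cap is hit.
    for fields in field_counts.take(max_records) {
        match expected {
            // The first record must have at least two fields.
            None if fields < 2 => return false,
            None => expected = Some(fields),
            // Ragged row: stop immediately.
            Some(e) if e != fields => return false,
            Some(_) => {}
        }
        n += 1;
    }

    // Short inputs qualify as long as they have two consistent records.
    n >= 2
}
```

With a cap of 10, an arbitrarily long but consistent input is accepted after 10 records are read, and a two-record input is still accepted; the choice of cap is a trade-off between parity with file(1) and worst-case parsing cost.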

tnaroska wants to merge 1 commit into qjerome:main from tnaroska:fix/csv-min-rows
