Fix six ReadTntCharacters parse failures on real-world TNT files by ms609 · Pull Request #268 · ms609/TreeTools

ms609 · 2026-05-15T08:57:15Z

Summary

Five commits fix six documented ReadTntCharacters() failure modes encountered when parsing the Goloboff et al. (2019, Syst. Biol.) corpus of 13 morphological matrices. Empirically: before these fixes, 8/13 matrices parsed; after, 11/13 parse cleanly and 2 more (characidae, dromaeodat) return data but with a residual TAXA-MISMATCH warning (a follow-up branch addresses those).

| Commit | Bug | Root cause | Fix |
|---|---|---|---|
| `25160c70` | `invalid substring arguments` on files with multi-line TNT comments (dinosaurs, wasps) | Inner lines of multi-line `'...'` comments not blanked; their `;` corrupted `xreadEnd` | Blank `lines[innerStart:innerEnd]` for each open/close comment pair |
| `ccc572c4` | `max returning -Inf` on files using `&` block-separator (beetles) | Bare `&` lines survived into `matrixLines`; first line matched no taxon pattern → empty `taxonLineNumber` | Strip `^&\s*$` lines from `matrixLines` unconditionally |
| `af1ccb0a` | `max returning -Inf` on multi-line-taxon files (characidae, dionychans) | Taxon name on its own line didn't match `name + space + data` pattern | Add secondary `nameOnly.pattern` in `ExtractTaxa`; emit `""` token contribution |
| `6fa92528` | Returns `NULL` on dromaeodat-style files | `^XREAD\b` anchor failed when `xread` appeared mid-line after other `;`-separated directives | Relax grep to `\bXREAD\b`; strip pre-`xread` portion from the matching line |
| `adc0bfd4` | Taxon names contain `_@Family_Genus_...` garbage | TNT `taxonomy=;` directive appends `@taxonomy` suffix to names | Strip `@\S*$` and trailing `_+` from `taxa` after name extraction |

Test plan

- 5 new test fixtures under inst/extdata/tests/; one new test block per failure mode in test-ReadTntTree.R
- devtools::load_all() + testthat::test_file('tests/testthat/test-ReadTntTree.R'): 42 dots, no regressions
- Empirical re-parse against the actual Goloboff corpus (13 .tnt files): 11 clean parses, 2 partial (residual TAXA-MISMATCH on characidae, dromaeodat; addressed in follow-up)
- Existing test-parsers.R untouched, still passing
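
The clean / partial split reported above could be tallied with a small harness along these lines. This is only a sketch: `ClassifyParse()` and `fakeParser()` are hypothetical names, and the stand-in parser merely imitates the three outcomes (clean return, return with a warning, error or `NULL`) rather than calling `ReadTntCharacters()`.

```r
# Hypothetical harness: classify one parse attempt as clean, partial or failed.
ClassifyParse <- function(path, parser) {
  warned <- FALSE
  result <- tryCatch(
    withCallingHandlers(
      parser(path),
      warning = function(w) {
        warned <<- TRUE                  # remember that a warning was raised
        invokeRestart("muffleWarning")   # then let parsing continue
      }
    ),
    error = function(e) NULL             # treat an error like a NULL return
  )
  if (is.null(result)) "failed" else if (warned) "partial" else "clean"
}

# Stand-in parser (an assumption for illustration, not the real function).
fakeParser <- function(path) {
  switch(path,
    clean.tnt = "matrix",
    partial.tnt = {
      warning("TAXA-MISMATCH")
      "matrix"
    },
    stop("invalid substring arguments")
  )
}

files <- c("clean.tnt", "partial.tnt", "broken.tnt")
outcome <- vapply(files, ClassifyParse, character(1), parser = fakeParser)
print(table(outcome))
```

Swapping `fakeParser` for the real parser and `files` for the corpus paths would give the same kind of tally as the test plan reports.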

🤖 Generated with Claude Code

Blank inner lines between open/close multi-line comment markers so that semicolons and text within comments no longer corrupt xreadEnd detection or appear as spurious matrix content (fixes dinosaurs- and wasps-style failures). Guarded with innerStart <= innerEnd to avoid descending seq on adjacent-line comments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
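
A minimal sketch of the blanking step and its guard, assuming the open and close marker line indices have already been located. How the PR finds them is not shown here, and `BlankCommentInteriors()` is a hypothetical name.

```r
# Hypothetical helper: blank the lines strictly between each pair of
# opening and closing comment-marker lines.
BlankCommentInteriors <- function(lines, open, close) {
  for (i in seq_along(open)) {
    innerStart <- open[[i]] + 1L
    innerEnd <- close[[i]] - 1L
    # For markers on adjacent lines innerStart > innerEnd; without the
    # guard, innerStart:innerEnd would count downwards and blank the
    # marker lines themselves.
    if (innerStart <= innerEnd) {
      lines[innerStart:innerEnd] <- ""
    }
  }
  lines
}

lines <- c(
  "xread",
  "'A comment that",
  "mentions 3 4; and more",
  "ends here'",
  "3 4",
  "'short",
  "note'"
)
blanked <- BlankCommentInteriors(lines, open = c(2L, 6L), close = c(4L, 7L))
print(blanked)
```

Only line 3, the inner line carrying the stray `;`, is blanked; the adjacent pair on lines 6 and 7 is left untouched by the guard.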

Bare & lines (used as block separators in multi-segment TNT files like beetles.tnt) were not removed before ExtractTaxa, causing max(integer(0)) = -Inf and a vapply type error. Remove them unconditionally before ctypeLines processing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
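
The failure and the fix can be reproduced in isolation. The `matrixLines` content below is invented for illustration; only the `^&\s*$` pattern comes from the PR.

```r
# Invented two-block matrix with bare "&" separator lines.
matrixLines <- c("&", "Taxon_a 0101", "Taxon_b 1100", "& ", "Taxon_a 22", "Taxon_b 11")

# With no taxon line found, the parser ends up taking max() of an empty
# integer vector, which yields -Inf (and a warning).
emptyMax <- suppressWarnings(max(integer(0)))

# The fix: drop lines consisting solely of "&" and optional whitespace.
cleaned <- matrixLines[!grepl("^&\\s*$", matrixLines, perl = TRUE)]
print(cleaned)
```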

TNT files from e.g. characidae/dionychans place the taxon name alone on its own line, with character data on subsequent lines. Introduce a secondary nameOnly.pattern that recognises single-token lines starting with a letter as taxon starts, and explicitly set their token contribution to empty so data lines are correctly concatenated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
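
The idea can be sketched as follows. Both regular expressions are simplified stand-ins written for this example (the real `nameOnly.pattern` is not reproduced here), and the input lines are invented.

```r
# Invented input: Taxon_a and Taxon_c sit alone on their lines.
lines <- c("Taxon_a", "0101?", "1100-", "Taxon_b 0011?1100-", "Taxon_c", "11110000??")

nameData <- "^\\s*(\\S+)\\s+(\\S.*)$"   # name + space + data
nameOnly <- "^\\s*([A-Za-z]\\S*)\\s*$"  # assumed: single token starting with a letter

isNameData <- grepl(nameData, lines, perl = TRUE)
isNameOnly <- !isNameData & grepl(nameOnly, lines, perl = TRUE)
taxonStart <- which(isNameData | isNameOnly)

taxa <- sub("^\\s*(\\S+).*$", "\\1", lines[taxonStart], perl = TRUE)

# A name-only line contributes an empty string; a name + data line
# contributes its data portion.
firstPart <- ifelse(
  isNameOnly[taxonStart],
  "",
  sub(nameData, "\\2", lines[taxonStart], perl = TRUE)
)

# Each taxon owns the lines up to (not including) the next taxon start.
blockEnd <- c(taxonStart[-1] - 1L, length(lines))
tokens <- vapply(seq_along(taxonStart), function(i) {
  continuation <- if (taxonStart[[i]] < blockEnd[[i]]) {
    lines[(taxonStart[[i]] + 1L):blockEnd[[i]]]
  } else {
    character(0)
  }
  paste0(c(firstPart[[i]], continuation), collapse = "")
}, character(1))
names(tokens) <- taxa
print(tokens)
```

Because the name-only line adds nothing to the token string, each taxon ends up with exactly the characters from its data lines, in order.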

TNT files from dromaeodat-style datasets emit all directives on a single semicolon-separated line (e.g. 'piwe=; mxr 100 ; ... ; xread 853 164'). Relax the grep anchor from ^XREAD to \bXREAD\b and strip the pre-xread portion of the line so dimension parsing and matrix extraction work correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
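
A short sketch of why the anchored pattern misses such a line and how the relaxed one recovers the dimensions. The directive line is a shortened version of the example in the commit message, and the upper-casing and `sub()` call are assumptions about how the matching might be done, not the PR's code.

```r
lines <- c("piwe=; mxr 100 ; xread 853 164", "Taxon_a 0101")
upperLines <- toupper(lines)

anchored <- grep("^XREAD\\b", upperLines, perl = TRUE)   # misses the mid-line xread
relaxed <- grep("\\bXREAD\\b", upperLines, perl = TRUE)  # finds it

# Drop everything before the first "xread" on the matching line.
lines[relaxed] <- sub("^.*?\\b(xread)\\b", "\\1", lines[relaxed],
                      ignore.case = TRUE, perl = TRUE)

# The two numbers following xread can now be read off directly.
dims <- as.integer(strsplit(
  sub("^xread\\s+", "", lines[[relaxed[[1]]]], ignore.case = TRUE, perl = TRUE),
  "\\s+"
)[[1]])
print(dims)
```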

TNT files with 'taxonomy=;' append '@Family_Genus_...' to each taxon name. Strip the '@...' suffix (and any trailing underscores) in ExtractTaxa so that rownames reflect the plain taxon name only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
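
Applied to a few invented names, the two substitutions listed in the summary table (`@\S*$`, then trailing `_+`) behave like this:

```r
# Invented taxon names, with and without a taxonomy suffix.
taxa <- c("Taxon_a_@Family_Genus_species", "Taxon_b@Clade", "Taxon_c")

withoutSuffix <- sub("@\\S*$", "", taxa, perl = TRUE)    # drop "@..." to end of name
cleaned <- sub("_+$", "", withoutSuffix, perl = TRUE)    # drop trailing underscores
print(cleaned)
```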

codecov · 2026-05-15T09:08:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.07%. Comparing base (4e5af7a) to head (adc0bfd).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
+ Coverage   96.06%   96.07%   +0.01%     
==========================================
  Files          80       80              
  Lines        5890     5905      +15     
==========================================
+ Hits         5658     5673      +15     
  Misses        232      232

github-actions · 2026-05-15T09:11:20Z

Performance benchmark results

| Call | Status | Change | Time (ms) |
|---|---|---|---|
| `as.Splits(bigTrees)` | ⚪ NSD | -5.28% | 22.5 → 24, 23.5 |
| `as.Splits(someTrees)` | ⚪ NSD | -0.32% | 11.1 → 11, 11.2 |
| `Consensus(forest1k.888, check = FALSE)` | ⚪ NSD | -1.91% | 101 → 98, 107 |
| `Consensus(forest201.80, check = FALSE)` | ⚪ NSD | -4.74% | 4.05 → 4.01, 4.46 |
| `Consensus(forest21.260, 0.5, FALSE)` | ⚪ NSD | -0.16% | 1.25 → 1.24, 1.27 |
| `Consensus(forest21.260)` | ⚪ NSD | -0.86% | 1.26 → 1.25, 1.29 |
| `Consensus(forestMaj, 0.5, FALSE)` | ⚪ NSD | -0.46% | 2.99 → 2.93, 3.07 |
| `DropTip(tr2000, 5)` | ⚪ NSD | -0.99% | 16.6 → 16.3, 17.6 |
| `DropTip(tr80, 5)` | ⚪ NSD | -1.73% | 0.102 → 0.102, 0.105 |
| `DropTip(unlen2k, 5)` | ⚪ NSD | 2.61% | 0.211 → 0.202, 0.208 |
| `DropTip(unlen80, 5)` | ⚪ NSD | -0.1% | 0.0399 → 0.0396, 0.0402 |
| `lapply(bigSplits, as.phylo)` | ⚪ NSD | 0.63% | 29.5 → 29.4, 29.4 |
| `lapply(someSplits, as.phylo)` | ⚪ NSD | 0.35% | 13.9 → 13.5, 13.9 |
| `PathLengths(tr2000, full = TRUE)` | ⚪ NSD | -0.65% | 15.7 → 15.7, 16.2 |
| `PathLengths(tr80, full = TRUE)` | ⚪ NSD | -0.51% | 0.102 → 0.102, 0.104 |
| `PathLengths(tr80Unif, full = TRUE)` | ⚪ NSD | -1.32% | 0.104 → 0.104, 0.106 |
| `RootTree(tr2000, 5)` | ⚪ NSD | -1.95% | 0.389 → 0.391, 0.402 |
| `RootTree(tr80, c("t3", "t36"))` | ⚪ NSD | -2% | 0.0704 → 0.0707, 0.0728 |
| `RootTree(tr80, "t3")` | ⚪ NSD | -2.53% | 0.0498 → 0.0506, 0.0516 |
| `RootTree(tr80, "t30")` | ⚪ NSD | -2.37% | 0.0498 → 0.051, 0.0511 |
| `RootTree(unlen2k, 5)` | ⚪ NSD | 0.61% | 0.344 → 0.344, 0.336 |
| `RootTree(unlen80, c("t3", "t36"))` | ⚪ NSD | -1.29% | 0.0655 → 0.0658, 0.0669 |
| `RootTree(unlen80, "t3")` | ⚪ NSD | -1.17% | 0.0438 → 0.0445, 0.0439 |
| `RootTree(unlen80, "t30")` | ⚪ NSD | -0.77% | 0.0442 → 0.0447, 0.0443 |
| `TreeDist::RobinsonFoulds(forest201.80)` | ⚪ NSD | -5.4% | 16.2 → 15.9, 17.5 |
| `TreeDist::RobinsonFoulds(forest21.888)` | ⚪ NSD | -4.81% | 3.4 → 3.43, 3.6 |
| `TreeTools:::path_lengths(tr80$edge, tr80$edge.length, FALSE)` | ⚪ NSD | -0.39% | 0.0933 → 0.0929, 0.0943 |
| `TreeTools:::postorder_order(bal40)` | ⚪ NSD | 1.79% | 0.00169 → 0.00166, 0.00165 |
| `TreeTools:::postorder_order(bal40k)` | ⚪ NSD | -1.4% | 0.475 → 0.478, 0.489 |
| `TreeTools:::postorder_order(dbal40)` | ⚪ NSD | 2.28% | 0.00175 → 0.00171, 0.0017 |
| `TreeTools:::postorder_order(dbal40k)` | ⚪ NSD | 0.13% | 2.04 → 2.02, 2.05 |
| `TreeTools:::postorder_order(dpec40)` | ⚪ NSD | 1.18% | 0.00255 → 0.00252, 0.00253 |
| `TreeTools:::postorder_order(dpec40k)` | ⚪ NSD | 0.02% | 3300 → 3290, 3300 |
| `TreeTools:::postorder_order(drnd80)` | ⚪ NSD | 0% | 0.00405 → 0.00405, 0.00405 |
| `TreeTools:::postorder_order(nbal40)` | ⚪ NSD | 1.06% | 0.00208 → 0.00204, 0.00206 |
| `TreeTools:::postorder_order(nbal40k)` | ⚪ NSD | -1.06% | 2.17 → 2.18, 2.21 |
| `TreeTools:::postorder_order(npec40)` | ⚪ NSD | 0.38% | 0.00284 → 0.00283, 0.00282 |
| `TreeTools:::postorder_order(npec40k)` | ⚪ NSD | -0.29% | 3310 → 3320, 3330 |
| `TreeTools:::postorder_order(nrnd80)` | ⚪ NSD | -0.87% | 0.00457 → 0.00462, 0.00458 |
| `TreeTools:::postorder_order(pec40)` | ⚪ NSD | 2.39% | 0.00167 → 0.00163, 0.00163 |
| `TreeTools:::postorder_order(pec40k)` | ⚪ NSD | -26.57% | 0.425 → 0.536, 0.543 |
| `TreeTools:::postorder_order(rnd80)` | ⚪ NSD | 0.94% | 0.00212 → 0.0021, 0.00211 |

.UTFLines() was falling back to latin1 when UTF-8 decoding failed. Latin1 maps bytes 0x80-0x9F to C1 control characters (U+0080-U+009F), while Windows-1252 maps them to printable characters (e.g. 0x91/0x92 become U+2018/U+2019 curly quotes). The nameOnly.pattern uses \p{Pi} which matches U+2018 but not U+0091, so smart-quote taxon names in Windows-1252 files (e.g. characidae.tnt, 4 taxa) were silently dropped. Change fallback encoding to cp1252, add test fixture and regression test. Fixes #268 (characidae.tnt: 156 → 160 taxa) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
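
The difference between the two fallbacks can be seen by decoding the same bytes both ways. The byte sequence is made up for this example, and it assumes the platform's `iconv` recognises the name `"CP1252"`.

```r
# Made-up bytes: 0x91 "Taxon" 0x92, i.e. a name wrapped in Windows-1252
# curly single quotes.
bytes <- as.raw(c(0x91, 0x54, 0x61, 0x78, 0x6f, 0x6e, 0x92))
encoded <- rawToChar(bytes)

asLatin1 <- iconv(encoded, from = "latin1", to = "UTF-8")
asCp1252 <- iconv(encoded, from = "CP1252", to = "UTF-8")

# Code point of the first character under each decoding.
latin1First <- utf8ToInt(asLatin1)[[1]]   # C1 control character U+0091
cp1252First <- utf8ToInt(asCp1252)[[1]]   # left single quotation mark U+2018

# \p{Pi} (initial quote punctuation) is expected to match only the
# cp1252 decoding, since U+0091 is a control character.
matchesLatin1 <- grepl("^\\p{Pi}", asLatin1, perl = TRUE)
matchesCp1252 <- grepl("^\\p{Pi}", asCp1252, perl = TRUE)
```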

Some TNT files (e.g. dromaeodat.tnt) write 2–3 taxa on one physical line with no whitespace separator: character data ends and the next taxon name begins immediately (e.g. `???Taxon_b@clade`). The parser only captured the first taxon on each such line, leaving 13 taxa missing from dromaeodat.tnt (151 parsed vs. 164 declared). Add a pre-processing step in ReadTntCharacters() that detects these concatenated lines via a zero-width lookbehind/lookahead split and breaks them into individual taxon lines before passing to ExtractTaxa. Fixes #268 (dromaeodat.tnt: 151 → 164 taxa) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
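
One way to sketch such a split is shown below. The pattern is an assumption made for this example (the PR's actual regular expression is not reproduced here): it treats a boundary between a data character (`0-9`, `?`, `-`) and a following letter as the start of a new taxon, which would misfire on names that contain a digit followed by a letter.

```r
# Invented line holding three taxa with no separator between them.
physicalLine <- "Taxon_a@clade 01?1???Taxon_b@clade 1100?-1Taxon_c@clade 0011010"

# Zero-width boundary: data character behind, letter ahead (assumed pattern).
boundary <- "(?<=[0-9?\\-])(?=[A-Za-z])"

# Insert a newline at each boundary, then split on it. This avoids relying
# on how strsplit() itself treats zero-length matches.
marked <- gsub(boundary, "\n", physicalLine, perl = TRUE)
taxonLines <- unlist(strsplit(marked, "\n", fixed = TRUE))
print(taxonLines)
```

Each resulting element is then an ordinary `name + space + data` line that the existing taxon extraction can handle.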

Resolve conflicts from PR #268 (TNT parser fixes) landing on main while this branch was open:
- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references include both NexusTokensToInteger() (this branch) and the TNT additions from main.

R/parse_files.R auto-merged cleanly. Full devtools::test() green.

* Harden NexusTokens for Cingulata-style polymorphism with internal whitespace

  No parser fix was needed: the existing gsub(" ", "", ...) at parse_files.R:88 already strips internal whitespace from polymorphism tokens before NexusTokens() sees them, so (1 2) -> (12) and {0 1} -> {01} already worked correctly.

  Added:
  - Regression test covering (1 2) / {0 1} polymorphism and multi-line matrix continuation in test-parsers.R.
  - NexusTokensToInteger(): a new exported helper that converts the character matrix from ReadCharacters() to integer, mapping polymorphic/ambiguous/?/- tokens to NA_integer_ by default, or extracting the first/last state digit under polymorphism = "first"/"last". Tests included.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* redoc w/ roxygen8
* Complete impl
* Merge origin/main into fix-polymorphism-whitespace

  Resolve conflicts from PR #268 (TNT parser fixes) landing on main while this branch was open:
  - DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate Config/roxygen2/version line introduced by the merge.
  - NEWS.md: consolidate dev entries from both sides under the new header.
  - inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep main's versions; the branch's earlier deletions were premature.
  - man/*.Rd: regenerated via devtools::document() so @family references include both NexusTokensToInteger() (this branch) and the TNT additions from main.

  R/parse_files.R auto-merged cleanly. Full devtools::test() green.

* sp

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
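
The behaviour described for the whitespace stripping and the integer conversion can be illustrated with a stand-in. `TokensToIntegerSketch()` is a hypothetical function written for this example, not the exported `NexusTokensToInteger()`: the name of the default option (`"na"`) and the choice to treat `(..)` and `{..}` tokens identically are assumptions.

```r
# Hypothetical stand-in for the conversion described above.
TokensToIntegerSketch <- function(tokens, polymorphism = c("na", "first", "last")) {
  polymorphism <- match.arg(polymorphism)
  out <- rep(NA_integer_, length(tokens))   # "?", "-" and the rest stay NA

  single <- grepl("^[0-9]$", tokens)        # plain single-digit states
  out[single] <- as.integer(tokens[single])

  if (polymorphism != "na") {
    poly <- grepl("^[({][0-9]+[)}]$", tokens)   # e.g. "(12)" or "{01}"
    digits <- gsub("[^0-9]", "", tokens[poly])
    pick <- if (polymorphism == "first") {
      substr(digits, 1L, 1L)
    } else {
      substr(digits, nchar(digits), nchar(digits))
    }
    out[poly] <- as.integer(pick)
  }
  out
}

# Internal whitespace is removed first, so "(1 2)" becomes "(12)".
tokens <- gsub(" ", "", c("0", "(1 2)", "{0 1}", "?", "-"), fixed = TRUE)

asNa <- TokensToIntegerSketch(tokens)
asFirst <- TokensToIntegerSketch(tokens, polymorphism = "first")
asLast <- TokensToIntegerSketch(tokens, polymorphism = "last")
```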

ms609 and others added 5 commits May 15, 2026 09:35

ms609 merged commit ee67ba8 into main May 15, 2026
35 of 36 checks passed

ms609 merged 5 commits into main from tnt-parse-fixes
