Fix six ReadTntCharacters parse failures on real-world TNT files#268

Merged: ms609 merged 5 commits into main from tnt-parse-fixes on May 15, 2026

ms609 commented May 15, 2026

Summary

Five commits fix six documented ReadTntCharacters() failure modes encountered when parsing the Goloboff et al. (2019, Syst. Biol.) corpus of 13 morphological matrices. Empirically: before these fixes, 8/13 matrices parsed; after, 11/13 parse cleanly and 2 more (characidae, dromaeodat) return data but with a residual TAXA-MISMATCH warning (a follow-up branch addresses those).

| Commit | Bug | Root cause | Fix |
|---|---|---|---|
| 25160c70 | `invalid substring arguments` on files with multi-line TNT comments (dinosaurs, wasps) | Inner lines of multi-line `'...'` comments not blanked; their `;` corrupted `xreadEnd` | Blank `lines[innerStart:innerEnd]` for each open/close comment pair |
| ccc572c4 | `max` returning `-Inf` on files using `&` block-separator (beetles) | Bare `&` lines survived into `matrixLines`; first line matched no taxon pattern → empty `taxonLineNumber` | Strip `^&\s*$` lines from `matrixLines` unconditionally |
| af1ccb0a | `max` returning `-Inf` on multi-line-taxon files (characidae, dionychans) | Taxon name on its own line didn't match name + space + data pattern | Add secondary `nameOnly.pattern` in `ExtractTaxa`; emit `""` token contribution |
| 6fa92528 | Returns `NULL` on dromaeodat-style files | `^XREAD\b` anchor failed when `xread` appeared mid-line after other `;`-separated directives | Relax grep to `\bXREAD\b`; strip pre-`xread` portion from the matching line |
| adc0bfd4 | Taxon names contain `_@Family_Genus_...` garbage | TNT `taxonomy=;` directive appends `@taxonomy` suffix to names | Strip `@\S*$` and trailing `_+` from taxa after name extraction |

Test plan

  • 5 new test fixtures under inst/extdata/tests/; one new test block per failure mode in test-ReadTntTree.R
  • devtools::load_all() + testthat::test_file('tests/testthat/test-ReadTntTree.R') — 42 dots, no regressions
  • Empirical re-parse against the actual Goloboff corpus (13 .tnt files): 11 clean parses, 2 partial (residual TAXA-MISMATCH on characidae, dromaeodat — addressed in follow-up)
  • Existing test-parsers.R untouched, still passing

🤖 Generated with Claude Code

ms609 and others added 5 commits May 15, 2026 09:35
Blank inner lines between open/close multi-line comment markers so that
semicolons and text within comments no longer corrupt xreadEnd detection
or appear as spurious matrix content (fixes dinosaurs- and wasps-style
failures). Guarded with innerStart <= innerEnd to avoid descending
seq on adjacent-line comments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
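
The comment-blanking step described above can be sketched in base R. This is an illustration only, not the code from the commit: the helper name `BlankCommentInteriors` is hypothetical, and it assumes that a line with an odd number of single quotes opens or closes a multi-line comment.

```r
# Hypothetical helper (not the package's code): blank the lines that lie
# strictly between the opening and closing line of each multi-line comment.
# Assumption: an odd number of single quotes on a line toggles comment state.
BlankCommentInteriors <- function(lines) {
  nQuotes <- lengths(regmatches(lines, gregexpr("'", lines, fixed = TRUE)))
  toggles <- which(nQuotes %% 2L == 1L)
  nPairs <- length(toggles) %/% 2L
  for (i in seq_len(nPairs)) {
    innerStart <- toggles[2L * i - 1L] + 1L
    innerEnd <- toggles[2L * i] - 1L
    # Guard: for adjacent open/close lines, innerStart:innerEnd would be a
    # descending sequence and would blank the comment's own boundary lines.
    if (innerStart <= innerEnd) {
      lines[innerStart:innerEnd] <- ""
    }
  }
  lines
}

BlankCommentInteriors(c("xread 'Title starts", "stray ; here", "ends'", "3 2"))
```

In this sketch only the interior line containing the stray `;` is blanked; the opening and closing lines are left for later comment stripping.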
Bare & lines (used as block separators in multi-segment TNT files
like beetles.tnt) were not removed before ExtractTaxa, causing
max(integer(0)) = -Inf and a vapply type error. Remove them
unconditionally before ctypeLines processing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
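
A minimal sketch of the separator removal described above, assuming the matrix lines are held in a character vector. `StripBlockSeparators` is a hypothetical name used for illustration; the regular expression is the one quoted in the PR summary.

```r
# Hypothetical helper: drop lines that consist solely of "&" plus optional
# trailing whitespace, leaving "&" inside other content untouched.
StripBlockSeparators <- function(matrixLines) {
  matrixLines[!grepl("^&\\s*$", matrixLines, perl = TRUE)]
}

matrixLines <- c("&", "Taxon_a 0101", "& ", "Taxon_a 1100")
StripBlockSeparators(matrixLines)

# Why an empty match set was fatal: max() of a zero-length vector is -Inf
# (with a warning), which is not a usable line index.
suppressWarnings(max(integer(0)))
```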
TNT files from e.g. characidae/dionychans place the taxon name alone on
its own line, with character data on subsequent lines. Introduce a
secondary nameOnly.pattern that recognises single-token lines starting
with a letter as taxon starts, and explicitly set their token contribution
to empty so data lines are correctly concatenated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
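
The idea of a name-only line contributing an empty token string can be illustrated as follows. This is a simplified sketch, not `ExtractTaxa` itself: `CollectTaxa` is a hypothetical helper, and it assumes that taxon names start with a letter while continuation data lines do not.

```r
# Hypothetical sketch: join character data spread over several lines.
# Assumptions: a line starting with a letter begins a new taxon; any other
# non-blank line continues the data of the most recent taxon.
CollectTaxa <- function(matrixLines) {
  matrixLines <- trimws(matrixLines)
  matrixLines <- matrixLines[nzchar(matrixLines)]
  startsTaxon <- grepl("^[A-Za-z]", matrixLines)
  if (length(matrixLines) == 0L || !startsTaxon[1]) {
    stop("First non-blank line must begin with a taxon name")
  }
  taxonId <- cumsum(startsTaxon)
  tokens <- matrixLines
  # Remove the name; a name-only line therefore contributes "".
  tokens[startsTaxon] <- sub("^\\S+\\s*", "", matrixLines[startsTaxon],
                             perl = TRUE)
  tokens <- gsub("\\s+", "", tokens, perl = TRUE)
  joined <- vapply(split(tokens, taxonId), paste0, character(1),
                   collapse = "")
  taxa <- sub("^(\\S+).*$", "\\1", matrixLines[startsTaxon], perl = TRUE)
  setNames(unname(joined), taxa)
}

CollectTaxa(c("Taxon_a", "0101", "11??", "Taxon_b 0000", "1111"))
```

Both layouts (name alone on its line, and name followed by data) are handled by the same path here, because the name-only case simply yields an empty first contribution.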
TNT files from dromaeodat-style datasets emit all directives on a single
semicolon-separated line (e.g. 'piwe=; mxr 100 ; ... ; xread 853 164').
Relax the grep anchor from ^XREAD to \bXREAD\b and strip the pre-xread
portion of the line so dimension parsing and matrix extraction work
correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
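
The relaxed anchor can be sketched as below. `LocateXread` is a hypothetical helper, and matching case-insensitively is an assumption of this sketch rather than something the commit message states.

```r
# Hypothetical helper: find the first line containing the xread directive
# anywhere on the line, and discard whatever precedes it on that line.
LocateXread <- function(lines) {
  hit <- grep("\\bxread\\b", lines, ignore.case = TRUE, perl = TRUE)
  if (length(hit) == 0L) {
    return(NULL)
  }
  first <- hit[1]
  # Lazy match so that only text before the first "xread" is removed.
  lines[first] <- sub("^.*?\\bxread\\b", "xread", lines[first],
                      ignore.case = TRUE, perl = TRUE)
  lines[first:length(lines)]
}

LocateXread(c("nstates 8;", "piwe=; mxr 100 ; xread 3 2", "Taxon_a 010"))
```

With a `^`-anchored pattern the second line above would not match at all, which corresponds to the `NULL` return described in the summary table.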
TNT files with 'taxonomy=;' append '@Family_Genus_...' to each taxon
name. Strip the '@...' suffix (and any trailing underscores) in
ExtractTaxa so that rownames reflect the plain taxon name only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
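
The two substitutions named in the summary table can be tried in isolation. `CleanTaxonNames` is a hypothetical wrapper and the example names are invented for illustration.

```r
# Hypothetical wrapper around the two patterns quoted in the PR summary:
# remove an "@..." suffix, then any trailing underscores it leaves behind.
CleanTaxonNames <- function(taxa) {
  taxa <- sub("@\\S*$", "", taxa, perl = TRUE)
  sub("_+$", "", taxa, perl = TRUE)
}

CleanTaxonNames(c("Taxon_a_@Family_Genus", "Taxon_b@Family", "Plain_name"))
```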
codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.07%. Comparing base (4e5af7a) to head (adc0bfd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
+ Coverage   96.06%   96.07%   +0.01%     
==========================================
  Files          80       80              
  Lines        5890     5905      +15     
==========================================
+ Hits         5658     5673      +15     
  Misses        232      232              

@github-actions

Performance benchmark results

| Call | Status | Change | Time (ms) |
|---|---|---|---|
| `as.Splits(bigTrees)` | ⚪ NSD | -5.28% | 22.5 → 24, 23.5 |
| `as.Splits(someTrees)` | ⚪ NSD | -0.32% | 11.1 → 11, 11.2 |
| `Consensus(forest1k.888, check = FALSE)` | ⚪ NSD | -1.91% | 101 → 98, 107 |
| `Consensus(forest201.80, check = FALSE)` | ⚪ NSD | -4.74% | 4.05 → 4.01, 4.46 |
| `Consensus(forest21.260, 0.5, FALSE)` | ⚪ NSD | -0.16% | 1.25 → 1.24, 1.27 |
| `Consensus(forest21.260)` | ⚪ NSD | -0.86% | 1.26 → 1.25, 1.29 |
| `Consensus(forestMaj, 0.5, FALSE)` | ⚪ NSD | -0.46% | 2.99 → 2.93, 3.07 |
| `DropTip(tr2000, 5)` | ⚪ NSD | -0.99% | 16.6 → 16.3, 17.6 |
| `DropTip(tr80, 5)` | ⚪ NSD | -1.73% | 0.102 → 0.102, 0.105 |
| `DropTip(unlen2k, 5)` | ⚪ NSD | 2.61% | 0.211 → 0.202, 0.208 |
| `DropTip(unlen80, 5)` | ⚪ NSD | -0.1% | 0.0399 → 0.0396, 0.0402 |
| `lapply(bigSplits, as.phylo)` | ⚪ NSD | 0.63% | 29.5 → 29.4, 29.4 |
| `lapply(someSplits, as.phylo)` | ⚪ NSD | 0.35% | 13.9 → 13.5, 13.9 |
| `PathLengths(tr2000, full = TRUE)` | ⚪ NSD | -0.65% | 15.7 → 15.7, 16.2 |
| `PathLengths(tr80, full = TRUE)` | ⚪ NSD | -0.51% | 0.102 → 0.102, 0.104 |
| `PathLengths(tr80Unif, full = TRUE)` | ⚪ NSD | -1.32% | 0.104 → 0.104, 0.106 |
| `RootTree(tr2000, 5)` | ⚪ NSD | -1.95% | 0.389 → 0.391, 0.402 |
| `RootTree(tr80, c("t3", "t36"))` | ⚪ NSD | -2% | 0.0704 → 0.0707, 0.0728 |
| `RootTree(tr80, "t3")` | ⚪ NSD | -2.53% | 0.0498 → 0.0506, 0.0516 |
| `RootTree(tr80, "t30")` | ⚪ NSD | -2.37% | 0.0498 → 0.051, 0.0511 |
| `RootTree(unlen2k, 5)` | ⚪ NSD | 0.61% | 0.344 → 0.344, 0.336 |
| `RootTree(unlen80, c("t3", "t36"))` | ⚪ NSD | -1.29% | 0.0655 → 0.0658, 0.0669 |
| `RootTree(unlen80, "t3")` | ⚪ NSD | -1.17% | 0.0438 → 0.0445, 0.0439 |
| `RootTree(unlen80, "t30")` | ⚪ NSD | -0.77% | 0.0442 → 0.0447, 0.0443 |
| `TreeDist::RobinsonFoulds(forest201.80)` | ⚪ NSD | -5.4% | 16.2 → 15.9, 17.5 |
| `TreeDist::RobinsonFoulds(forest21.888)` | ⚪ NSD | -4.81% | 3.4 → 3.43, 3.6 |
| `TreeTools:::path_lengths(tr80$edge, tr80$edge.length, FALSE)` | ⚪ NSD | -0.39% | 0.0933 → 0.0929, 0.0943 |
| `TreeTools:::postorder_order(bal40)` | ⚪ NSD | 1.79% | 0.00169 → 0.00166, 0.00165 |
| `TreeTools:::postorder_order(bal40k)` | ⚪ NSD | -1.4% | 0.475 → 0.478, 0.489 |
| `TreeTools:::postorder_order(dbal40)` | ⚪ NSD | 2.28% | 0.00175 → 0.00171, 0.0017 |
| `TreeTools:::postorder_order(dbal40k)` | ⚪ NSD | 0.13% | 2.04 → 2.02, 2.05 |
| `TreeTools:::postorder_order(dpec40)` | ⚪ NSD | 1.18% | 0.00255 → 0.00252, 0.00253 |
| `TreeTools:::postorder_order(dpec40k)` | ⚪ NSD | 0.02% | 3300 → 3290, 3300 |
| `TreeTools:::postorder_order(drnd80)` | ⚪ NSD | 0% | 0.00405 → 0.00405, 0.00405 |
| `TreeTools:::postorder_order(nbal40)` | ⚪ NSD | 1.06% | 0.00208 → 0.00204, 0.00206 |
| `TreeTools:::postorder_order(nbal40k)` | ⚪ NSD | -1.06% | 2.17 → 2.18, 2.21 |
| `TreeTools:::postorder_order(npec40)` | ⚪ NSD | 0.38% | 0.00284 → 0.00283, 0.00282 |
| `TreeTools:::postorder_order(npec40k)` | ⚪ NSD | -0.29% | 3310 → 3320, 3330 |
| `TreeTools:::postorder_order(nrnd80)` | ⚪ NSD | -0.87% | 0.00457 → 0.00462, 0.00458 |
| `TreeTools:::postorder_order(pec40)` | ⚪ NSD | 2.39% | 0.00167 → 0.00163, 0.00163 |
| `TreeTools:::postorder_order(pec40k)` | ⚪ NSD | -26.57% | 0.425 → 0.536, 0.543 |
| `TreeTools:::postorder_order(rnd80)` | ⚪ NSD | 0.94% | 0.00212 → 0.0021, 0.00211 |

@ms609 ms609 merged commit ee67ba8 into main May 15, 2026
35 of 36 checks passed
ms609 added a commit that referenced this pull request May 15, 2026
.UTFLines() was falling back to latin1 when UTF-8 decoding failed.
Latin1 maps bytes 0x80-0x9F to C1 control characters (U+0080-U+009F),
while Windows-1252 maps them to printable characters (e.g. 0x91/0x92
become U+2018/U+2019 curly quotes). The nameOnly.pattern uses \p{Pi}
which matches U+2018 but not U+0091, so smart-quote taxon names in
Windows-1252 files (e.g. characidae.tnt, 4 taxa) were silently dropped.

Change fallback encoding to cp1252, add test fixture and regression test.

Fixes #268 (characidae.tnt: 156 → 160 taxa)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
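
The encoding difference described above can be reproduced with base R's `iconv()`. This sketch is independent of `.UTFLines()`; the byte sequence is invented for illustration, and the encoding names `"ISO-8859-1"` and `"CP1252"` are assumed to be available in the platform's iconv implementation.

```r
# Bytes as they might appear in a Windows-1252 file: 0x91 A b c 0x92
bytes <- as.raw(c(0x91, 0x41, 0x62, 0x63, 0x92))
rawName <- rawToChar(bytes)

# Decode the same bytes under the two candidate fallback encodings.
asLatin1 <- iconv(rawName, from = "ISO-8859-1", to = "UTF-8")
asCp1252 <- iconv(rawName, from = "CP1252", to = "UTF-8")

# Code point of the first character under each decoding:
# a C1 control (U+0091) versus a left single quotation mark (U+2018).
utf8ToInt(asLatin1)[1]
utf8ToInt(asCp1252)[1]

# Test each decoded name against an initial-quote (\p{Pi}) prefix.
grepl("^\\p{Pi}", c(asLatin1, asCp1252), perl = TRUE)
```

`"ISO-8859-1"` is spelled out here rather than `"latin1"` so that the comparison does not depend on how a given platform aliases that name.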
ms609 added a commit that referenced this pull request May 15, 2026
Some TNT files (e.g. dromaeodat.tnt) write 2–3 taxa on one physical
line with no whitespace separator: character data ends and the next
taxon name begins immediately (e.g. `???Taxon_b@clade`). The parser
only captured the first taxon on each such line, leaving 13 taxa
missing from dromaeodat.tnt (151 parsed vs. 164 declared).

Add a pre-processing step in ReadTntCharacters() that detects these
concatenated lines via a zero-width lookbehind/lookahead split and
breaks them into individual taxon lines before passing to ExtractTaxa.

Fixes #268 (dromaeodat.tnt: 151 → 164 taxa)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
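
A zero-width boundary of this kind can be sketched as follows. The pattern is an assumption made for illustration, not the one used in `ReadTntCharacters()`: it treats a position as a taxon boundary when a data character (digit, `?` or `-`) is immediately followed by a capitalized word and an underscore. `SplitConcatenatedTaxa` is a hypothetical name.

```r
# Hypothetical helper: break lines on which several taxa run together.
# Assumed boundary: data character behind, "Capitalised_" name ahead.
SplitConcatenatedTaxa <- function(matrixLines) {
  boundary <- "(?<=[0-9?-])(?=[A-Z][a-z]+_)"
  # Mark each zero-width boundary with a newline, then split on newlines.
  marked <- gsub(boundary, "\n", matrixLines, perl = TRUE)
  unlist(strsplit(marked, "\n", fixed = TRUE), use.names = FALSE)
}

SplitConcatenatedTaxa(c("Taxon_a 01??Taxon_b@clade 10?1", "Taxon_c 0101"))
```

Marking the boundary with `gsub()` and then splitting on a literal newline is a design choice of this sketch; it sidesteps the special handling that `strsplit()` applies to zero-length matches. A taxon name that itself contains a digit followed by a capitalized word would be split incorrectly under this assumed pattern.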
ms609 added a commit that referenced this pull request May 18, 2026
Resolve conflicts from PR #268 (TNT parser fixes) landing on main while
this branch was open:

- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate
  Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep
  main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references
  include both NexusTokensToInteger() (this branch) and the TNT
  additions from main.

R/parse_files.R auto-merged cleanly. Full devtools::test() green.
ms609 added a commit that referenced this pull request May 18, 2026
* Harden NexusTokens for Cingulata-style polymorphism with internal whitespace

No parser fix was needed: the existing gsub(" ", "", ...) at parse_files.R:88
already strips internal whitespace from polymorphism tokens before NexusTokens()
sees them, so (1 2) -> (12) and {0 1} -> {01} already worked correctly.

Added:
- Regression test covering (1 2) / {0 1} polymorphism and multi-line matrix
  continuation in test-parsers.R.
- NexusTokensToInteger(): a new exported helper that converts the character
  matrix from ReadCharacters() to integer, mapping polymorphic/ambiguous/?/-
  tokens to NA_integer_ by default, or extracting the first/last state digit
  under polymorphism = "first"/"last". Tests included.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* redoc w/ roxygen8

* Complete impl

* Merge origin/main into fix-polymorphism-whitespace

Resolve conflicts from PR #268 (TNT parser fixes) landing on main while
this branch was open:

- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate
  Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep
  main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references
  include both NexusTokensToInteger() (this branch) and the TNT
  additions from main.

R/parse_files.R auto-merged cleanly. Full devtools::test() green.

* sp

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
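
The conversion behaviour described for `NexusTokensToInteger()` can be approximated in a few lines. The function below is a stand-in written for illustration, not the exported helper: its name, the `"na"` option label and the token patterns are all assumptions.

```r
# Hypothetical stand-in for the described behaviour: single-digit tokens
# become integers; polymorphic tokens such as "(12)" or "{01}" become NA,
# or their first/last state digit on request; "?" and "-" stay NA.
TokensToIntegerSketch <- function(tokens,
                                  polymorphism = c("na", "first", "last")) {
  polymorphism <- match.arg(polymorphism)
  out <- matrix(NA_integer_, nrow(tokens), ncol(tokens),
                dimnames = dimnames(tokens))
  single <- grepl("^[0-9]$", tokens)
  out[single] <- as.integer(tokens[single])
  if (polymorphism != "na") {
    poly <- grepl("^[({][0-9]+[)}]$", tokens)
    digits <- gsub("[^0-9]", "", tokens[poly])
    pick <- if (polymorphism == "first") {
      substr(digits, 1L, 1L)
    } else {
      substr(digits, nchar(digits), nchar(digits))
    }
    out[poly] <- as.integer(pick)
  }
  out
}

# Internal whitespace is removed first, so "(1 2)" is seen as "(12)".
tokens <- matrix(gsub(" ", "", c("0", "(1 2)", "?", "{0 1}", "-", "1")),
                 nrow = 2, byrow = TRUE, dimnames = list(c("A", "B"), NULL))
TokensToIntegerSketch(tokens)
TokensToIntegerSketch(tokens, polymorphism = "last")
```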