Fix six ReadTntCharacters parse failures on real-world TNT files #268
Merged
Blank inner lines between open/close multi-line comment markers so that semicolons and text within comments no longer corrupt xreadEnd detection or appear as spurious matrix content (fixes dinosaurs- and wasps-style failures). Guarded with innerStart <= innerEnd to avoid descending seq on adjacent-line comments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
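The blanking step described in this commit can be sketched as follows. This is a minimal illustration in Python, not the package's R code; the function name `blank_comment_interiors` and the `(open, close)` index pairs are hypothetical, and locating the comment markers is assumed to have happened already.

```python
def blank_comment_interiors(lines, comment_pairs):
    """Blank every line strictly between an opening and a closing comment line.

    `comment_pairs` holds hypothetical (open_idx, close_idx) tuples of
    0-based line indices.
    """
    out = list(lines)
    for open_idx, close_idx in comment_pairs:
        inner_start, inner_end = open_idx + 1, close_idx - 1
        # Guard for comments that open and close on adjacent lines.
        # In R, innerStart:innerEnd would count downwards in that case;
        # Python's range would simply be empty, so the guard here only
        # mirrors the R concern.
        if inner_start <= inner_end:
            for i in range(inner_start, inner_end + 1):
                out[i] = ""
    return out


lines = ["xread 'Matrix notes", "taxa; chars", "end notes'", "3 2", "A 01", ";"]
cleaned = blank_comment_interiors(lines, [(0, 2)])
```

With the inner line blanked, the `;` inside the comment can no longer be mistaken for the end of the `xread` block.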
Bare & lines (used as block separators in multi-segment TNT files like beetles.tnt) were not removed before ExtractTaxa, causing max(integer(0)) = -Inf and a vapply type error. Remove them unconditionally before ctypeLines processing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
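A minimal Python sketch of the separator removal, assuming the pattern `^&\s*$` quoted in the PR summary; the function name `drop_block_separators` is hypothetical and the real fix lives in the package's R code.

```python
import re

# A line holding only "&" plus optional trailing whitespace.
BARE_AMPERSAND = re.compile(r"^&\s*$")


def drop_block_separators(matrix_lines):
    """Remove bare '&' separator lines; keep everything else untouched."""
    return [line for line in matrix_lines if not BARE_AMPERSAND.match(line)]


kept = drop_block_separators(["&", "Taxon_a 01", "& ", "&[num]", "Taxon_b 10"])
```

Lines such as `&[num]` carry more than the bare separator, so this filter leaves them in place.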
TNT files from e.g. characidae/dionychans place the taxon name alone on its own line, with character data on subsequent lines. Introduce a secondary nameOnly.pattern that recognises single-token lines starting with a letter as taxon starts, and explicitly set their token contribution to empty so data lines are correctly concatenated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
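The idea of a name-only line that contributes an empty token string can be sketched like this. It is an illustration in Python under simplifying assumptions (taxon names start with a letter, data tokens never do); `collect_taxa` is a hypothetical name, and the actual `nameOnly.pattern` in `ExtractTaxa` may differ.

```python
def collect_taxa(matrix_lines):
    """Concatenate character data per taxon, allowing name-only lines."""
    taxa = {}
    current = None
    for line in matrix_lines:
        tokens = line.split()
        if not tokens:
            continue
        starts_with_letter = tokens[0][0].isalpha()
        if len(tokens) == 1 and starts_with_letter:
            # Name alone on its line: new taxon, empty token contribution.
            current, data = tokens[0], ""
        elif starts_with_letter:
            # Classic "name + space + data" line.
            current, data = tokens[0], "".join(tokens[1:])
        else:
            # Continuation line: data belongs to the current taxon.
            data = "".join(tokens)
        if current is None:
            continue  # data before any taxon name; ignored in this sketch
        taxa[current] = taxa.get(current, "") + data
    return taxa


parsed = collect_taxa(["Taxon_a", "0101", "1100", "Taxon_b 01", "10"])
```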
TNT files from dromaeodat-style datasets emit all directives on a single semicolon-separated line (e.g. 'piwe=; mxr 100 ; ... ; xread 853 164'). Relax the grep anchor from ^XREAD to \bXREAD\b and strip the pre-xread portion of the line so dimension parsing and matrix extraction work correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
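The relaxed anchor can be illustrated with a short Python sketch. The helper name `strip_before_xread` is hypothetical, and the sample line is a shortened version of the one quoted in the commit message.

```python
import re

# Word-boundary match instead of a start-of-line anchor.
XREAD = re.compile(r"\bxread\b", re.IGNORECASE)


def strip_before_xread(line):
    """Drop any directives that precede 'xread' on the same line."""
    match = XREAD.search(line)
    return line[match.start():] if match else line


line = "piwe=; mxr 100 ; xread 853 164"
xread_part = strip_before_xread(line)
# Assumes the two integers after 'xread' are the declared dimensions.
dimensions = [int(n) for n in re.findall(r"\d+", xread_part)]
```

A start-anchored pattern such as `^xread` finds nothing on this line, which is why the anchored search returned no match on dromaeodat-style files.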
TNT files with 'taxonomy=;' append '@Family_Genus_...' to each taxon name. Strip the '@...' suffix (and any trailing underscores) in ExtractTaxa so that rownames reflect the plain taxon name only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
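A Python sketch of the suffix stripping, using the two patterns named in the PR summary (`@\S*$` and trailing `_+`). The function name and the sample taxon names are made up for illustration.

```python
import re


def strip_taxonomy_suffix(name):
    """Remove an '@...' taxonomy suffix, then any trailing underscores."""
    name = re.sub(r"@\S*$", "", name)
    return re.sub(r"_+$", "", name)


plain = strip_taxonomy_suffix("Genus_species_@Family_Genus_species")
```

Names without an `@` pass through the first substitution unchanged, so the same call is safe on files that do not use `taxonomy=;`.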
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
+ Coverage   96.06%   96.07%   +0.01%
==========================================
  Files          80       80
  Lines        5890     5905      +15
==========================================
+ Hits         5658     5673      +15
  Misses        232      232
ms609
added a commit
that referenced
this pull request
May 15, 2026
.UTFLines() was falling back to latin1 when UTF-8 decoding failed.
Latin1 maps bytes 0x80-0x9F to C1 control characters (U+0080-U+009F),
while Windows-1252 maps them to printable characters (e.g. 0x91/0x92
become U+2018/U+2019 curly quotes). The nameOnly.pattern uses \p{Pi}
which matches U+2018 but not U+0091, so smart-quote taxon names in
Windows-1252 files (e.g. characidae.tnt, 4 taxa) were silently dropped.
Change fallback encoding to cp1252, add test fixture and regression test.
Fixes #268 (characidae.tnt: 156 → 160 taxa)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
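The difference between the two fallback encodings is easy to reproduce. The sketch below uses Python's codecs and `unicodedata` in place of the package's `.UTFLines()`; `decode_with_fallback` is a hypothetical name, and `errors="replace"` is an added assumption because Python's cp1252 codec rejects the five bytes that Windows-1252 leaves unassigned.

```python
import unicodedata


def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Windows-1252 rather than Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


raw = b"\x91Taxon_a\x92 0101"  # smart-quoted name as Windows-1252 bytes

as_latin1 = raw.decode("latin-1")
as_cp1252 = decode_with_fallback(raw)

# General categories: "Cc" is a control character, "Pi" an initial quote.
latin1_category = unicodedata.category(as_latin1[0])
cp1252_category = unicodedata.category(as_cp1252[0])
```

Under Latin-1 the leading byte becomes U+0091 (category `Cc`), which a `\p{Pi}` test cannot match; under Windows-1252 it becomes U+2018 (category `Pi`).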
ms609
added a commit
that referenced
this pull request
May 15, 2026
Some TNT files (e.g. dromaeodat.tnt) write 2–3 taxa on one physical line with no whitespace separator: character data ends and the next taxon name begins immediately (e.g. `???Taxon_b@clade`). The parser only captured the first taxon on each such line, leaving 13 taxa missing from dromaeodat.tnt (151 parsed vs. 164 declared). Add a pre-processing step in ReadTntCharacters() that detects these concatenated lines via a zero-width lookbehind/lookahead split and breaks them into individual taxon lines before passing to ExtractTaxa. Fixes #268 (dromaeodat.tnt: 151 → 164 taxa) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
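A zero-width split of this kind can be sketched in Python (3.7 or later, where `re.split` accepts empty matches). The regular expression is an assumption, not the one used in `ReadTntCharacters()`: it treats digits, `?`, `-`, `]` and `)` as data characters and expects the next taxon name to consist of letters, underscores or dots followed by `@`.

```python
import re

# Split point: a data character behind, a "Name@" start ahead.
# Zero-width, so no characters are consumed by the split.
TAXON_BOUNDARY = re.compile(r"(?<=[0-9?\-\])])(?=[A-Za-z][A-Za-z_.]*@)")


def split_concatenated_taxa(line):
    """Break one physical line into one string per taxon."""
    return TAXON_BOUNDARY.split(line)


parts = split_concatenated_taxa("Taxon_a@clade 01??Taxon_b@clade 10?1")
```

Because this sketch counts digits as data characters, a taxon name containing a digit directly before its final letters would be split incorrectly; a production pattern would need a stricter rule.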
ms609
added a commit
that referenced
this pull request
May 18, 2026
Resolve conflicts from PR #268 (TNT parser fixes) landing on main while this branch was open:
- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references include both NexusTokensToInteger() (this branch) and the TNT additions from main.
R/parse_files.R auto-merged cleanly. Full devtools::test() green.
ms609
added a commit
that referenced
this pull request
May 18, 2026
* Harden NexusTokens for Cingulata-style polymorphism with internal whitespace
No parser fix was needed: the existing gsub(" ", "", ...) at parse_files.R:88
already strips internal whitespace from polymorphism tokens before NexusTokens()
sees them, so (1 2) -> (12) and {0 1} -> {01} already worked correctly.
Added:
- Regression test covering (1 2) / {0 1} polymorphism and multi-line matrix
continuation in test-parsers.R.
- NexusTokensToInteger(): a new exported helper that converts the character
matrix from ReadCharacters() to integer, mapping polymorphic/ambiguous/?/-
tokens to NA_integer_ by default, or extracting the first/last state digit
under polymorphism = "first"/"last". Tests included.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
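The token-to-integer mapping described above can be sketched for a single token. This Python illustration is not the exported `NexusTokensToInteger()`: the function name `token_to_int`, the use of `None` for `NA_integer_`, and the `None` default for `polymorphism` are assumptions; only the `"first"` and `"last"` option values come from the commit message.

```python
import re

POLYMORPHIC = re.compile(r"[({](\d+)[)}]")


def token_to_int(token, polymorphism=None):
    """Convert one character-state token to an int, or None for missing."""
    token = token.replace(" ", "")  # "(1 2)" -> "(12)", as the parser does
    if len(token) == 1 and token in "0123456789":
        return int(token)
    match = POLYMORPHIC.fullmatch(token)
    if match and polymorphism == "first":
        return int(match.group(1)[0])
    if match and polymorphism == "last":
        return int(match.group(1)[-1])
    # Polymorphic (by default), ambiguous, "?" and "-" all map to missing.
    return None


row = [token_to_int(t, polymorphism="first") for t in ["0", "(1 2)", "{0 1}", "?", "-"]]
```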
* redoc w/ roxygen8
* Complete impl
* Merge origin/main into fix-polymorphism-whitespace
Resolve conflicts from PR #268 (TNT parser fixes) landing on main while
this branch was open:
- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate
Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep
main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references
include both NexusTokensToInteger() (this branch) and the TNT
additions from main.
R/parse_files.R auto-merged cleanly. Full devtools::test() green.
* sp
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

Five commits fix six documented `ReadTntCharacters()` failure modes encountered when parsing the Goloboff et al. (2019, Syst. Biol.) corpus of 13 morphological matrices. Empirically: before these fixes, 8/13 matrices parsed; after, 11/13 parse cleanly and 2 more (characidae, dromaeodat) return data but with a residual TAXA-MISMATCH warning (a follow-up branch addresses those).

| Commit | Symptom | Root cause | Fix |
|---|---|---|---|
| 25160c70 | `invalid substring arguments` on files with multi-line TNT comments (dinosaurs, wasps) | Inner lines of multi-line `'...'` comments not blanked; their `;` corrupted `xreadEnd` | Blank `lines[innerStart:innerEnd]` for each open/close comment pair |
| ccc572c4 | `max` returning `-Inf` on files using `&` block-separator (beetles) | Bare `&` lines survived into `matrixLines`; first line matched no taxon pattern → empty `taxonLineNumber` | Remove `^&\s*$` lines from `matrixLines` unconditionally |
| af1ccb0a | `max` returning `-Inf` on multi-line-taxon files (characidae, dionychans) | Name-only lines did not match the `name + space + data` pattern | Add `nameOnly.pattern` in `ExtractTaxa`; emit `""` token contribution |
| 6fa92528 | Returns `NULL` on dromaeodat-style files | `^XREAD\b` anchor failed when `xread` appeared mid-line after other `;`-separated directives | Relax to `\bXREAD\b`; strip pre-`xread` portion from the matching line |
| adc0bfd4 | Taxon names carry `_@Family_Genus_...` garbage | `taxonomy=;` directive appends `@taxonomy` suffix to names | Strip `@\S*$` and trailing `_+` from `taxa` after name extraction |

Test plan

- Fixtures in `inst/extdata/tests/`; one new test block per failure mode in `test-ReadTntTree.R`
- `devtools::load_all()` + `testthat::test_file('tests/testthat/test-ReadTntTree.R')`: 42 dots, no regressions
- `test-parsers.R` untouched, still passing

🤖 Generated with Claude Code