Add AlphaNum precision/recall/F-score metrics #56
Open
Jakob Drachmann Havtorn (JakobHavtorn) wants to merge 6 commits into
Introduces AlphaNumP, AlphaNumR, and AlphaNumF: precision, recall, and F-score over alphanumerical entities (initialisms, acronyms, chemical/unit notation, mixed-case medical/brand terms, digit-prefixed entities) detected via a Unicode-aware regex predicate over case-preserving tokens. No vocabulary input required; mirrors the existing KTP/KTR/KTF/_KTStats architecture but swaps the trie/vocab lookup for a stateless regex helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
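As a rough illustration of how the three scores relate once entities have been classified as TP/FN/FP, here is a minimal sketch. The `EntityCounts` container, the `beta` parameter, and the convention of returning 0.0 on an empty denominator are assumptions made for this example; they are not taken from the PR's `_AlphaNumStats`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityCounts:
    """Hypothetical container for entity-level counts (not the PR's class)."""

    tp: int  # reference entities reproduced exactly in the hypothesis
    fn: int  # reference entities altered or missing in the hypothesis
    fp: int  # hypothesis entities with no exact counterpart in the reference


def precision(c: EntityCounts) -> float:
    # Share of hypothesis entities that are correct; 0.0 when there are none
    # (assumed convention).
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: EntityCounts) -> float:
    # Share of reference entities that were recovered; 0.0 when there are none
    # (assumed convention).
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f_score(c: EntityCounts, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision and recall; beta=1 gives F1.
    p, r = precision(c), recall(c)
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


counts = EntityCounts(tp=8, fn=2, fp=2)
print(precision(counts), recall(counts), f_score(counts))
```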
Copilot started reviewing on behalf of
Jakob Drachmann Havtorn (JakobHavtorn)
May 28, 2026 19:02
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   88.56%   88.90%   +0.33%
==========================================
  Files          47       52       +5
  Lines        2782     3073     +291
  Branches      342      372      +30
==========================================
+ Hits         2464     2732     +268
- Misses        237      252      +15
- Partials       81       89       +8
Replace the domain-flavoured examples (chemical/unit notation, medical/brand terms) with an explicit description of what the default regex actually matches. The metric is a general predicate over Token.raw, not a domain-specific detector, and the descriptions should reflect that.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default tokenizer splits on `-`, so multi-token entities like CT-scan and X-ray were previously only partially detected (just `CT`) or missed entirely (`X` alone fails the length-≥2 rule). Extend the matcher to recognise hyphen-connected runs of tokens as compound candidates:
- ALPHANUM_DEFAULT_PATTERN: allow `-` in the body and expand the negative lookahead to reject "ordinary capitalised compounds" (init-cap or lowercase parts joined by hyphens), so Hello-World, up-to-date, e-mail, mother-in-law, state-of-the-art remain rejected while CT-scan, X-ray, T-cell, pre-MRI, non-COVID, 5-HT, vitamin-D, MRI-CT, pre-COVID-19, Hello-MRI all match.
- match_token_regex: identify runs of consecutive tokens whose source-text gaps consist of only hyphens, try a compound fullmatch against the joined standardized substring, and return a multi-token slice when it matches. Fall back to per-token matching within the run otherwise.
- _AlphaNumStats multi-token slices work naturally with the existing alignment-based TP/FN/FP classification: the slice spans the same alignment ops it would as multiple single-token slices, so any edit inside the compound makes it FN, all-MATCH makes it TP.
Tests added: 57 new cases:
- regex-level: compound match/reject/FP at fullmatch
- helper-level: multi-token slicing, fallback, spaces disabling compounds, three-part compounds, mixed text
- stats-level: perfect match, case lost, hyphen dropped (still TP), partial loss, spurious in hyp, ordinary capitalised compound stays at 0 entities
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
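The run-detection step described for match_token_regex can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy tokenizer, the `Span` type, and the `hyphen_runs` helper are hypothetical and are not the PR's code. The sketch covers only the grouping of tokens whose source-text gaps are pure hyphens, not the compound fullmatch or the per-token fallback.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the source text


def tokenize_with_spans(text: str) -> List[Span]:
    """Toy tokenizer (assumption): runs of letters/digits are tokens,
    so hyphens and whitespace end up in the gaps between tokens."""
    return [m.span() for m in re.finditer(r"[^\W_]+", text)]


def hyphen_runs(text: str, spans: List[Span]) -> List[List[int]]:
    """Group token indices into runs whose source-text gaps are only hyphens."""
    runs: List[List[int]] = []
    for i, (start, _end) in enumerate(spans):
        if runs:
            prev_end = spans[i - 1][1]
            gap = text[prev_end:start]
            # Extend the current run only if the gap is one or more hyphens.
            if gap and set(gap) == {"-"}:
                runs[-1].append(i)
                continue
        runs.append([i])
    return runs


text = "a CT-scan and pre-COVID-19 X ray"
spans = tokenize_with_spans(text)
for run in hyphen_runs(text, spans):
    # Joined substring that a compound fullmatch would be tried against.
    print(text[spans[run[0]][0]:spans[run[-1]][1]])
```

In this sketch a space between tokens breaks the run, so `X ray` stays as two single-token candidates while `CT-scan` and `pre-COVID-19` each form one compound candidate.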
T-shirt, D-day, A-frame, S-curve are structurally identical to X-ray and are
correctly identified by the regex as case-distinctive. Same for shouted words
like THE, STOP, NO. They aren't "documented false positives" — they're true
positives of the metric as defined ("token contains a case signal"). Reframe
accordingly:
- Fold T-shirt/D-day/A-frame/S-curve into the compound matches list; remove
the separate "documented false positives" compound test.
- Fold THE/STOP/NO into the single-token matches list; remove the separate
single-token "documented false positives" test.
- Strip the apologetic "Note: ..." prefix from AlphaNumP's description on the
case-sensitivity caveat (it's a property of the metric, not a limitation
framed against the user's expectations).
No regex or matching behaviour changes — only test grouping and copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two behavior changes:
1. Hyphen structure is now enforced strictly. Previously, ref CT-scan vs hyp
"CT scan" counted as TP because tokens [CT, scan] were identical on both
sides — the alignment is token-level and the hyphen is in the source-text
gap. Now: a multi-token compound ref match requires the corresponding hyp
tokens to also be hyphen-connected; otherwise FN. Symmetric on the hyp
side: a hyp compound with no matching hyphen structure in ref counts as
FP (hyphen invented). Plumbed via a new public helper
tokens_are_hyphen_connected in regex_match.py and a
hyp_token_idx-based check in _AlphaNumStats._ref_match_classification
and fp_alignments.
2. Any token (or compound) containing a Greek letter is treated as an
entity, regardless of case. Adds a third branch to
ALPHANUM_DEFAULT_PATTERN: (?=.*\p{Greek})[\p{L}\d][-\p{L}\d]*. So α, β,
μ, μg, α-helix, β-blocker, γδ now all match. Previously rejected because
they had no uppercase letter and no digit; the new branch is independent
of those conditions.
Tests updated:
- test_compound_hyphen_dropped_keeps_tp → test_compound_hyphen_dropped_is_fn
- New test_hyp_invents_hyphen_is_fp covering the symmetric FP case
- New TestAlphaNumStatsGreekTokens covering single Greek letter, μg, and
Greek substitution in hyp
- α-helix / β-blocker moved from compound rejects to compound matches
- γ-radiation added to compound matches
- Single Greek tokens (α, β, γ, δ, μ, Δ, Ω) moved from rejects into a new
test_greek_tokens_match block
All 1069 tests pass; pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
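To make the shape of the Greek branch concrete, here is a standard-library approximation of `(?=.*\p{Greek})[\p{L}\d][-\p{L}\d]*`. It is a sketch, not the PR's implementation: `\p{Greek}` needs a regex engine with Unicode script support, so this version approximates "Greek letter" by checking that the character's Unicode name starts with `GREEK`, and the helper names are hypothetical.

```python
import unicodedata


def is_greek_letter(ch: str) -> bool:
    """Approximate \\p{Greek} for letters via the Unicode character name
    (assumption: good enough for illustration, not an exact script test)."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")


def matches_greek_branch(candidate: str) -> bool:
    """Mimic the branch: first char is a letter or decimal digit, the rest are
    letters, decimal digits, or hyphens, and a Greek letter occurs somewhere."""
    if not candidate:
        return False
    first = candidate[0]
    if not (first.isalpha() or first.isdecimal()):
        return False
    for ch in candidate[1:]:
        if not (ch == "-" or ch.isalpha() or ch.isdecimal()):
            return False
    return any(is_greek_letter(ch) for ch in candidate)


for text in ["α", "μg", "α-helix", "β-blocker", "γδ", "helix"]:
    print(text, matches_greek_branch(text))
```

One detail worth keeping in mind when writing test data: the micro sign (U+00B5) looks like Greek mu (U+03BC) but is a different code point, and this sketch does not treat it as Greek.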
THE, HELLO, STOP, NO, OK all match the regex's case-distinctive rule (≥2 uppercase letters, not init-cap-word shape) but are semantically just shouted ordinary words, not abbreviations. Re-add the test to explicitly label them as documented false positives, making the trade-off legible rather than silently mixed into the main matches list. T-shirt, D-day, A-frame, S-curve remain in the compound matches list (true positives: structurally identical to X-ray, indistinguishable without a vocabulary).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
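The trade-off above can be seen in a small sketch of the case-distinctive rule as it is worded in this commit message (at least two uppercase letters, not init-cap-word shape). The function names are hypothetical, and this covers only that single rule; the digit, Greek, and compound branches of ALPHANUM_DEFAULT_PATTERN are not modelled here.

```python
def is_init_cap_word(token: str) -> bool:
    """True for shapes like 'Hello': one leading capital, the rest lowercase."""
    return token[:1].isupper() and all(ch.islower() for ch in token[1:])


def has_case_signal(token: str) -> bool:
    """Simplified rule (assumption): >= 2 uppercase letters and not init-cap."""
    uppercase = sum(1 for ch in token if ch.isupper())
    return uppercase >= 2 and not is_init_cap_word(token)


# Abbreviations and shouted words are indistinguishable to a purely
# structural rule: both pass.
for token in ["MRI", "THE", "STOP", "OK", "Hello", "Patient"]:
    print(token, has_case_signal(token))
```

Because the rule looks only at letter case, separating `MRI` from `THE` would need some external knowledge such as a vocabulary, which this metric deliberately does not take as input.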
Summary
- `AlphaNumP`/`AlphaNumR`/`AlphaNumF`: precision, recall, and F-score over alphanumerical entities, initialisms, and acronyms (via upper-case terms, mixed-case terms, alphanumerical terms, digit-prefixed terms). No vocabulary input required.
- Mirrors the existing `KTP`/`KTR`/`KTF`/`_KTStats` architecture but swaps the trie/vocabulary lookup for a stateless regex helper (`bewer.preprocessing.regex_match.match_token_regex`) over `Token.raw`, so detection is case-preserving and works on any dataset out of the box.
- The default pattern rejects ordinary init-cap words (`Patient`, `Hello`, `The`) and ordinals across English / French / Dutch (`1st`, `1er`, `1e`) with no exclusion list. Override via `pattern=…`.
- Builds on `_KTStats`'s alignment-based classification; uses `normalized=False` Levenshtein so case mismatches (e.g. ref `MRI` vs hyp `mri`) are surfaced as FN + FP rather than silently passing.
Linear
SR-2628
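The case-sensitivity point in the last summary bullet can be illustrated with a token-level Levenshtein distance. This is a sketch only: `token_edit_distance` is a hypothetical helper, and it assumes that normalization amounts to lowercasing, which is a simplification of whatever the library's `normalized` option does.

```python
from typing import List


def token_edit_distance(ref: List[str], hyp: List[str], normalized: bool) -> int:
    """Levenshtein distance over tokens (unit costs for ins/del/sub)."""
    if normalized:
        # Assumed meaning of normalization for this sketch: case folding only.
        ref = [t.lower() for t in ref]
        hyp = [t.lower() for t in hyp]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


ref = ["the", "MRI", "was", "normal"]
hyp = ["the", "mri", "was", "normal"]
print(token_edit_distance(ref, hyp, normalized=True))   # case mismatch hidden
print(token_edit_distance(ref, hyp, normalized=False))  # case mismatch visible
```

With case folding the two sequences align perfectly, so a lost `MRI` would pass unnoticed; comparing the raw tokens turns it into a substitution that the entity classification can then count.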