Add AlphaNum precision/recall/F-score metrics #56

Open
Jakob Drachmann Havtorn (JakobHavtorn) wants to merge 6 commits into main from add-alphanum-metrics-sr-2628

Jakob Drachmann Havtorn (JakobHavtorn) commented May 28, 2026

Summary

  • Addresses SR-2628 by adding AlphaNumP / AlphaNumR / AlphaNumF: precision, recall, and F-score over alphanumerical entities, initialisms, and acronyms, detected as upper-case, mixed-case, alphanumerical, or digit-prefixed terms. No vocabulary input required.
  • Mirrors the existing KTP / KTR / KTF / _KTStats architecture but swaps the trie/vocabulary lookup for a stateless regex helper (bewer.preprocessing.regex_match.match_token_regex) over Token.raw, so detection is case-preserving and works on any dataset out of the box.
  • The default Unicode-aware pattern (two branches) catches MRI, mmHg, HbA1c, CH3, iPhone, mRNA, ΔG, μM, β2, 5G, 3D, … while rejecting ordinary capitalised words (Patient, Hello, The) and ordinals across English / French / Dutch (1st, 1er, 1e), all without an exclusion list. Override via pattern=….
  • TP/FN/FP plumbing reuses _KTStats's alignment-based classification and runs Levenshtein with normalized=False, so case mismatches (e.g. ref MRI vs hyp mri) are surfaced as FN + FP rather than silently passing.
  • 148 new tests; full suite still passes (1000 / 1000, 0 regressions); pre-commit clean.
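
For readers unfamiliar with how the three metrics relate, here is a minimal sketch of precision, recall, and F-score computed from entity-level TP/FN/FP counts. `EntityCounts` and the three functions are hypothetical names for illustration, not bewer's API, and returning 0.0 on an empty denominator is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityCounts:
    """Hypothetical container for entity-level counts (not a bewer class)."""

    tp: int  # reference entities reproduced exactly in the hypothesis
    fn: int  # reference entities lost or altered in the hypothesis
    fp: int  # hypothesis entities with no matching reference entity


def precision(c: EntityCounts) -> float:
    denom = c.tp + c.fp
    # Assumed convention: 0.0 when the hypothesis contains no entities.
    return c.tp / denom if denom else 0.0


def recall(c: EntityCounts) -> float:
    denom = c.tp + c.fn
    # Assumed convention: 0.0 when the reference contains no entities.
    return c.tp / denom if denom else 0.0


def f_score(c: EntityCounts, beta: float = 1.0) -> float:
    p, r = precision(c), recall(c)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

With 3 TP, 1 FN, and 1 FP, precision and recall are both 3/4, so the F1 score is also 0.75.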

Linear

SR-2628

Introduces AlphaNumP, AlphaNumR, and AlphaNumF — precision, recall, and F-score
over alphanumerical entities (initialisms, acronyms, chemical/unit notation,
mixed-case medical/brand terms, digit-prefixed entities) detected via a
Unicode-aware regex predicate over case-preserving tokens. No vocabulary input
required; mirrors the existing KTP/KTR/KTF/_KTStats architecture but swaps the
trie/vocab lookup for a stateless regex helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 28, 2026 19:02

linear-code Bot commented May 28, 2026

SR-2628

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Codecov Comments Bot (codecov-commenter) commented May 28, 2026

Codecov Report

❌ Patch coverage is 92.09622% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.90%. Comparing base (a7727e3) to head (c49376d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/bewer/preprocessing/regex_match.py | 83.72% | 3 Missing and 4 partials ⚠️ |
| src/bewer/metrics/alphanum_r.py | 87.23% | 4 Missing and 2 partials ⚠️ |
| src/bewer/metrics/alphanum_f.py | 90.69% | 3 Missing and 1 partial ⚠️ |
| src/bewer/metrics/alphanum_p.py | 91.48% | 3 Missing and 1 partial ⚠️ |
| src/bewer/metrics/_alphanum_stats.py | 98.13% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   88.56%   88.90%   +0.33%     
==========================================
  Files          47       52       +5     
  Lines        2782     3073     +291     
  Branches      342      372      +30     
==========================================
+ Hits         2464     2732     +268     
- Misses        237      252      +15     
- Partials       81       89       +8     

Replace the domain-flavoured examples (chemical/unit notation, medical/brand
terms) with an explicit description of what the default regex actually matches.
The metric is a general predicate over Token.raw, not a domain-specific
detector — descriptions should reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The default tokenizer splits on `-`, so multi-token entities like CT-scan and
X-ray were previously only partially detected (just `CT`) or missed entirely
(`X` alone fails the length-≥2 rule). Extend the matcher to recognise
hyphen-connected runs of tokens as compound candidates:

- ALPHANUM_DEFAULT_PATTERN: allow `-` in the body and expand the negative
  lookahead to reject "ordinary capitalised compounds" (init-cap or lowercase
  parts joined by hyphens), so Hello-World, up-to-date, e-mail, mother-in-law,
  state-of-the-art remain rejected while CT-scan, X-ray, T-cell, pre-MRI,
  non-COVID, 5-HT, vitamin-D, MRI-CT, pre-COVID-19, Hello-MRI all match.

- match_token_regex: identify runs of consecutive tokens whose source-text
  gaps consist of only hyphens, try a compound fullmatch against the joined
  standardized substring, and return a multi-token slice when it matches. Fall
  back to per-token matching within the run otherwise.

- _AlphaNumStats: multi-token slices work naturally with the existing
  alignment-based TP/FN/FP classification: the slice spans the same alignment
  ops it would as multiple single-token slices, so any edit inside the
  compound makes it FN, all-MATCH makes it TP.
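
The run detection described for match_token_regex can be sketched roughly as follows. This is an illustration under stated assumptions, not the helper's actual code: `tokenize`, `hyphen_runs`, and `compound_text` are hypothetical names, the toy tokenizer stands in for the default one, and tokens are modelled as `(text, start, end)` tuples.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset); assumed shape


def tokenize(text: str) -> List[Token]:
    """Toy tokenizer (assumption): runs of letters/digits, so `-` and spaces split."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"[^\W_]+", text)]


def hyphen_runs(text: str, tokens: List[Token]) -> List[List[int]]:
    """Group token indices into runs whose source-text gaps are only hyphens."""
    runs: List[List[int]] = []
    current: List[int] = [0] if tokens else []
    for i in range(1, len(tokens)):
        gap = text[tokens[i - 1][2]:tokens[i][1]]
        if gap and set(gap) == {"-"}:
            current.append(i)  # still inside the same hyphen-connected run
        else:
            runs.append(current)  # any other gap (e.g. a space) ends the run
            current = [i]
    if current:
        runs.append(current)
    return runs


def compound_text(text: str, tokens: List[Token], run: List[int]) -> str:
    """Source substring spanned by a run; the candidate for a compound fullmatch."""
    return text[tokens[run[0]][1]:tokens[run[-1]][2]]
```

For the text `pre-COVID-19 and CT scan`, the first three tokens form one run whose joined substring is `pre-COVID-19`, while `CT` and `scan` stay separate because a space sits between them. A real implementation would then try the pattern against the joined substring and fall back to per-token matching when it does not match.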

Tests added: 57 new cases — regex-level (compound match/reject/FP at
fullmatch), helper-level (multi-token slicing, fallback, spaces disabling
compounds, three-part compounds, mixed text), stats-level (perfect match,
case lost, hyphen dropped (still TP), partial loss, spurious in hyp, ordinary
capitalised compound stays at 0 entities).
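
The TP/FN rule stated above for multi-token slices (any edit inside the compound makes it FN, all-MATCH makes it TP) might look like this in miniature. The op representation and the name `classify_ref_slice` are assumptions made for the sketch; the actual _AlphaNumStats internals are not shown in this PR text.

```python
from typing import List, Optional, Tuple

# Assumed op shape: (kind, ref_idx, hyp_idx), with kind in
# {"match", "substitute", "delete", "insert"}; insert ops have ref_idx=None.
Op = Tuple[str, Optional[int], Optional[int]]


def classify_ref_slice(ops: List[Op], ref_slice: slice) -> str:
    """Return "TP" if every op spanned by the reference slice is a match, else "FN"."""
    positions = [
        i
        for i, (_, ref_idx, _) in enumerate(ops)
        if ref_idx is not None and ref_slice.start <= ref_idx < ref_slice.stop
    ]
    if not positions:
        return "FN"
    # Include ops that fall between the first and last covered op, so an
    # insertion in the middle of a compound also counts as an edit.
    window = ops[positions[0]:positions[-1] + 1]
    return "TP" if all(kind == "match" for kind, _, _ in window) else "FN"
```

A two-token reference slice for `CT-scan` aligned against hypothesis tokens `ct`, `scan` yields one substitute and one match, so the whole compound is FN.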

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

T-shirt, D-day, A-frame, S-curve are structurally identical to X-ray and are
correctly identified by the regex as case-distinctive. Same for shouted words
like THE, STOP, NO. They aren't "documented false positives" — they're true
positives of the metric as defined ("token contains a case signal"). Reframe
accordingly:

- Fold T-shirt/D-day/A-frame/S-curve into the compound matches list; remove
  the separate "documented false positives" compound test.
- Fold THE/STOP/NO into the single-token matches list; remove the separate
  single-token "documented false positives" test.
- Strip the apologetic "Note: ..." prefix from AlphaNumP's description on the
  case-sensitivity caveat (it's a property of the metric, not a limitation
  framed against the user's expectations).

No regex or matching behaviour changes — only test grouping and copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two behavior changes:

1. Hyphen structure is now enforced strictly. Previously, ref CT-scan vs hyp
   "CT scan" counted as TP because tokens [CT, scan] were identical on both
   sides — the alignment is token-level and the hyphen is in the source-text
   gap. Now: a multi-token compound ref match requires the corresponding hyp
   tokens to also be hyphen-connected; otherwise FN. Symmetric on the hyp
   side: a hyp compound with no matching hyphen structure in ref counts as
   FP (hyphen invented). Plumbed via a new public helper
   tokens_are_hyphen_connected in regex_match.py and a
   hyp_token_idx-based check in _AlphaNumStats._ref_match_classification
   and fp_alignments.

2. Any token (or compound) containing a Greek letter is treated as an
   entity, regardless of case. Adds a third branch to
   ALPHANUM_DEFAULT_PATTERN: (?=.*\p{Greek})[\p{L}\d][-\p{L}\d]*. So α, β,
   μ, μg, α-helix, β-blocker, γδ now all match. Previously rejected because
   they had no uppercase letter and no digit; the new branch is independent
   of those conditions.
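
The Greek branch uses `\p{…}` classes, which Python's standard `re` module does not support. As a rough stdlib-only approximation of what that branch accepts, the sketch below checks the same three conditions by hand. `matches_greek_branch` is a hypothetical name, and testing for "GREEK" in the Unicode character name is only a stand-in for the script property `\p{Greek}`.

```python
import unicodedata


def _is_letter_or_digit(ch: str) -> bool:
    # str.isalpha() covers the Unicode letter categories (like \p{L});
    # str.isdecimal() covers decimal digits (category Nd).
    return ch.isalpha() or ch.isdecimal()


def _is_greek(ch: str) -> bool:
    # Approximation (assumption): character name contains "GREEK".
    return "GREEK" in unicodedata.name(ch, "")


def matches_greek_branch(token: str) -> bool:
    """Approximate (?=.*\\p{Greek})[\\p{L}\\d][-\\p{L}\\d]* as a full match."""
    if not token or not _is_letter_or_digit(token[0]):
        return False  # must start with a letter or digit, not a hyphen
    if not all(ch == "-" or _is_letter_or_digit(ch) for ch in token[1:]):
        return False  # body may contain only letters, digits, and hyphens
    return any(_is_greek(ch) for ch in token)  # the lookahead condition
```

Under this approximation α, μg, α-helix, and γδ are accepted regardless of case, while an all-Latin lowercase token such as helix is not.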

Tests updated:
- test_compound_hyphen_dropped_keeps_tp → test_compound_hyphen_dropped_is_fn
- New test_hyp_invents_hyphen_is_fp covering the symmetric FP case
- New TestAlphaNumStatsGreekTokens covering single Greek letter, μg, and
  Greek substitution in hyp
- α-helix / β-blocker moved from compound rejects to compound matches
- γ-radiation added to compound matches
- Single Greek tokens (α, β, γ, δ, μ, Δ, Ω) moved from rejects into a new
  test_greek_tokens_match block
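
The two hyphen-structure cases covered by the renamed and new tests can be illustrated with a small sketch. The names `tokenize`, `hyphen_connected`, and `compound_is_tp` are hypothetical, and the signature of the real tokens_are_hyphen_connected helper is not given in this PR text, so nothing here should be read as its implementation.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset); assumed shape


def tokenize(text: str) -> List[Token]:
    """Toy tokenizer (assumption): runs of letters/digits, so `-` and spaces split."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"[^\W_]+", text)]


def hyphen_connected(text: str, tokens: List[Token], indices: List[int]) -> bool:
    """True if the indexed tokens are adjacent and every gap between them is only hyphens."""
    for a, b in zip(indices, indices[1:]):
        if b != a + 1:
            return False
        gap = text[tokens[a][2]:tokens[b][1]]
        if not gap or set(gap) != {"-"}:
            return False
    return True


def compound_is_tp(ref_words: List[str], hyp_text: str, hyp_indices: List[int]) -> bool:
    """A ref compound counts as TP only if the aligned hyp tokens are identical
    and are themselves hyphen-connected in the hypothesis source text."""
    hyp_tokens = tokenize(hyp_text)
    hyp_words = [hyp_tokens[i][0] for i in hyp_indices]
    return hyp_words == ref_words and hyphen_connected(hyp_text, hyp_tokens, hyp_indices)
```

With ref `CT-scan`, a hypothesis of `CT scan` has identical tokens but a space in the gap, so the compound is no longer a TP. The FP case is the mirror image: a hyp compound `CT-scan` is checked against the reference tokens, and `hyphen_connected` fails when the reference reads `CT scan`.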

All 1069 tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

THE, HELLO, STOP, NO, OK all match the regex's case-distinctive rule (≥2
uppercase letters, not init-cap-word shape) but are semantically just shouted
ordinary words, not abbreviations. Re-add the test to explicitly label them
as documented false positives — making the trade-off legible rather than
silently mixed into the main matches list.

T-shirt, D-day, A-frame, S-curve remain in the compound matches list (true
positives — structurally identical to X-ray, indistinguishable without a
vocabulary).
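
To see why shouted words cannot be separated from acronyms by shape alone, consider a rough stand-in for the rule as stated above (at least two uppercase letters, not init-cap-word shape). `looks_case_distinctive` is a hypothetical predicate covering only that part of the rule; it is not the actual ALPHANUM_DEFAULT_PATTERN, whose other branches (digits, mixed case, compounds, Greek) are omitted.

```python
def looks_case_distinctive(token: str) -> bool:
    """Rough stand-in (assumption) for the '>=2 uppercase letters' rule.

    An init-cap word such as "Hello" has exactly one uppercase letter, so the
    count check alone already rejects that shape.
    """
    if not token.isalpha():
        return False
    return sum(ch.isupper() for ch in token) >= 2


# Acronyms and shouted ordinary words pass the same shape test.
acronyms = ["MRI", "CT"]
shouted = ["THE", "HELLO", "STOP", "NO", "OK"]
ordinary = ["Hello", "The", "Patient", "stop"]
```

Every entry in `acronyms` and `shouted` satisfies the predicate and every entry in `ordinary` fails it, which is the trade-off the re-added test labels explicitly.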

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>