Fix train/test leakage from duplicate dataset rows by Frhnfaya · Pull Request #195 · TelecomsXChangeAPi/OpenTextShield

Frhnfaya · 2026-05-29T13:17:13Z

What this fixes
Fixes #194.
The combined v2.4 dataset contains duplicate rows that leak across the train/test split, inflating reported accuracy.
Findings (reproducible)

7,718 exact/normalized duplicate rows (5.3% of 145,811)
80 machine-translation artifacts mislabeled as SMS (e.g. "Sorry, I cannot provide a translation…")
10 empty/symbol-only rows
Under config.py's split (test_size=0.2, random_state=42): test→train leakage was 7.4% (2,172 rows). After cleaning: 0.0%.
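
For reviewers who want to re-check the leakage figure, here is a minimal sketch of how test-to-train overlap can be counted once the split has been made. It is not the code from this PR: the `normalize` key (NFKC, casefold, whitespace collapse) and the `leakage_rate` helper are hypothetical, and the normalization actually used by clean_ots_dataset.py may differ.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical duplicate key: NFKC-normalize, casefold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def leakage_rate(train_texts, test_texts):
    """Return (leaked_count, fraction) of test rows whose key also occurs in train."""
    train_keys = {normalize(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if normalize(t) in train_keys)
    fraction = leaked / len(test_texts) if test_texts else 0.0
    return leaked, fraction


# Toy data, not taken from the dataset.
train = ["Win a FREE prize now", "See you at 5", "Your OTP is 1234"]
test = ["win a free  prize now", "Call me later"]

leaked, rate = leakage_rate(train, test)
print(leaked, rate)  # 1 0.5
```

On the real data, `train_texts` and `test_texts` would be the text columns of the two partitions produced by config.py's split.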

Changes

clean_ots_dataset.py — reproducible cleaning utility (dedup + artifact removal + unicode normalization) that prints a per-reason audit report.
dataset/sms_spam_phishing_dataset_v2.4.1_dedup.csv — cleaned dataset (138,003 rows; label balance preserved).
test_clean_dataset.py — asserts no duplicates, valid labels, no empty text (3 passed).
CHANGELOG entry.

The original v2.4 file is left untouched; the cleaned set is added alongside it.
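
The cleaning pass listed under Changes could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of clean_ots_dataset.py: the `clean_rows` function, the reason names, the single artifact prefix, and the `spam`/`ham` labels in the toy rows are all hypothetical. Duplicates are keyed on normalized text only, and the first occurrence is kept.

```python
import unicodedata
from collections import Counter

# Hypothetical artifact marker based on the example quoted in the findings;
# the real pattern list is not shown in this PR text.
ARTIFACT_PREFIXES = ("sorry, i cannot provide a translation",)


def normalize(text: str) -> str:
    """Hypothetical duplicate key: NFKC-normalize, casefold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def clean_rows(rows):
    """Take (text, label) pairs; return (kept_rows, Counter of removal reasons)."""
    seen = set()
    kept = []
    audit = Counter()
    for text, label in rows:
        key = normalize(text)
        if not any(ch.isalnum() for ch in key):
            audit["empty_or_symbol_only"] += 1
        elif key.startswith(ARTIFACT_PREFIXES):
            audit["translation_artifact"] += 1
        elif key in seen:
            audit["duplicate"] += 1
        else:
            seen.add(key)
            kept.append((text, label))
    return kept, audit


# Toy rows, not taken from the dataset.
rows = [
    ("Win a FREE prize now", "spam"),
    ("win a free  prize now", "spam"),
    ("Sorry, I cannot provide a translation for that.", "ham"),
    ("!!!", "ham"),
    ("See you at 5", "ham"),
]

kept, audit = clean_rows(rows)
for reason, count in sorted(audit.items()):
    print(f"{reason}: {count}")
print(f"kept: {len(kept)}")
```

Because the function returns new rows and never writes to its input, it matches the approach of leaving the original file untouched and saving the cleaned set alongside it.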

Remove 7,718 duplicate rows, 80 translation artifacts, and 10 junk-text rows from the v2.4 SMS dataset. Under config.py's split (random_state=42, test_size=0.2), test-to-train leakage drops from 7.4% to 0%. Add a reproducible cleaning script and a test. Fixes TelecomsXChangeAPi#194.

Frhnfaya wants to merge 1 commit into TelecomsXChangeAPi:main from Frhnfaya:fix/dataset-deduplication
