Fix train/test leakage from duplicate dataset rows #195

Open
Frhnfaya wants to merge 1 commit into TelecomsXChangeAPi:main from Frhnfaya:fix/dataset-deduplication

@Frhnfaya

What this fixes
Fixes #194.
The combined v2.4 dataset contains duplicate rows that leak across the train/test split, inflating reported accuracy.
Findings (reproducible)

7,718 exact/normalized duplicate rows (5.4% of 145,811)
80 machine-translation artifacts mislabeled as SMS (e.g. "Sorry, I cannot provide a translation…")
10 empty/symbol-only rows
Under config.py's split (test_size=0.2, random_state=42), 7.4% of test rows (2,172) had a duplicate in the training set. After cleaning: 0.0%.
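
For reviewers who want to sanity-check the leakage figure, here is a minimal sketch of how a test-to-train overlap could be measured. It is not the code from this PR: the `normalize` and `leakage_rate` helpers are hypothetical, and the normalization (NFKC, casefold, whitespace collapse) is an assumption about what "normalized duplicate" means here.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFKC, casefold, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def leakage_rate(train_texts, test_texts):
    """Return (count, fraction) of test rows whose normalized text also occurs in train."""
    train_keys = {normalize(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if normalize(t) in train_keys)
    fraction = leaked / len(test_texts) if test_texts else 0.0
    return leaked, fraction


# Toy data for illustration only; not taken from the dataset.
train = ["WIN a FREE prize now!", "See you at 5pm", "Your parcel is waiting"]
test = ["win a free  prize now!", "Lunch tomorrow?"]

leaked, fraction = leakage_rate(train, test)
print(f"{leaked} of {len(test)} test rows leak into train ({fraction:.1%})")
```

On the real data, `train` and `test` would be the text columns of the two halves produced by the config.py split.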

Changes

clean_ots_dataset.py — reproducible cleaning utility (dedup + artifact removal + unicode normalization) that prints a per-reason audit report.
dataset/sms_spam_phishing_dataset_v2.4.1_dedup.csv — cleaned dataset (138,003 rows; label balance preserved).
test_clean_dataset.py — asserts no duplicates, valid labels, no empty text (3 passed).
CHANGELOG entry.
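
As a rough illustration of the cleaning steps listed above (unicode normalization, artifact removal, dedup, per-reason audit), a sketch along these lines could work. This is not the contents of clean_ots_dataset.py. The `clean` function, the reason names, the `(text, label)` row shape, and the choice to dedup on text alone while keeping the first occurrence are all assumptions; the only artifact marker used is the example quoted in the findings.

```python
import unicodedata
from collections import Counter

# Assumption: artifacts are detected by a known prefix (the example from the findings).
ARTIFACT_MARKERS = ("sorry, i cannot provide a translation",)


def normalize(text: str) -> str:
    """NFKC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def clean(rows):
    """Filter (text, label) rows; return (kept_rows, per-reason removal counts)."""
    kept, seen, report = [], set(), Counter()
    for text, label in rows:
        norm = normalize(text)
        key = norm.casefold()  # case-insensitive dedup key (assumption)
        if not any(ch.isalnum() for ch in norm):
            report["empty_or_symbol_only"] += 1
        elif key.startswith(ARTIFACT_MARKERS):
            report["translation_artifact"] += 1
        elif key in seen:
            report["duplicate"] += 1
        else:
            seen.add(key)
            kept.append((norm, label))
    return kept, report


# Toy rows for illustration only; not taken from the dataset.
rows = [
    ("Free entry! Reply WIN", "spam"),
    ("free  entry! reply win", "spam"),
    ("Sorry, I cannot provide a translation for this text.", "ham"),
    ("!!!", "ham"),
    ("", "ham"),
    ("See you at 5pm", "ham"),
]

kept, report = clean(rows)
for reason, count in sorted(report.items()):
    print(f"{reason}: {count}")
print(f"kept: {len(kept)}")
```

Checking each removal reason in a fixed order means every dropped row is counted exactly once, which keeps the audit totals consistent with the row-count difference.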

The original v2.4 file is left untouched; the cleaned set is added alongside it.

Remove 7,718 duplicate rows, 80 translation artifacts, and 10 junk-text rows from the v2.4 SMS dataset. Under config.py's split (random_state=42, test_size=0.2), test-to-train leakage drops from 7.4% to 0%. Adds a reproducible cleaning script and a test. Fixes TelecomsXChangeAPi#194.
Development

Successfully merging this pull request may close these issues.

Dataset v2.4 has duplicate rows causing train/test leakage