Keyword dedup: merge action, duplicate-keywords health check, whitespace ward (#1352) #1361
Merged
… check (#1352)

Keywords are free-text with no uniqueness constraint, so case/whitespace variants (Speech / speech / Speech ) coexist and fragment the public keyword pages. This adds both halves of the cleanup loop:

Finder: a read-only "Duplicate keywords" data-health check that clusters keywords sharing a normalized key (strip + casefold) and surfaces per-model usage counts, so the editor can see which variant to keep.

Fixer: a destructive "Merge selected keywords" admin action on the Keyword changelist. Select 2+ keywords -> intermediate confirmation page to pick the target -> reassigns every reference across all six keyword-holding models (Publication/Talk/Poster/Grant/Project/ProjectUmbrella) onto the target, then deletes the rest. Reattach is via obj.keywords.add(target) (idempotent, so an object already tagged with the target gains no duplicate row) inside a transaction; the source's deletion drops its own M2M rows.

Tests cover reassignment across all six relations, source deletion, the no-duplicate dedup case, the target-in-sources guard, the single-selection no-op, the confirm-POST round trip, and the finder's clustering.

Admin-only; no model or migration changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
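The finder's clustering rule (strip + casefold) can be illustrated on plain strings. This is a minimal sketch, not the PR's code: the function names are hypothetical, and the real check works on Keyword rows and also reports per-model usage counts, which are omitted here.

```python
from collections import defaultdict


def normalized_key(keyword: str) -> str:
    """Strip surrounding whitespace, then casefold (the key described above)."""
    return keyword.strip().casefold()


def find_duplicate_clusters(keywords):
    """Group keyword strings by normalized key; keep groups with 2+ variants."""
    clusters = defaultdict(list)
    for kw in keywords:
        clusters[normalized_key(kw)].append(kw)
    return {key: variants for key, variants in clusters.items() if len(variants) > 1}


# "Speech", "speech" and "Speech " share the key "speech"; VR and HCI are unique.
clusters = find_duplicate_clusters(["Speech", "speech", "Speech ", "VR", "HCI"])
print(clusters)
```

Only clusters with two or more variants are returned, which matches the idea of a check that lists problems rather than every keyword.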
Layer-1 ward against near-duplicate keywords: Keyword.save() now trims the ends and collapses internal whitespace runs to a single space, so "Speech ", " Speech", and "Speech  recognition" (doubled internal space) can't coexist with their clean forms. Catches every creation path, including the inline "add keyword" widget on Publication/Project forms. No migration, no data cleanup needed.

Casing is intentionally preserved (VR, HCI, iOS). Case-insensitive uniqueness (blocking "Speech" vs "speech") is the separate layer-2 DB constraint, deferred until existing prod dupes are merged with the new action. That leaves the data-health finder responsible for exactly the remaining case-variant class.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
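The normalization described above (trim the ends, collapse internal whitespace runs, keep casing) can be sketched as a standalone function. The name normalize_keyword_whitespace is hypothetical, and how Keyword.save() actually implements the rule is not shown in this PR text; this is one straightforward way to get the stated behavior.

```python
def normalize_keyword_whitespace(value: str) -> str:
    """Trim both ends and collapse each internal whitespace run to one space.

    str.split() with no arguments splits on any whitespace run and drops
    leading/trailing whitespace, so re-joining with a single space does both.
    Casing is left untouched.
    """
    return " ".join(value.split())


for raw in ["Speech ", " Speech", "Speech  \t recognition", "iOS"]:
    print(repr(raw), "->", repr(normalize_keyword_whitespace(raw)))
```

In a Django model this would typically be called at the top of save() before delegating to super().save(), so every creation path passes through it.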
…st (#1352)

Wires the finder to the fixer. The "Duplicate keywords" detail page now shows a per-row "Merge in admin →" link that opens the Keyword changelist pre-filtered (?q=<cluster key>) to exactly that cluster's variants, so the editor can select-all and run the merge action instead of re-finding them by hand.

Implemented as an opt-in HealthCheck.row_link((label, url)) hook: the detail view adds an Action column only when a check provides links, so the other nine checks and the CSV export are unchanged. Covered by a row_link URL test and an end-to-end superuser render test asserting the link is in the page.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
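The deep link is a changelist URL with the cluster key in the q parameter. The sketch below shows one way to build such a (label, url) pair with proper query encoding. The changelist path and the function name are assumptions for illustration; a real implementation would more likely resolve the path with Django's reverse() than hard-code it.

```python
from urllib.parse import urlencode

# Assumed admin path, for illustration only; not taken from the PR.
KEYWORD_CHANGELIST_URL = "/admin/website/keyword/"


def merge_row_link(cluster_key: str, changelist_url: str = KEYWORD_CHANGELIST_URL):
    """Return a (label, url) pair that pre-filters the changelist to one cluster."""
    query = urlencode({"q": cluster_key})  # encodes spaces and special characters
    return ("Merge in admin →", f"{changelist_url}?{query}")


label, url = merge_row_link("speech recognition")
print(label, url)
```

Encoding the key through urlencode matters because cluster keys can contain spaces or characters such as & that would otherwise break the query string.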
Implements #1352 — the destructive keyword-merge tooling deferred from the #1346 Phase 4 audit — plus the finder and a layer-1 prevention ward.
What
Finder: "Duplicate keywords" data-health check. Read-only check that clusters keywords sharing a normalized key (strip + casefold) and surfaces per-model usage counts, so the editor can see which variant to keep. Each row deep-links ("Merge in admin →") to the Keyword changelist pre-filtered (?q=<cluster key>) to that cluster.

Fixer: "Merge selected keywords" admin action. Select 2+ keywords on the Keyword changelist → intermediate confirmation page to pick the target → reassigns every reference across all six keyword-holding models (Publication/Talk/Poster/Grant/Project/ProjectUmbrella) onto the target, then deletes the rest. Reattach is via obj.keywords.add(target) (idempotent: an object already tagged with the target gains no duplicate row) inside a transaction; the source's deletion drops its own M2M rows.

Layer-1 ward: whitespace normalization on save. Keyword.save() trims ends and collapses internal whitespace runs, so "Speech " / " Speech" can't coexist with their clean forms. Catches every creation path including the inline "add keyword" widget. No migration, casing preserved (VR, HCI, iOS).

How to use
Admin → Configuration → Keywords → tick 2+ → Action dropdown → Merge selected keywords → pick target → Merge. Or start from Data Health → Duplicate keywords and click Merge in admin →.
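What the final Merge step does to the data can be modeled with plain sets standing in for each object's keywords relation. This is an in-memory sketch only: the names are hypothetical, there is no transaction, and treating a target that also appears among the sources as "drop it from the sources" is an assumption about how the guard behaves, not something the PR text specifies.

```python
def merge_keywords(tagged_objects, sources, target):
    """Reassign every use of the source keywords onto target, then drop the sources.

    tagged_objects maps an object name to its set of keyword names and is
    mutated in place. Returns the number of objects that were reassigned.
    """
    sources = set(sources) - {target}  # assumed guard: never delete the target
    reassigned = 0
    for keywords in tagged_objects.values():
        if keywords & sources:
            keywords.add(target)                 # idempotent, like keywords.add(target)
            keywords.difference_update(sources)  # stands in for the deleted M2M rows
            reassigned += 1
    return reassigned


objs = {
    "pub1": {"speech", "VR"},
    "talk1": {"Speech", "speech"},  # already tagged with the target
    "grant1": {"Speech"},
    "poster1": {"HCI"},
}
count = merge_keywords(objs, sources=["speech", "Speech"], target="Speech")
print(count, objs)
```

The talk1 entry is the dedup case: it already carries the target, so adding it again changes nothing and the object ends up with a single "Speech" tag.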
Scope
Admin-only; no model or migration changes (not Pa11y-scanned). The merge UI is the standard Django intermediate-action confirmation page.
Layer 2 (follow-up, not in this PR)
Case-insensitive uniqueness (blocking Speech vs speech) is a DB UniqueConstraint(Lower('keyword')), a migration that will fail to apply while dupes exist, so it must come after a prod dedup pass using this tool. Tracked separately.
Test
python manage.py test website.tests.test_keyword_merge website.tests.test_data_health --settings=makeabilitylab.settings_test
Green (incl. all-six-relation reassignment, dedup, single-selection no-op, confirm round-trip, whitespace normalization, finder clustering, and an end-to-end deep-link render).
🤖 Generated with Claude Code