From 0f58f195ecc25787b2773f963b5c6f7cc083956e Mon Sep 17 00:00:00 2001
From: Jon Froehlich
Date: Mon, 22 Jun 2026 08:00:22 -0700
Subject: [PATCH] feat(admin): data-health check for artifacts not linked to a project (#649)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a read-only "Artifacts not linked to a project" Data Health check at
/admin/data-health/ so the backlog of unlinked talks, papers, videos, and
posters stays visible and shrinks over time instead of living in a one-off
issue.

- Excludes pre-Makeability-Lab work (date <
  settings.DATE_MAKEABILITYLAB_FORMED), reusing the same cutoff the
  publications view already applies.
- Flags rows whose parent publication is already linked — those can simply
  inherit its projects (quickest wins).
- Each row deep-links to the artifact's admin edit page to add the project
  inline; CSV export; strictly read-only.
- Keeps data-health row-action buttons (e.g. "Open ->") from wrapping
  mid-word in a narrow column (shared detail.html style; also helps existing
  checks).

Regression-tested in website/tests/test_unlinked_artifacts_check.py. Scope and
a deferred semi-automated matching pipeline are documented in
docs/plans/issue-649-link-artifacts-to-projects.md. Bumps version to 2.17.1.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../issue-649-link-artifacts-to-projects.md   | 107 +++++++++++++++++
 makeabilitylab/settings.py                    |   4 +-
 website/admin/data_health/checks/__init__.py  |   1 +
 .../data_health/checks/unlinked_artifacts.py  | 110 ++++++++++++++++++
 .../templates/admin/data_health/detail.html   |   3 +
 .../tests/test_unlinked_artifacts_check.py    |  65 +++++++++++
 6 files changed, 288 insertions(+), 2 deletions(-)
 create mode 100644 docs/plans/issue-649-link-artifacts-to-projects.md
 create mode 100644 website/admin/data_health/checks/unlinked_artifacts.py
 create mode 100644 website/tests/test_unlinked_artifacts_check.py

diff --git a/docs/plans/issue-649-link-artifacts-to-projects.md b/docs/plans/issue-649-link-artifacts-to-projects.md
new file mode 100644
index 00000000..0d742854
--- /dev/null
+++ b/docs/plans/issue-649-link-artifacts-to-projects.md
@@ -0,0 +1,107 @@
+# Issue #649 — Link old talks / papers / videos to projects
+
+> Retroactively populate the `projects` M2M on existing artifacts that currently
+> have none. Source of truth for scoping: `~/Downloads/makeability-prod-2026-06-14.sql.gz`
+> (prod snapshot, 2026-06-14).
+
+## Status (2026-06-22)
+
+**Shipped:** a read-only Data Health check, `UnlinkedArtifactsCheck`
+(`website/admin/data_health/checks/unlinked_artifacts.py`), surfacing unlinked
+artifacts at `/admin/data-health/` — **pre-2012 work excluded** (pre-Makeability-
+Lab — cutoff is `settings.DATE_MAKEABILITYLAB_FORMED`), parent-publication
+propagation flagged, deep-links to each edit page, CSV export, regression-tested
+(`website/tests/test_unlinked_artifacts_check.py`).
+Issue #649 updated with the scope table + decision.
+
+**Deferred:** the semi-automated suggestion/apply pipeline below. Not built —
+held unless the manual route through the health check proves too slow. Kept here
+for reference.
+
+---
+
+## Scope (measured from the prod dump)
+
+| Type        | Total | Linked (≥1 project) | **Unlinked** |
+|-------------|------:|--------------------:|-------------:|
+| Publication |   227 |                 164 |       **63** |
+| Talk        |   187 |                  98 |       **89** |
+| Poster      |     9 |                   8 |        **1** |
+| Video       |    74 |                  30 |       **44** |
+| **Total**   |       |                     |      **197** |
+
+85 projects exist to link against.
+
+### Key data-model facts
+- `Artifact` is **abstract**; each concrete type has its own through table:
+  `website_publication_projects`, `website_talk_projects`,
+  `website_poster_projects`, `website_video_projects`.
+- A `Publication` carries `talk_id`, `video_id`, `poster_id` FKs to its own
+  child artifacts.
+- Publications & talks have authors (`*_authors`) and keywords (`*_keywords`).
+  **Videos have neither** (only title/caption/date) and posters have authors.
+- `ProjectRole(person_id, project_id, lead_project_role, start/end_date)` maps
+  people to projects over time.
+
+### Why naive matching fails
+Author overlap **alone is useless for disambiguation**: all 63 unlinked pubs and
+all 89 unlinked talks each map to *more than one* candidate project (frequent
+authors sit on many projects). Matching must **combine signals and rank**.
+
+## Approach — two tiers
+
+### Tier 1 — Propagation (high confidence, can auto-apply)
+A child artifact inherits the projects of its parent publication.
+From the dump this safely links **23 videos + 2 talks** (their parent pub is
+already linked; child is not). Near-zero risk — the publication is the same
+scholarly artifact.
+
+### Tier 2 — Ranked suggestions (human-reviewed)
+For the remaining ~172 artifacts, score every (artifact, project) pair with a
+weighted blend and emit the top candidates for review:
+
+- **Author ∈ project members** (via `ProjectRole`); weight ↑ for lead role,
+  weight ↑ for role-window overlapping the artifact date.
+- **Keyword overlap** — artifact keywords ∩ project keywords / umbrella keywords.
+- **Title ↔ project** — token/substring match of artifact title against project
+  `name` + `short_name`.
+- **Date proximity** — artifact date within the project's active window.
+
+Videos (no authors/keywords) lean on Tier-1 propagation first, then
+title/caption ↔ project-name matching only.
+
+## Deliverables
+
+1. **`suggest_artifact_projects` management command**
+   `website/management/commands/suggest_artifact_projects.py`
+   - Reads the live DB (so it runs in any environment), computes Tier-1 +
+     Tier-2, writes a reviewable **CSV**: one row per artifact with its top-3
+     ranked project candidates, each with score + human-readable reason, plus an
+     `approved_project_ids` column for the reviewer to fill/edit.
+   - `--tier1-only` flag to emit just the high-confidence propagation set.
+   - Read-only; writes nothing to the DB.
+
+2. **`apply_artifact_projects` management command**
+   `website/management/commands/apply_artifact_projects.py`
+   - Consumes the reviewed CSV and adds the approved links (idempotent — never
+     removes existing links, skips rows already linked).
+   - `--dry-run` prints what it *would* do.
+
+3. **Tests** (`website/tests/test_link_artifacts.py`, `DatabaseTestCase`)
+   - Tier-1 propagation correctness (child gets parent's projects).
+   - Scorer ranking on a small fixture (right project ranks first).
+   - `apply` idempotency + dry-run does nothing.
+
+## Workflow for Jon
+1. Load the prod snapshot into a local scratch DB (so suggestions reflect prod).
+2. `manage.py suggest_artifact_projects --out suggestions.csv`
+3. Review/edit the CSV (approve/correct the `approved_project_ids` column).
+4. `manage.py apply_artifact_projects suggestions.csv --dry-run`, then for real.
+5. **Apply to prod** via the established entrypoint one-shot pattern: ship the
+   reviewed CSV + an `apply` invocation through `docker-entrypoint.sh`, verify in
+   logs. (No direct prod DB access — per repo constraints.)
+
+## Open questions
+- Auto-apply Tier-1 (the 25 propagations) without per-row review, or fold them
+  into the same review CSV?
+- CSV format OK, or prefer reviewing inside Django admin instead?
diff --git a/makeabilitylab/settings.py b/makeabilitylab/settings.py
index d61a4757..a47ac9ae 100644
--- a/makeabilitylab/settings.py
+++ b/makeabilitylab/settings.py
@@ -86,8 +86,8 @@ SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')
 # Makeability Lab Global Variables, including Makeability Lab version
-ML_WEBSITE_VERSION = "2.17.0" # Keep this updated with each release and also change the short description below
-ML_WEBSITE_VERSION_DESCRIPTION = "Add an unauthenticated machine-readable build/version endpoint at /version/ (and /version.json) returning JSON: version, description, environment, git_sha, and built_at. The git short SHA and build timestamp are captured once at container start by docker-entrypoint.sh into a gitignored build-info.json (falling back to 'unknown'), so you can confirm what code a server is actually running without scraping the HTML comment in base.html. Sets Cache-Control: no-store so no proxy serves a stale version (#1366)."
+ML_WEBSITE_VERSION = "2.17.1" # Keep this updated with each release and also change the short description below
+ML_WEBSITE_VERSION_DESCRIPTION = "Add an admin Data Health check, 'Artifacts not linked to a project' (/admin/data-health/), that surfaces every talk, paper, video, and poster with no project assigned so the backlog stays visible and shrinks over time. Pre-Makeability-Lab work (before settings.DATE_MAKEABILITYLAB_FORMED) is excluded, rows whose parent publication is already linked are flagged as quick wins (inherit its projects), and each row deep-links to its admin edit page. Read-only with CSV export. Also keeps data-health row-action buttons from wrapping mid-word in narrow columns (#649)."
 DATE_MAKEABILITYLAB_FORMED = datetime.date(2012, 1, 1) # Date Makeability Lab was formed
 MAX_BANNERS = 7 # Maximum number of banners on a page
diff --git a/website/admin/data_health/checks/__init__.py b/website/admin/data_health/checks/__init__.py
index 557a265c..a248e74d 100644
--- a/website/admin/data_health/checks/__init__.py
+++ b/website/admin/data_health/checks/__init__.py
@@ -8,6 +8,7 @@
     url_name_collisions,
     media_integrity,
     publication_quality,
+    unlinked_artifacts,
     project_health,
     project_leadership,
     position_integrity,
diff --git a/website/admin/data_health/checks/unlinked_artifacts.py b/website/admin/data_health/checks/unlinked_artifacts.py
new file mode 100644
index 00000000..7a6ad0a1
--- /dev/null
+++ b/website/admin/data_health/checks/unlinked_artifacts.py
@@ -0,0 +1,110 @@
+"""
+Data-health check: artifacts not linked to any project (issue #649).
+
+As the lab keeps adding projects, older talks / papers / videos / posters need
+to be linked back to the projects they belong to (the ``projects`` M2M). This
+check surfaces every artifact that currently has **zero** projects so the
+backlog stays visible instead of living in a one-off issue.
+
+Scoping decisions:
+
+- **Pre-Makeability-Lab work is excluded.** Artifacts dated before
+  ``settings.DATE_MAKEABILITYLAB_FORMED`` (the grad-school era; same cutoff the
+  publications view uses) don't belong to a lab project and would be permanent
+  false positives, so they're filtered out. Artifacts with **no date** are kept
+  — a missing date is itself worth a look.
+- **Propagation hint.** A ``Talk`` / ``Video`` / ``Poster`` that is the child of
+  a publication (via ``Publication.talk_id`` / ``video_id`` / ``poster_id``)
+  should inherit that publication's projects. When the parent publication is
+  already linked, the row's ``note`` says so — those are the quickest wins.
+
+Read-only: never calls ``.save()`` or mutates the DB.
+"""
+
+from django.conf import settings
+from django.urls import reverse
+
+from website.admin.data_health.registry import HealthCheck, register_check
+from website.models import Poster, Publication, Talk, Video
+
+
+@register_check
+class UnlinkedArtifactsCheck(HealthCheck):
+    slug = 'unlinked-artifacts'
+    title = 'Artifacts not linked to a project'
+    description = (
+        'Talks, papers, videos, and posters with no project assigned (#649). '
+        'Pre-Makeability-Lab work (before the lab was formed) is excluded. '
+        'Rows whose parent publication is already linked can simply inherit '
+        'its projects.'
+    )
+    group = 'Artifacts'
+    columns = ['type', 'id', 'title', 'date', 'first_author', 'note']
+
+    def get_rows(self):
+        rows = []
+        rows += self._artifact_rows(Publication, 'Publication')
+        rows += self._artifact_rows(Talk, 'Talk', parent_fk='talk')
+        rows += self._artifact_rows(Poster, 'Poster', parent_fk='poster')
+        rows += self._artifact_rows(Video, 'Video', parent_fk='video')
+
+        # Newest first within type; the stable re-sort groups types A-Z.
+        rows.sort(key=lambda r: (r['type'], r['date'] or '', r['id']),
+                  reverse=True)
+        rows.sort(key=lambda r: r['type'])
+        return rows
+
+    def _artifact_rows(self, model, type_label, parent_fk=None):
+        """Build rows for one artifact ``model`` with no linked projects.
+
+        ``parent_fk`` is the ``Publication`` FK name pointing at this child
+        (``'talk'`` / ``'video'`` / ``'poster'``); when set, we note whether the
+        parent publication is already linked so its projects can be inherited.
+        """
+        qs = (model.objects
+              .filter(projects__isnull=True)
+              .prefetch_related('projects'))
+
+        # Map child id -> whether its parent publication has projects.
+        parent_linked = {}
+        if parent_fk:
+            pub_qs = (Publication.objects
+                      .filter(**{f'{parent_fk}__isnull': False})
+                      .values_list(f'{parent_fk}_id', 'projects'))
+            for child_id, project_id in pub_qs:
+                # project_id is None when the parent pub itself has no projects.
+                parent_linked[child_id] = parent_linked.get(child_id, False) or \
+                    project_id is not None
+
+        rows = []
+        for obj in qs:
+            artifact_date = getattr(obj, 'date', None)
+            if artifact_date and artifact_date < settings.DATE_MAKEABILITYLAB_FORMED:
+                continue  # pre-Makeability-Lab; not expected to have a project
+
+            note = ''
+            if parent_fk and parent_linked.get(obj.pk):
+                note = 'parent publication is linked — inherit its projects'
+
+            rows.append({
+                'type': type_label,
+                'id': obj.pk,
+                'title': obj.title,
+                'date': artifact_date.isoformat() if artifact_date else '',
+                'first_author': self._first_author(obj),
+                'note': note,
+            })
+        return rows
+
+    @staticmethod
+    def _first_author(obj):
+        """First-author last name (Videos have no authors → '')."""
+        getter = getattr(obj, 'get_first_author_last_name', None)
+        return getter() if getter else ''
+
+    def row_link(self, row):
+        """Deep-link each row to its artifact edit page so the editor can add
+        the project right there."""
+        model_name = row['type'].lower()
+        url = reverse(f'admin:website_{model_name}_change', args=[row['id']])
+        return ('Open →', url)
diff --git a/website/templates/admin/data_health/detail.html b/website/templates/admin/data_health/detail.html
index 34907ff8..c0ac1301 100644
--- a/website/templates/admin/data_health/detail.html
+++ b/website/templates/admin/data_health/detail.html
@@ -50,6 +50,9 @@
     .dh-actions { margin: 0 0 16px; }
     .dh-rowcount { margin-left: 12px; color: #777; }
     .dh-table-wrap { overflow-x: auto; }
+    /* Keep row-action buttons (e.g. "Open →") on one line instead of wrapping
+       mid-word in a narrow column; the table scrolls horizontally if needed. */
+    .dh-table-wrap td .button { white-space: nowrap; }
     .dh-empty { font-size: 14px; color: #2e7d32; }
     @media (prefers-color-scheme: dark) {
       .dh-desc, .dh-rowcount { color: #aaa; }
diff --git a/website/tests/test_unlinked_artifacts_check.py b/website/tests/test_unlinked_artifacts_check.py
new file mode 100644
index 00000000..5beea0a9
--- /dev/null
+++ b/website/tests/test_unlinked_artifacts_check.py
@@ -0,0 +1,65 @@
+"""
+Regression tests for the "Artifacts not linked to a project" data-health check
+(website/admin/data_health/checks/unlinked_artifacts.py, issue #649).
+"""
+
+from datetime import timedelta
+
+from django.conf import settings
+
+from website.admin.data_health.checks.unlinked_artifacts import (
+    UnlinkedArtifactsCheck,
+)
+from website.tests.base import DatabaseTestCase
+
+
+class UnlinkedArtifactsCheckTests(DatabaseTestCase):
+    def setUp(self):
+        self.check = UnlinkedArtifactsCheck()
+
+    def _rows_by_type_id(self):
+        return {(r['type'], r['id']): r for r in self.check.get_rows()}
+
+    def test_unlinked_artifact_is_flagged(self):
+        pub = self.make_publication(title="Orphan paper", year=2024)
+        rows = self._rows_by_type_id()
+        self.assertIn(('Publication', pub.pk), rows)
+
+    def test_linked_artifact_is_not_flagged(self):
+        pub = self.make_publication(title="Linked paper", year=2024)
+        pub.projects.add(self.make_project(name="Some Project"))
+        rows = self._rows_by_type_id()
+        self.assertNotIn(('Publication', pub.pk), rows)
+
+    def test_pre_lab_artifact_is_excluded(self):
+        """Pre-Makeability-Lab work (grad school) shouldn't be flagged."""
+        formed = settings.DATE_MAKEABILITYLAB_FORMED
+        old = self.make_publication(
+            title="Grad-school paper", date=formed - timedelta(days=1)
+        )
+        recent = self.make_publication(title="Lab paper", date=formed)
+        rows = self._rows_by_type_id()
+        self.assertNotIn(('Publication', old.pk), rows)
+        self.assertIn(('Publication', recent.pk), rows)
+
+    def test_child_of_linked_publication_gets_inherit_note(self):
+        """A talk whose parent publication is linked should carry the
+        propagation hint so it's an easy win."""
+        talk = self.make_talk(title="Conference talk", year=2024)
+        pub = self.make_publication(title="Talk's paper", year=2024, talk=talk)
+        pub.projects.add(self.make_project(name="Parent Project"))
+
+        row = self._rows_by_type_id()[('Talk', talk.pk)]
+        self.assertIn('inherit', row['note'])
+
+    def test_orphan_child_has_no_inherit_note(self):
+        talk = self.make_talk(title="Standalone talk", year=2024)
+        row = self._rows_by_type_id()[('Talk', talk.pk)]
+        self.assertEqual(row['note'], '')
+
+    def test_row_link_points_to_admin_change_page(self):
+        pub = self.make_publication(title="Linkable paper", year=2024)
+        row = self._rows_by_type_id()[('Publication', pub.pk)]
+        label, url = self.check.row_link(row)
+        self.assertIn(str(pub.pk), url)
+        self.assertIn('publication', url)