107 changes: 107 additions & 0 deletions docs/plans/issue-649-link-artifacts-to-projects.md
@@ -0,0 +1,107 @@
# Issue #649 — Link old talks / papers / videos to projects

> Retroactively populate the `projects` M2M on existing artifacts that currently
> have none. Source of truth for scoping: `~/Downloads/makeability-prod-2026-06-14.sql.gz`
> (prod snapshot, 2026-06-14).

## Status (2026-06-22)

**Shipped:** a read-only Data Health check, `UnlinkedArtifactsCheck`
(`website/admin/data_health/checks/unlinked_artifacts.py`), that lists unlinked
artifacts at `/admin/data-health/`. It:

- excludes pre-Makeability-Lab work, i.e. artifacts dated before
  `settings.DATE_MAKEABILITYLAB_FORMED` (2012-01-01);
- flags rows whose parent publication is already linked, so they can inherit
  its projects;
- deep-links each row to its admin edit page;
- supports CSV export;
- is regression-tested in `website/tests/test_unlinked_artifacts_check.py`.

Issue #649 has been updated with the scope table and the decision.

**Deferred:** the semi-automated suggestion/apply pipeline below. Not built —
held unless the manual route through the health check proves too slow. Kept here
for reference.

---

## Scope (measured from the prod dump)

| Type | Total | Linked (≥1 project) | **Unlinked** |
|-------------|------:|--------------------:|-------------:|
| Publication | 227 | 164 | **63** |
| Talk | 187 | 98 | **89** |
| Poster | 9 | 8 | **1** |
| Video | 74 | 30 | **44** |
| **Total** | 497 | 300 | **197** |

85 projects exist to link against.

### Key data-model facts
- `Artifact` is **abstract**; each concrete type has its own through table:
`website_publication_projects`, `website_talk_projects`,
`website_poster_projects`, `website_video_projects`.
- A `Publication` carries `talk_id`, `video_id`, `poster_id` FKs to its own
child artifacts.
- Publications & talks have authors (`*_authors`) and keywords (`*_keywords`).
Posters have authors. **Videos have neither** (only title/caption/date).
- `ProjectRole(person_id, project_id, lead_project_role, start/end_date)` maps
people to projects over time.

### Why naive matching fails
Author overlap **alone is useless for disambiguation**: all 63 unlinked pubs and
all 89 unlinked talks each map to *more than one* candidate project (frequent
authors sit on many projects). Matching must **combine signals and rank**.
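
To make the ambiguity concrete, here is a minimal sketch using made-up people and projects. The `candidate_projects` helper and all names in `PROJECT_MEMBERS` are hypothetical and not taken from the codebase. It shows how one prolific author makes author overlap return several projects at once:

```python
def candidate_projects(artifact_authors, project_members):
    """Return the sorted names of projects sharing at least one author."""
    authors = set(artifact_authors)
    return sorted(
        name for name, members in project_members.items()
        if authors & set(members)
    )


# Hypothetical data: "pi" is a member of every project.
PROJECT_MEMBERS = {
    "ProjectA": {"pi", "alice"},
    "ProjectB": {"pi", "bob"},
    "ProjectC": {"pi", "carol"},
}

# Every project matches, so author overlap alone cannot pick one.
ambiguous = candidate_projects(["pi", "alice"], PROJECT_MEMBERS)
```

Because `pi` appears everywhere, the result contains all three projects; only an extra signal (keywords, title, dates) could separate them.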

## Approach — two tiers

### Tier 1 — Propagation (high confidence, can auto-apply)
A child artifact inherits the projects of its parent publication.
From the dump this safely links **23 videos + 2 talks** (their parent pub is
already linked; the child is not). The risk is near zero because the child
presents the same scholarly work as its parent publication.
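
The propagation rule can be sketched without Django, using plain dicts in place of the models. The `propagate_projects` function and its input shapes are assumptions made for illustration; only the `talk_id` / `video_id` / `poster_id` names come from the data model described above.

```python
def propagate_projects(publications, child_projects):
    """Suggest projects for unlinked children of linked publications.

    publications: list of dicts with 'projects' (set of project ids) and
        optional 'talk_id' / 'video_id' / 'poster_id' keys.
    child_projects: dict mapping (kind, child_id) -> set of project ids
        already linked to that child.
    Returns {(kind, child_id): set of inherited project ids}.
    """
    suggestions = {}
    for pub in publications:
        if not pub["projects"]:
            continue  # an unlinked parent has nothing to propagate
        for kind in ("talk", "video", "poster"):
            child_id = pub.get(f"{kind}_id")
            if child_id is None:
                continue
            key = (kind, child_id)
            if child_projects.get(key):
                continue  # child already linked; leave it alone
            suggestions.setdefault(key, set()).update(pub["projects"])
    return suggestions


# Hypothetical data: pub 1 is linked, pub 2 is not; video 7 is already linked.
pubs = [
    {"id": 1, "projects": {10}, "talk_id": 5, "video_id": 7},
    {"id": 2, "projects": set(), "talk_id": 6},
]
already_linked = {("video", 7): {10}}
result = propagate_projects(pubs, already_linked)
```

Only talk 5 receives a suggestion here: video 7 already has a project, and talk 6 has an unlinked parent.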

### Tier 2 — Ranked suggestions (human-reviewed)
For the remaining ~172 artifacts, score every (artifact, project) pair with a
weighted blend and emit the top candidates for review:

- **Author ∈ project members** (via `ProjectRole`); weight ↑ for lead role,
weight ↑ for role-window overlapping the artifact date.
- **Keyword overlap** — artifact keywords ∩ project keywords / umbrella keywords.
- **Title ↔ project** — token/substring match of artifact title against project
`name` + `short_name`.
- **Date proximity** — artifact date within the project's active window.

Videos (no authors/keywords) lean on Tier-1 propagation first, then
title/caption ↔ project-name matching only.
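
One possible shape for the weighted blend is sketched below. The plan does not specify weights, so every value in `WEIGHTS` is a placeholder, and the `Role`, `Project`, and `Artifact` dataclasses are simplified stand-ins for the real models rather than their actual fields.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical weights; the plan does not give any values.
WEIGHTS = {"author": 1.0, "lead": 0.5, "in_window": 0.5,
           "keyword": 2.0, "title": 3.0, "date": 1.0}


@dataclass
class Role:
    person: str
    start: date
    end: Optional[date] = None  # None means the role is still active
    is_lead: bool = False


@dataclass
class Project:
    name: str
    short_name: str
    start: date
    end: Optional[date] = None
    keywords: set = field(default_factory=set)
    roles: list = field(default_factory=list)


@dataclass
class Artifact:
    title: str
    when: date
    authors: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def in_window(day, start, end):
    """True if day falls inside [start, end]; an open end never expires."""
    return start <= day and (end is None or day <= end)


def score(artifact, project):
    """Return (score, reasons) for one (artifact, project) pair."""
    total, reasons = 0.0, []
    for role in project.roles:
        if role.person not in artifact.authors:
            continue
        points = WEIGHTS["author"]
        if role.is_lead:
            points += WEIGHTS["lead"]
        if in_window(artifact.when, role.start, role.end):
            points += WEIGHTS["in_window"]
        total += points
        reasons.append(f"author {role.person}")
    shared = artifact.keywords & project.keywords
    if shared:
        total += WEIGHTS["keyword"] * len(shared)
        reasons.append("keywords " + ", ".join(sorted(shared)))
    title = artifact.title.lower()
    # Skip empty names so an empty short_name cannot match every title.
    names = [n.lower() for n in (project.name, project.short_name) if n]
    if any(n in title for n in names):
        total += WEIGHTS["title"]
        reasons.append("title mentions project")
    if in_window(artifact.when, project.start, project.end):
        total += WEIGHTS["date"]
        reasons.append("date in project window")
    return total, reasons


def top_candidates(artifact, projects, k=3):
    """Rank projects by score, drop zero scores, and keep the top k."""
    scored = [(score(artifact, p)[0], p.name) for p in projects]
    return sorted((s for s in scored if s[0] > 0),
                  key=lambda s: (-s[0], s[1]))[:k]
```

Returning the reasons alongside the score is what would let the review CSV carry a human-readable explanation for each candidate.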

## Deliverables

1. **`suggest_artifact_projects` management command**
   `website/management/commands/suggest_artifact_projects.py`
   - Reads the live DB (so it runs in any environment), computes Tier-1 +
     Tier-2, writes a reviewable **CSV**: one row per artifact with its top-3
     ranked project candidates, each with score + human-readable reason, plus an
     `approved_project_ids` column for the reviewer to fill/edit.
   - `--tier1-only` flag to emit just the high-confidence propagation set.
   - Read-only; writes nothing to the DB.

2. **`apply_artifact_projects` management command**
   `website/management/commands/apply_artifact_projects.py`
   - Consumes the reviewed CSV and adds the approved links. Idempotent: it never
     removes existing links and skips rows already linked.
   - `--dry-run` prints what it *would* do.

3. **Tests** (`website/tests/test_link_artifacts.py`, `DatabaseTestCase`)
   - Tier-1 propagation correctness (child gets parent's projects).
   - Scorer ranking on a small fixture (right project ranks first).
   - `apply` is idempotent, and `--dry-run` changes nothing.
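
The idempotent, dry-run-aware apply step could look roughly like the sketch below. It is not the planned command: an in-memory dict stands in for the M2M tables, the `apply_links` name, the `artifact_type` / `artifact_id` columns, and the `;` separator are all assumptions, and only `approved_project_ids` comes from the plan. It also reads "already linked" as "this exact link already exists".

```python
import csv
import io


def apply_links(csv_text, links, dry_run=False):
    """Add approved (artifact, project) links from a reviewed CSV.

    csv_text: CSV with 'artifact_type', 'artifact_id' and
        'approved_project_ids' columns (ids separated by ';').
    links: dict mapping (artifact_type, artifact_id) -> set of project ids;
        stands in for the M2M tables and is mutated unless dry_run is True.
    Returns the list of (artifact_type, artifact_id, project_id) additions.
    """
    added = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["artifact_type"], int(row["artifact_id"]))
        approved = {int(p) for p in row["approved_project_ids"].split(";")
                    if p.strip()}
        current = links.get(key, set())
        for project_id in sorted(approved - current):  # only missing links
            added.append((*key, project_id))
            if not dry_run:
                links.setdefault(key, set()).add(project_id)
    return added


# Hypothetical reviewed CSV: talk 5 approved for projects 10 and 11,
# video 7 left blank by the reviewer.
REVIEWED = (
    "artifact_type,artifact_id,approved_project_ids\n"
    "talk,5,10;11\n"
    "video,7,\n"
)
```

Because the function only ever adds the set difference between approved and current links, running it twice produces no further changes, which is the property the idempotency test would check.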

## Workflow for Jon
1. Load the prod snapshot into a local scratch DB (so suggestions reflect prod).
2. `manage.py suggest_artifact_projects --out suggestions.csv`
3. Review/edit the CSV (approve/correct the `approved_project_ids` column).
4. `manage.py apply_artifact_projects suggestions.csv --dry-run`, then for real.
5. **Apply to prod** via the established entrypoint one-shot pattern: ship the
reviewed CSV + an `apply` invocation through `docker-entrypoint.sh`, verify in
logs. (No direct prod DB access — per repo constraints.)

## Open questions
- Auto-apply Tier-1 (the 25 propagations) without per-row review, or fold them
into the same review CSV?
- CSV format OK, or prefer reviewing inside Django admin instead?
4 changes: 2 additions & 2 deletions makeabilitylab/settings.py
@@ -86,8 +86,8 @@
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')

# Makeability Lab Global Variables, including Makeability Lab version
-ML_WEBSITE_VERSION = "2.17.0" # Keep this updated with each release and also change the short description below
-ML_WEBSITE_VERSION_DESCRIPTION = "Add an unauthenticated machine-readable build/version endpoint at /version/ (and /version.json) returning JSON: version, description, environment, git_sha, and built_at. The git short SHA and build timestamp are captured once at container start by docker-entrypoint.sh into a gitignored build-info.json (falling back to 'unknown'), so you can confirm what code a server is actually running without scraping the HTML comment in base.html. Sets Cache-Control: no-store so no proxy serves a stale version (#1366)."
+ML_WEBSITE_VERSION = "2.17.1" # Keep this updated with each release and also change the short description below
+ML_WEBSITE_VERSION_DESCRIPTION = "Add an admin Data Health check, 'Artifacts not linked to a project' (/admin/data-health/), that surfaces every talk, paper, video, and poster with no project assigned so the backlog stays visible and shrinks over time. Pre-Makeability-Lab work (before settings.DATE_MAKEABILITYLAB_FORMED) is excluded, rows whose parent publication is already linked are flagged as quick wins (inherit its projects), and each row deep-links to its admin edit page. Read-only with CSV export. Also keeps data-health row-action buttons from wrapping mid-word in narrow columns (#649)."
DATE_MAKEABILITYLAB_FORMED = datetime.date(2012, 1, 1) # Date Makeability Lab was formed
MAX_BANNERS = 7 # Maximum number of banners on a page

1 change: 1 addition & 0 deletions website/admin/data_health/checks/__init__.py
@@ -8,6 +8,7 @@
url_name_collisions,
media_integrity,
publication_quality,
unlinked_artifacts,
project_health,
project_leadership,
position_integrity,
110 changes: 110 additions & 0 deletions website/admin/data_health/checks/unlinked_artifacts.py
@@ -0,0 +1,110 @@
"""
Data-health check: artifacts not linked to any project (issue #649).

As the lab keeps adding projects, older talks / papers / videos / posters need
to be linked back to the projects they belong to (the ``projects`` M2M). This
check surfaces every artifact that currently has **zero** projects so the
backlog stays visible instead of living in a one-off issue.

Scoping decisions:

- **Pre-Makeability-Lab work is excluded.** Artifacts dated before
``settings.DATE_MAKEABILITYLAB_FORMED`` (the grad-school era; same cutoff the
publications view uses) don't belong to a lab project and would be permanent
false positives, so they're filtered out. Artifacts with **no date** are kept
— a missing date is itself worth a look.
- **Propagation hint.** A ``Talk`` / ``Video`` / ``Poster`` that is the child of
a publication (via ``Publication.talk_id`` / ``video_id`` / ``poster_id``)
should inherit that publication's projects. When the parent publication is
already linked, the row's ``note`` says so — those are the quickest wins.

Read-only: never calls ``.save()`` or mutates the DB.
"""

from django.conf import settings
from django.urls import reverse

from website.admin.data_health.registry import HealthCheck, register_check
from website.models import Poster, Publication, Talk, Video


@register_check
class UnlinkedArtifactsCheck(HealthCheck):
    slug = 'unlinked-artifacts'
    title = 'Artifacts not linked to a project'
    description = (
        'Talks, papers, videos, and posters with no project assigned (#649). '
        'Pre-Makeability-Lab work (before the lab was formed) is excluded. '
        'Rows whose parent publication is already linked can simply inherit '
        'its projects.'
    )
    group = 'Artifacts'
    columns = ['type', 'id', 'title', 'date', 'first_author', 'note']

    def get_rows(self):
        rows = []
        rows += self._artifact_rows(Publication, 'Publication')
        rows += self._artifact_rows(Talk, 'Talk', parent_fk='talk')
        rows += self._artifact_rows(Poster, 'Poster', parent_fk='poster')
        rows += self._artifact_rows(Video, 'Video', parent_fk='video')

        # Newest first within each type (undated rows last). The second sort is
        # stable, so it groups rows by type label alphabetically (Poster,
        # Publication, Talk, Video) while keeping that newest-first order.
        rows.sort(key=lambda r: (r['type'], r['date'] or '', r['id']),
                  reverse=True)
        rows.sort(key=lambda r: r['type'])
        return rows

    def _artifact_rows(self, model, type_label, parent_fk=None):
        """Build rows for one artifact ``model`` with no linked projects.

        ``parent_fk`` is the ``Publication`` FK name pointing at this child
        (``'talk'`` / ``'video'`` / ``'poster'``); when set, we note whether the
        parent publication is already linked so its projects can be inherited.
        """
        qs = (model.objects
              .filter(projects__isnull=True)
              .prefetch_related('projects'))

        # Map child id -> whether its parent publication has projects.
        parent_linked = {}
        if parent_fk:
            pub_qs = (Publication.objects
                      .filter(**{f'{parent_fk}__isnull': False})
                      .values_list(f'{parent_fk}_id', 'projects'))
            for child_id, project_id in pub_qs:
                # project_id is None when the parent pub itself has no projects.
                parent_linked[child_id] = parent_linked.get(child_id, False) or \
                    project_id is not None

        rows = []
        for obj in qs:
            artifact_date = getattr(obj, 'date', None)
            if artifact_date and artifact_date < settings.DATE_MAKEABILITYLAB_FORMED:
                continue  # pre-Makeability-Lab; not expected to have a project

            note = ''
            if parent_fk and parent_linked.get(obj.pk):
                note = 'parent publication is linked — inherit its projects'

            rows.append({
                'type': type_label,
                'id': obj.pk,
                'title': obj.title,
                'date': artifact_date.isoformat() if artifact_date else '',
                'first_author': self._first_author(obj),
                'note': note,
            })
        return rows

    @staticmethod
    def _first_author(obj):
        """First-author last name (Videos have no authors → '')."""
        getter = getattr(obj, 'get_first_author_last_name', None)
        return getter() if getter else ''

    def row_link(self, row):
        """Deep-link each row to its artifact edit page so the editor can add
        the project right there."""
        model_name = row['type'].lower()
        url = reverse(f'admin:website_{model_name}_change', args=[row['id']])
        return ('Open →', url)
3 changes: 3 additions & 0 deletions website/templates/admin/data_health/detail.html
@@ -50,6 +50,9 @@
.dh-actions { margin: 0 0 16px; }
.dh-rowcount { margin-left: 12px; color: #777; }
.dh-table-wrap { overflow-x: auto; }
/* Keep row-action buttons (e.g. "Open →") on one line instead of wrapping
mid-word in a narrow column; the table scrolls horizontally if needed. */
.dh-table-wrap td .button { white-space: nowrap; }
.dh-empty { font-size: 14px; color: #2e7d32; }
@media (prefers-color-scheme: dark) {
.dh-desc, .dh-rowcount { color: #aaa; }
65 changes: 65 additions & 0 deletions website/tests/test_unlinked_artifacts_check.py
@@ -0,0 +1,65 @@
"""
Regression tests for the "Artifacts not linked to a project" data-health check
(website/admin/data_health/checks/unlinked_artifacts.py, issue #649).
"""

from datetime import timedelta

from django.conf import settings

from website.admin.data_health.checks.unlinked_artifacts import (
    UnlinkedArtifactsCheck,
)
from website.tests.base import DatabaseTestCase


class UnlinkedArtifactsCheckTests(DatabaseTestCase):
    def setUp(self):
        self.check = UnlinkedArtifactsCheck()

    def _rows_by_type_id(self):
        return {(r['type'], r['id']): r for r in self.check.get_rows()}

    def test_unlinked_artifact_is_flagged(self):
        pub = self.make_publication(title="Orphan paper", year=2024)
        rows = self._rows_by_type_id()
        self.assertIn(('Publication', pub.pk), rows)

    def test_linked_artifact_is_not_flagged(self):
        pub = self.make_publication(title="Linked paper", year=2024)
        pub.projects.add(self.make_project(name="Some Project"))
        rows = self._rows_by_type_id()
        self.assertNotIn(('Publication', pub.pk), rows)

    def test_pre_lab_artifact_is_excluded(self):
        """Pre-Makeability-Lab work (grad school) shouldn't be flagged."""
        formed = settings.DATE_MAKEABILITYLAB_FORMED
        old = self.make_publication(
            title="Grad-school paper", date=formed - timedelta(days=1)
        )
        recent = self.make_publication(title="Lab paper", date=formed)
        rows = self._rows_by_type_id()
        self.assertNotIn(('Publication', old.pk), rows)
        self.assertIn(('Publication', recent.pk), rows)

    def test_child_of_linked_publication_gets_inherit_note(self):
        """A talk whose parent publication is linked should carry the
        propagation hint so it's an easy win."""
        talk = self.make_talk(title="Conference talk", year=2024)
        pub = self.make_publication(title="Talk's paper", year=2024, talk=talk)
        pub.projects.add(self.make_project(name="Parent Project"))

        row = self._rows_by_type_id()[('Talk', talk.pk)]
        self.assertIn('inherit', row['note'])

    def test_orphan_child_has_no_inherit_note(self):
        talk = self.make_talk(title="Standalone talk", year=2024)
        row = self._rows_by_type_id()[('Talk', talk.pk)]
        self.assertEqual(row['note'], '')

    def test_row_link_points_to_admin_change_page(self):
        pub = self.make_publication(title="Linkable paper", year=2024)
        row = self._rows_by_type_id()[('Publication', pub.pk)]
        label, url = self.check.row_link(row)
        self.assertIn(str(pub.pk), url)
        self.assertIn('publication', url)