Add deterministic memory garbage collection with pruneStaleNodes #60
… graph cruft Add `POST /maintenance/prune-stale-nodes` + `MemoryClient.pruneStaleNodes`: a preview-then-apply, LLM-free GC pass that scores every entity/task node on a transparent weighted sum of staleness, isolation, weak provenance, and claim decay, then prunes the disposable tail. One `aggressiveness` knob maps to the score threshold (or pin `minScore`). `dryRun` defaults true and returns ranked candidates with per-node `reasons`; re-call with `dryRun: false` to delete (cascades through claims, source links, aliases, embeddings). Protected: nodes active within `minIdleDays`, open tasks, the user's self-identity node(s), and reference-scope nodes unless opted in. Complements `pruneOrphanNodes` (evidence-free) and `dedupSweep` (exact-label).
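The preview-then-apply flow described above could be driven from calling code roughly as follows. This is a minimal sketch, not the SDK's actual types: the field names come from the PR text, but the `StaleNodePruner` interface, the `score` field on candidates, the `confirm` callback, and the idea of pinning `minScore` to the previewed `appliedThreshold` are all assumptions made for illustration.

```typescript
// Hypothetical shapes: field names follow the PR text, the types are assumed.
interface PruneRequest {
  aggressiveness?: number; // 0–1 knob mapped to a score threshold
  minScore?: number; // explicit threshold, used instead of the knob
  dryRun?: boolean; // the PR states this defaults to true
}

interface PruneCandidate {
  id: string;
  score: number; // assumed field; the PR only says candidates are ranked
  reasons: string[];
}

interface PruneResponse {
  appliedThreshold: number;
  candidateCount: number;
  deletedCount: number;
  candidates: PruneCandidate[];
}

// Stand-in for MemoryClient; only the one method is modelled.
interface StaleNodePruner {
  pruneStaleNodes(req: PruneRequest): Promise<PruneResponse>;
}

// Preview first, let the caller inspect the ranked candidates, then apply
// with the same threshold that the preview reported.
async function previewThenApply(
  client: StaleNodePruner,
  aggressiveness: number,
  confirm: (candidates: PruneCandidate[]) => boolean,
): Promise<number> {
  const preview = await client.pruneStaleNodes({ aggressiveness, dryRun: true });
  if (preview.candidateCount === 0 || !confirm(preview.candidates)) {
    return 0; // nothing to prune, or the caller declined
  }
  const applied = await client.pruneStaleNodes({
    minScore: preview.appliedThreshold,
    dryRun: false,
  });
  return applied.deletedCount;
}
```

Pinning the second call to the previewed threshold keeps the applied pass aligned with what was reviewed, although nodes can still change between the two calls.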
Code Review
This pull request introduces a deterministic, preview-then-apply memory garbage collection sweep via the new POST /maintenance/prune-stale-nodes endpoint and MemoryClient.pruneStaleNodes SDK method. It scores entity and task nodes based on staleness, isolation, weak provenance, and claim decay to prune low-value nodes. The review feedback highlights a potential performance bottleneck in the database query within scoreNodeRows due to an OR condition in the join with the claims table, suggesting a refactor to use a UNION ALL subquery to leverage indexes more efficiently.
const rows = await db
  .select({
    id: nodes.id,
    nodeType: nodes.nodeType,
    label: nodeMetadata.label,
    createdAt: nodes.createdAt,
    lastClaimAt: sql<string | null>`max(${claims.statedAt})`.as(
      "last_claim_at",
    ),
    totalClaims:
      sql<number>`cast(count(distinct ${claims.id}) as integer)`.as(
        "total_claims",
      ),
    activeClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
      where ${claims.status} = 'active'
    ) as integer)`.as("active_claims"),
    supersededClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
      where ${claims.status} <> 'active'
    ) as integer)`.as("superseded_claims"),
    groundedActiveClaims:
      sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active'
        and ${claims.assertedByKind} not in ('assistant_inferred', 'system')
      ) as integer)`.as("grounded_active_claims"),
    activeReferenceClaims:
      sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active' and ${claims.scope} = 'reference'
      ) as integer)`.as("active_reference_claims"),
    activePersonalClaims:
      sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active' and ${claims.scope} = 'personal'
      ) as integer)`.as("active_personal_claims"),
    hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
      "has_alias",
    ),
    hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
      "has_source_link",
    ),
  })
  .from(nodes)
  .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
  .leftJoin(
    claims,
    and(
      eq(claims.userId, params.userId),
      sql`(${claims.subjectNodeId} = ${nodes.id} or ${claims.objectNodeId} = ${nodes.id})`,
    ),
  )
Avoid OR conditions in join predicates (e.g., in PostgreSQL): they can severely degrade query performance by preventing efficient index usage. Because this query scans all of the user's nodes in order to score them, the candidate IDs cannot be resolved in memory first. Instead, a UNION ALL subquery can gather subject-side and object-side claims separately, which lets PostgreSQL use the indexes on subject_node_id and object_node_id.
const unionClaims = db
  .select({
    nodeId: sql<string>`node_id`.as("node_id"),
    claimId: sql<string>`claim_id`.as("claim_id"),
    status: sql<string>`status`.as("status"),
    assertedByKind: sql<string>`asserted_by_kind`.as("asserted_by_kind"),
    scope: sql<string>`scope`.as("scope"),
    statedAt: sql<string>`stated_at`.as("stated_at"),
  })
  .from(
    sql`(
      SELECT subject_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
      FROM claims
      WHERE user_id = ${params.userId}
      UNION ALL
      SELECT object_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
      FROM claims
      WHERE user_id = ${params.userId} AND object_node_id IS NOT NULL
    ) as union_claims`
  )
  .as("union_claims");
const rows = await db
  .select({
    id: nodes.id,
    nodeType: nodes.nodeType,
    label: nodeMetadata.label,
    createdAt: nodes.createdAt,
    lastClaimAt: sql<string | null>`max(${unionClaims.statedAt})`.as(
      "last_claim_at",
    ),
    totalClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) as integer)`.as(
        "total_claims",
      ),
    activeClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} = 'active'
    ) as integer)`.as("active_claims"),
    supersededClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} <> 'active'
    ) as integer)`.as("superseded_claims"),
    groundedActiveClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active'
        and ${unionClaims.assertedByKind} not in ('assistant_inferred', 'system')
      ) as integer)`.as("grounded_active_claims"),
    activeReferenceClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'reference'
      ) as integer)`.as("active_reference_claims"),
    activePersonalClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'personal'
      ) as integer)`.as("active_personal_claims"),
    hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
      "has_alias",
    ),
    hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
      "has_source_link",
    ),
  })
  .from(nodes)
  .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
  .leftJoin(unionClaims, eq(unionClaims.nodeId, nodes.id))
References
- Avoid using OR conditions in database joins (e.g., in PostgreSQL) as it can severely degrade query performance by preventing efficient index usage. If the dataset is small or capped by a limit, resolve candidate IDs in-memory and perform a primary key lookup using an in-array query instead.
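One detail worth checking in the UNION ALL variant is why the aggregates still count distinct claim IDs. The sketch below models the two branches in memory; the `Claim` and `UnionRow` shapes and both helper functions are hypothetical and only mirror the columns named in the suggestion. A claim whose subject and object are the same node produces two rows for that node, so a plain row count would overstate it.

```typescript
// Hypothetical in-memory model of the claims table columns used by the join.
interface Claim {
  id: string;
  subjectNodeId: string;
  objectNodeId: string | null;
}

interface UnionRow {
  nodeId: string;
  claimId: string;
}

// Mirrors the two UNION ALL branches: one row per subject,
// plus one row per non-null object.
function unionAllRows(claims: Claim[]): UnionRow[] {
  const subjectRows = claims.map((c) => ({ nodeId: c.subjectNodeId, claimId: c.id }));
  const objectRows = claims
    .filter((c) => c.objectNodeId !== null)
    .map((c) => ({ nodeId: c.objectNodeId as string, claimId: c.id }));
  return subjectRows.concat(objectRows);
}

// Equivalent of count(distinct claim_id) grouped by node_id.
function distinctClaimCounts(rows: UnionRow[]): Map<string, number> {
  const seen = new Map<string, Set<string>>();
  for (const row of rows) {
    const ids = seen.get(row.nodeId) ?? new Set<string>();
    ids.add(row.claimId);
    seen.set(row.nodeId, ids);
  }
  const counts = new Map<string, number>();
  seen.forEach((ids, nodeId) => counts.set(nodeId, ids.size));
  return counts;
}
```

In the real query the later joins to aliases and source links can multiply rows as well, which is a second reason to keep the distinct counts.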
Re: the
If a power user's claim count ever grows enough to make this measurable, the explicit
Generated by Claude Code
Summary
Introduces `pruneStaleNodes`, a deterministic, preview-then-apply garbage collection sweep for accreted graph cruft. Unlike the existing orphan pruning, which only removes nodes with zero evidence, this feature scores every entity/task node on staleness, isolation, provenance quality, and claim decay, then prunes the disposable tail while protecting recent, well-connected, and identity-critical nodes.

Key Changes
- New job: `src/lib/jobs/prune-stale-nodes.ts`
  - `score = 0.40·staleness + 0.25·isolation + 0.20·weakProvenance + 0.15·decay`
  - `minIdleDays`, nodes with open task statuses, user self-identity nodes, and (optionally) reference-scope nodes
  - `limit`
- New schema: `src/lib/schemas/prune-stale-nodes.ts`
  - `aggressiveness` (0–1 knob), explicit `minScore`, `minIdleDays` recency floor, `stalenessHorizonDays`, `includeReference` flag, and pagination controls
  - `appliedThreshold`, `scannedCount`, `candidateCount`, `deletedCount`, `hasMore`, and a `candidates` sample with per-node `reasons`
- New endpoint: `src/routes/maintenance/prune-stale-nodes.post.ts`
  - `POST /maintenance/prune-stale-nodes`
- SDK integration: `src/sdk/memory-client.ts`
  - `pruneStaleNodes()` method with full JSDoc explaining the preview-then-apply workflow
- Comprehensive test suite: `src/lib/jobs/prune-stale-nodes.test.ts`
- Documentation: `README.md` and `CHANGELOG.md`

Implementation Details
- `count(DISTINCT claims.id)` to avoid row fan-out from joins
- `reasons` array explaining its score components

https://claude.ai/code/session_013rPD8Yg169QChrZA4mjQXg
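The weighted sum listed under the new job can be sketched as below. Only the four weights are taken from the PR. That each component is normalized to [0, 1], that candidates are those scoring at or above the threshold, and the names `ScoreComponents`, `staleScore`, and `pruneCandidates` are assumptions for illustration, not the job's actual code.

```typescript
// Each component is assumed to be normalized to the range [0, 1].
interface ScoreComponents {
  staleness: number;
  isolation: number;
  weakProvenance: number;
  decay: number;
}

// Weights as stated in the PR summary; they sum to 1.
const WEIGHTS: ScoreComponents = {
  staleness: 0.4,
  isolation: 0.25,
  weakProvenance: 0.2,
  decay: 0.15,
};

// Guard against out-of-range inputs (an assumption, not confirmed by the PR).
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// score = 0.40·staleness + 0.25·isolation + 0.20·weakProvenance + 0.15·decay
function staleScore(c: ScoreComponents): number {
  return (
    WEIGHTS.staleness * clamp01(c.staleness) +
    WEIGHTS.isolation * clamp01(c.isolation) +
    WEIGHTS.weakProvenance * clamp01(c.weakProvenance) +
    WEIGHTS.decay * clamp01(c.decay)
  );
}

// Keep nodes at or above the threshold, highest score first.
// Whether the real comparison is inclusive is an assumption.
function pruneCandidates<T extends { score: number }>(
  scored: T[],
  threshold: number,
): T[] {
  return scored
    .filter((n) => n.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

Because the weights sum to 1, the score stays within [0, 1], which is what lets a single 0–1 `aggressiveness` knob map onto a score threshold.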
count(DISTINCT claims.id)to avoid row fan-out from joinsreasonsarray explaining its score componentshttps://claude.ai/code/session_013rPD8Yg169QChrZA4mjQXg