Add deterministic memory garbage collection with pruneStaleNodes by marcelsamyn · Pull Request #60 · marcelsamyn/assistant-memory

marcelsamyn · 2026-06-09T14:34:00Z

Summary

Introduces pruneStaleNodes, a deterministic, preview-then-apply garbage collection sweep for accreted graph cruft. Unlike the existing orphan pruning, which only removes nodes with zero evidence, this feature scores every entity/task node on staleness, isolation, provenance quality, and claim decay, then prunes the disposable tail while protecting recent, well-connected, and identity-critical nodes.

Key Changes

New job: src/lib/jobs/prune-stale-nodes.ts
- Implements a transparent scoring algorithm: score = 0.40·staleness + 0.25·isolation + 0.20·weakProvenance + 0.15·decay
- Scores nodes based on age, connectivity, grounded vs. assistant-inferred claims, and superseded claim ratio
- Protects nodes active within minIdleDays, nodes with open task statuses, user self-identity nodes, and (optionally) reference-scope nodes
- Supports dry-run preview and destructive deletion modes with pagination via limit
- Deletion cascades through claims, source links, aliases, and embeddings via foreign keys
New schema: src/lib/schemas/prune-stale-nodes.ts
- Request schema with tuning parameters: aggressiveness (0–1 knob), explicit minScore, minIdleDays recency floor, stalenessHorizonDays, includeReference flag, and pagination controls
- Response schema includes appliedThreshold, scannedCount, candidateCount, deletedCount, hasMore, and a candidates sample with per-node reasons
New endpoint: src/routes/maintenance/prune-stale-nodes.post.ts
- HTTP handler for POST /maintenance/prune-stale-nodes
SDK integration: src/sdk/memory-client.ts
- Added pruneStaleNodes() method with full JSDoc explaining the preview-then-apply workflow
Comprehensive test suite: src/lib/jobs/prune-stale-nodes.test.ts
- 433-line DB integration test suite (runs against throwaway Postgres)
- Tests dry-run scoring, protection rules (recency floor, open tasks, self identity, reference scope), aggressiveness tuning, and destructive deletion
- Validates that stale nodes are scored correctly while protected nodes remain intact
Documentation: README.md and CHANGELOG.md
- Added section explaining the pruning feature, scoring formula, and usage example
- Documented in CHANGELOG under "Added"
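
To make the weighted sum concrete, here is a minimal sketch of the scoring formula. Only the four weights come from the description above. The `ScoreComponents` shape, the `clamp01` helper, and the linear `stalenessFromIdleDays` normalization are assumptions for illustration and may differ from what src/lib/jobs/prune-stale-nodes.ts actually does.

```typescript
// Hypothetical shape: each component is assumed to be normalized to [0, 1].
interface ScoreComponents {
  staleness: number;
  isolation: number;
  weakProvenance: number;
  decay: number;
}

// Weights as stated in the PR description.
const WEIGHTS: ScoreComponents = {
  staleness: 0.4,
  isolation: 0.25,
  weakProvenance: 0.2,
  decay: 0.15,
};

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// score = 0.40·staleness + 0.25·isolation + 0.20·weakProvenance + 0.15·decay
function pruneScore(c: ScoreComponents): number {
  return (
    WEIGHTS.staleness * clamp01(c.staleness) +
    WEIGHTS.isolation * clamp01(c.isolation) +
    WEIGHTS.weakProvenance * clamp01(c.weakProvenance) +
    WEIGHTS.decay * clamp01(c.decay)
  );
}

// Assumed normalization: staleness grows linearly with idle time and
// saturates once the node has been idle for stalenessHorizonDays.
function stalenessFromIdleDays(
  idleDays: number,
  stalenessHorizonDays: number,
): number {
  return clamp01(idleDays / stalenessHorizonDays);
}
```

Because the weights sum to 1, the score stays in [0, 1] as long as each component does, which is what lets a single threshold separate the disposable tail from the rest.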

Implementation Details

Scoring is LLM-free: Fast, cheap, and repeatable over large graphs; enables stable previews before deletion
Protection rules are explicit: Nodes are never pruned if they have open task status, are the user's self-identity, are too recent, or (by default) are reference-scoped
Deterministic ordering: Candidates sorted by score (descending) with node ID as stable tiebreaker for reproducibility
Efficient aggregation: Single SQL pass with count(DISTINCT claims.id) to avoid row fan-out from joins
Transparent reasoning: Each candidate includes human-readable reasons array explaining its score components
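
The deterministic ordering can be sketched as below. The `ScoredNode` type and the `rankCandidates` helper are hypothetical, and both the inclusive threshold comparison and the ascending direction of the ID tiebreak are assumptions; the description only states that candidates sort by score descending with node ID as a stable tiebreaker.

```typescript
// Hypothetical minimal view of a scored node.
interface ScoredNode {
  id: string;
  score: number;
}

// Keep nodes at or above the threshold, order by score descending,
// break ties by node ID (ascending here, as an assumption), then page.
function rankCandidates(
  nodes: readonly ScoredNode[],
  threshold: number,
  limit: number,
): ScoredNode[] {
  return nodes
    .filter((n) => n.score >= threshold) // filter() copies, so the input is not mutated
    .sort(
      (a, b) =>
        b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
    )
    .slice(0, limit);
}
```

With a total order like this, a dry-run preview and the later destructive call see the same ranking for the same data, regardless of the order in which rows come back from the database.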

https://claude.ai/code/session_013rPD8Yg169QChrZA4mjQXg

… graph cruft Add `POST /maintenance/prune-stale-nodes` + `MemoryClient.pruneStaleNodes`: a preview-then-apply, LLM-free GC pass that scores every entity/task node on a transparent weighted sum of staleness, isolation, weak provenance, and claim decay, then prunes the disposable tail. One `aggressiveness` knob maps to the score threshold (or pin `minScore`). `dryRun` defaults true and returns ranked candidates with per-node `reasons`; re-call with `dryRun: false` to delete (cascades through claims, source links, aliases, embeddings). Protected: nodes active within `minIdleDays`, open tasks, the user's self-identity node(s), and reference-scope nodes unless opted in. Complements `pruneOrphanNodes` (evidence-free) and `dedupSweep` (exact-label).
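
A rough sketch of the preview-then-apply flow follows. The field names (`dryRun`, `aggressiveness`, `limit`, `candidateCount`, `deletedCount`, `hasMore`, `candidates`, `reasons`) come from this PR, but the exact request and response types, the `userId` field, the `PruneClient` interface, and the `previewThenApply` helper are assumptions for illustration, not the SDK's actual signatures.

```typescript
// Hypothetical shapes; consult src/lib/schemas/prune-stale-nodes.ts for the real ones.
interface PruneRequest {
  userId: string; // assumed to be required
  aggressiveness?: number;
  dryRun?: boolean;
  limit?: number;
}

interface PruneCandidate {
  nodeId: string;
  score: number;
  reasons: string[];
}

interface PruneResponse {
  appliedThreshold: number;
  candidateCount: number;
  deletedCount: number;
  hasMore: boolean;
  candidates: PruneCandidate[];
}

// Anything exposing a pruneStaleNodes method of this shape would fit here.
interface PruneClient {
  pruneStaleNodes(req: PruneRequest): Promise<PruneResponse>;
}

// Run a dry-run first, let the caller inspect the ranked candidates,
// and only re-call with dryRun: false if the preview is approved.
async function previewThenApply(
  client: PruneClient,
  req: PruneRequest,
  approve: (preview: PruneResponse) => boolean,
): Promise<PruneResponse> {
  const preview = await client.pruneStaleNodes({ ...req, dryRun: true });
  if (preview.candidateCount === 0 || !approve(preview)) {
    return preview; // nothing deleted
  }
  return client.pruneStaleNodes({ ...req, dryRun: false });
}
```

Since scoring is deterministic, the second call should select the same candidates the preview showed, provided the graph has not changed in between.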

gemini-code-assist

Code Review

This pull request introduces a deterministic, preview-then-apply memory garbage collection sweep via the new POST /maintenance/prune-stale-nodes endpoint and MemoryClient.pruneStaleNodes SDK method. It scores entity and task nodes based on staleness, isolation, weak provenance, and claim decay to prune low-value nodes. The review feedback highlights a potential performance bottleneck in the database query within scoreNodeRows due to an OR condition in the join with the claims table, suggesting a refactor to use a UNION ALL subquery to leverage indexes more efficiently.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T14:35:35Z

+  const rows = await db
+    .select({
+      id: nodes.id,
+      nodeType: nodes.nodeType,
+      label: nodeMetadata.label,
+      createdAt: nodes.createdAt,
+      lastClaimAt: sql<string | null>`max(${claims.statedAt})`.as(
+        "last_claim_at",
+      ),
+      totalClaims:
+        sql<number>`cast(count(distinct ${claims.id}) as integer)`.as(
+          "total_claims",
+        ),
+      activeClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
+        where ${claims.status} = 'active'
+      ) as integer)`.as("active_claims"),
+      supersededClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
+        where ${claims.status} <> 'active'
+      ) as integer)`.as("superseded_claims"),
+      groundedActiveClaims:
+        sql<number>`cast(count(distinct ${claims.id}) filter (
+        where ${claims.status} = 'active'
+          and ${claims.assertedByKind} not in ('assistant_inferred', 'system')
+      ) as integer)`.as("grounded_active_claims"),
+      activeReferenceClaims:
+        sql<number>`cast(count(distinct ${claims.id}) filter (
+        where ${claims.status} = 'active' and ${claims.scope} = 'reference'
+      ) as integer)`.as("active_reference_claims"),
+      activePersonalClaims:
+        sql<number>`cast(count(distinct ${claims.id}) filter (
+        where ${claims.status} = 'active' and ${claims.scope} = 'personal'
+      ) as integer)`.as("active_personal_claims"),
+      hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
+        "has_alias",
+      ),
+      hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
+        "has_source_link",
+      ),
+    })
+    .from(nodes)
+    .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
+    .leftJoin(
+      claims,
+      and(
+        eq(claims.userId, params.userId),
+        sql`(${claims.subjectNodeId} = ${nodes.id} or ${claims.objectNodeId} = ${nodes.id})`,
+      ),
+    )


Avoid using OR conditions in database joins (e.g., in PostgreSQL), as they can severely degrade query performance by preventing efficient index usage. Since we are scanning all nodes for the user to score them, we cannot resolve candidate IDs in-memory first. Instead, we can use a UNION ALL subquery to aggregate subject and object claims, so that each branch can use its own index on subject_node_id or object_node_id.

const unionClaims = db
  .select({
    nodeId: sql<string>`node_id`.as("node_id"),
    claimId: sql<string>`claim_id`.as("claim_id"),
    status: sql<string>`status`.as("status"),
    assertedByKind: sql<string>`asserted_by_kind`.as("asserted_by_kind"),
    scope: sql<string>`scope`.as("scope"),
    statedAt: sql<string>`stated_at`.as("stated_at"),
  })
  .from(
    sql`(
      SELECT subject_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
      FROM claims
      WHERE user_id = ${params.userId}
      UNION ALL
      SELECT object_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
      FROM claims
      WHERE user_id = ${params.userId} AND object_node_id IS NOT NULL
    ) as union_claims`
  )
  .as("union_claims");

const rows = await db
  .select({
    id: nodes.id,
    nodeType: nodes.nodeType,
    label: nodeMetadata.label,
    createdAt: nodes.createdAt,
    lastClaimAt: sql<string | null>`max(${unionClaims.statedAt})`.as(
      "last_claim_at",
    ),
    totalClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) as integer)`.as(
        "total_claims",
      ),
    activeClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} = 'active'
    ) as integer)`.as("active_claims"),
    supersededClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} <> 'active'
    ) as integer)`.as("superseded_claims"),
    groundedActiveClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} = 'active'
        and ${unionClaims.assertedByKind} not in ('assistant_inferred', 'system')
    ) as integer)`.as("grounded_active_claims"),
    activeReferenceClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'reference'
    ) as integer)`.as("active_reference_claims"),
    activePersonalClaims:
      sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
      where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'personal'
    ) as integer)`.as("active_personal_claims"),
    hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
      "has_alias",
    ),
    hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
      "has_source_link",
    ),
  })
  .from(nodes)
  .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
  .leftJoin(unionClaims, eq(unionClaims.nodeId, nodes.id))

References

Avoid using OR conditions in database joins (e.g., in PostgreSQL) as it can severely degrade query performance by preventing efficient index usage. If the dataset is small or capped by a limit, resolve candidate IDs in-memory and perform a primary key lookup using an in-array query instead.

marcelsamyn · 2026-06-09T14:41:20Z

Re: the OR-in-join flag on scoreNodeRows — keeping it as-is, for these reasons:

It's a full-scan aggregate, not a selective lookup. This query deliberately scores every node for the user, so it reads all of that user's nodes and all their claims regardless of plan. The "OR prevents index usage" warning targets selective queries where an index would prune rows — there's nothing to prune here, so a hash/bitmap-OR join over the userId-filtered claim set is already about optimal. Both join columns are indexed (claims_user_id_subject_node_id_idx, claims_user_id_object_node_id_idx), so the planner can bitmap-OR them if it prefers a nested loop.
Cold maintenance path. This is the occasional preview-then-apply GC sweep, not a request-time hot path — sub-second over ~8K nodes either way.
The UNION ALL rewrite is functionally equivalent but costlier to maintain. count(distinct claimId) dedupes the subject==object case identically, so there's no correctness gain; the rewrite mainly trades the type-safe query builder for a raw-SQL subquery that's harder to test.

If a power user's claim count ever grows enough to make this measurable, the explicit UNION ALL plan is the right tool — noting it here so it's easy to revisit then.
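
The equivalence claim about the subject==object case can be checked with a small in-memory model. This is only an illustration of the counting argument, not the project's query: the `Claim` type and the three helpers are hypothetical stand-ins for the OR join, the UNION ALL subquery, and count(DISTINCT claim_id).

```typescript
// Hypothetical minimal claim row.
interface Claim {
  id: string;
  subjectNodeId: string;
  objectNodeId: string | null;
}

interface EndpointRow {
  nodeId: string;
  claimId: string;
}

// Models the UNION ALL subquery: one row per claim endpoint, so a claim whose
// subject and object are the same node appears twice for that node.
function unionAllRows(claims: readonly Claim[]): EndpointRow[] {
  const rows: EndpointRow[] = claims.map((c) => ({
    nodeId: c.subjectNodeId,
    claimId: c.id,
  }));
  for (const c of claims) {
    if (c.objectNodeId !== null) {
      rows.push({ nodeId: c.objectNodeId, claimId: c.id });
    }
  }
  return rows;
}

// Models count(DISTINCT claim_id) grouped by node.
function distinctClaimCounts(rows: readonly EndpointRow[]): Map<string, number> {
  const seen = new Map<string, Set<string>>();
  for (const r of rows) {
    const ids = seen.get(r.nodeId) ?? new Set<string>();
    ids.add(r.claimId);
    seen.set(r.nodeId, ids);
  }
  return new Map([...seen].map(([nodeId, ids]) => [nodeId, ids.size]));
}

// Models the OR join: a claim matches a node if either endpoint equals it.
function orJoinCounts(
  claims: readonly Claim[],
  nodeIds: readonly string[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const nodeId of nodeIds) {
    const matched = claims.filter(
      (c) => c.subjectNodeId === nodeId || c.objectNodeId === nodeId,
    );
    counts.set(nodeId, new Set(matched.map((c) => c.id)).size);
  }
  return counts;
}
```

In this model a self-referential claim produces two UNION ALL rows for its node, but the distinct count collapses them, so both approaches report the same per-node totals.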

Generated by Claude Code

gemini-code-assist Bot reviewed Jun 9, 2026

marcelsamyn merged commit f9c946a into main Jun 9, 2026
1 check passed

marcelsamyn merged 1 commit into main from claude/adoring-euler-hj5blp
