Add deterministic memory garbage collection with pruneStaleNodes #60

Merged
marcelsamyn merged 1 commit into main from claude/adoring-euler-hj5blp on Jun 9, 2026

@marcelsamyn

Owner

Summary

Introduces pruneStaleNodes, a deterministic, preview-then-apply garbage collection sweep for accreted graph cruft. Unlike the existing orphan pruning, which only removes nodes with zero evidence, this feature scores every entity/task node on staleness, isolation, provenance quality, and claim decay, then prunes the disposable tail while protecting recent, well-connected, and identity-critical nodes.

Key Changes

  • New job: src/lib/jobs/prune-stale-nodes.ts

    • Implements a transparent scoring algorithm: score = 0.40·staleness + 0.25·isolation + 0.20·weakProvenance + 0.15·decay
    • Scores nodes based on age, connectivity, grounded vs. assistant-inferred claims, and superseded claim ratio
    • Protects nodes active within minIdleDays, nodes with open task statuses, user self-identity nodes, and (optionally) reference-scope nodes
    • Supports dry-run preview and destructive deletion modes with pagination via limit
    • Deletion cascades through claims, source links, aliases, and embeddings via foreign keys
  • New schema: src/lib/schemas/prune-stale-nodes.ts

    • Request schema with tuning parameters: aggressiveness (0–1 knob), explicit minScore, minIdleDays recency floor, stalenessHorizonDays, includeReference flag, and pagination controls
    • Response schema includes appliedThreshold, scannedCount, candidateCount, deletedCount, hasMore, and a candidates sample with per-node reasons
  • New endpoint: src/routes/maintenance/prune-stale-nodes.post.ts

    • HTTP handler for POST /maintenance/prune-stale-nodes
  • SDK integration: src/sdk/memory-client.ts

    • Added pruneStaleNodes() method with full JSDoc explaining the preview-then-apply workflow
  • Comprehensive test suite: src/lib/jobs/prune-stale-nodes.test.ts

    • 433-line DB integration test suite (runs against throwaway Postgres)
    • Tests dry-run scoring, protection rules (recency floor, open tasks, self identity, reference scope), aggressiveness tuning, and destructive deletion
    • Validates that stale nodes are scored correctly while protected nodes remain intact
  • Documentation: README.md and CHANGELOG.md

    • Added section explaining the pruning feature, scoring formula, and usage example
    • Documented in CHANGELOG under "Added"
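
To make the scoring formula above concrete, here is a minimal sketch of the weighted sum. The weights are the ones quoted in the formula; the `ScoreComponents` shape, the `pruneScore` name, and the assumption that every component is normalized to the range 0 to 1 are illustrative and not taken from the actual job.

```typescript
// Hypothetical shape: each component is assumed to be normalized to [0, 1].
interface ScoreComponents {
  staleness: number;
  isolation: number;
  weakProvenance: number;
  decay: number;
}

// Weights as quoted in the PR description's formula.
const WEIGHTS: ScoreComponents = {
  staleness: 0.4,
  isolation: 0.25,
  weakProvenance: 0.2,
  decay: 0.15,
};

// Clamp defensively so an out-of-range component cannot push the score outside [0, 1].
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Higher score means "more disposable".
function pruneScore(c: ScoreComponents): number {
  return (
    WEIGHTS.staleness * clamp01(c.staleness) +
    WEIGHTS.isolation * clamp01(c.isolation) +
    WEIGHTS.weakProvenance * clamp01(c.weakProvenance) +
    WEIGHTS.decay * clamp01(c.decay)
  );
}
```

Because the weights sum to 1, a node that maxes out every component scores 1 and a node that is fresh, connected, grounded, and undecayed scores 0, which is what lets a single threshold separate the disposable tail.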

Implementation Details

  • Scoring is LLM-free: Fast, cheap, and repeatable over large graphs; enables stable previews before deletion
  • Protection rules are explicit: Nodes are never pruned if they have open task status, are the user's self-identity, are too recent, or (by default) are reference-scoped
  • Deterministic ordering: Candidates sorted by score (descending) with node ID as stable tiebreaker for reproducibility
  • Efficient aggregation: Single SQL pass with count(DISTINCT claims.id) to avoid row fan-out from joins
  • Transparent reasoning: Each candidate includes human-readable reasons array explaining its score components
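
The protection rules and the deterministic ordering listed above could be sketched roughly as follows. This is not the job's code: the `ScoredNode` and `PruneOptions` shapes, the helper names, and the choice of an inclusive `>=` threshold comparison are all assumptions made for illustration.

```typescript
// Hypothetical per-node summary; field names are illustrative only.
interface ScoredNode {
  id: string;
  score: number;
  lastActiveAt: Date;
  isSelfIdentity: boolean;
  hasOpenTaskStatus: boolean;
  isReferenceScoped: boolean;
}

// Hypothetical options mirroring the request parameters named in the PR.
interface PruneOptions {
  now: Date;
  minIdleDays: number;
  includeReference: boolean;
  minScore: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// A node is protected if any single rule applies.
function isProtected(n: ScoredNode, o: PruneOptions): boolean {
  const idleDays = (o.now.getTime() - n.lastActiveAt.getTime()) / DAY_MS;
  return (
    idleDays < o.minIdleDays ||
    n.hasOpenTaskStatus ||
    n.isSelfIdentity ||
    (n.isReferenceScoped && !o.includeReference)
  );
}

// Score descending, then node ID ascending as a stable tiebreaker,
// so repeated previews over the same data return the same order.
function rankCandidates(nodes: ScoredNode[], o: PruneOptions): ScoredNode[] {
  return nodes
    .filter((n) => !isProtected(n, o) && n.score >= o.minScore)
    .sort(
      (a, b) => b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
    );
}
```

The ID tiebreaker matters for the preview-then-apply flow: without it, two nodes with equal scores could swap places between the dry run and the destructive run, and a `limit` cut-off could delete a node the preview never showed.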

https://claude.ai/code/session_013rPD8Yg169QChrZA4mjQXg

… graph cruft

Add `POST /maintenance/prune-stale-nodes` + `MemoryClient.pruneStaleNodes`: a
preview-then-apply, LLM-free GC pass that scores every entity/task node on a
transparent weighted sum of staleness, isolation, weak provenance, and claim
decay, then prunes the disposable tail.

One `aggressiveness` knob maps to the score threshold (or pin `minScore`).
`dryRun` defaults true and returns ranked candidates with per-node `reasons`;
re-call with `dryRun: false` to delete (cascades through claims, source links,
aliases, embeddings). Protected: nodes active within `minIdleDays`, open tasks,
the user's self-identity node(s), and reference-scope nodes unless opted in.

Complements `pruneOrphanNodes` (evidence-free) and `dedupSweep` (exact-label).
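
From the caller's side, the preview-then-apply flow described above might look like the sketch below. The request and response field names (`dryRun`, `aggressiveness`, `limit`, `appliedThreshold`, `candidateCount`, and so on) come from this PR's description, but their exact types, the `PruneClient` interface, the candidate `id` and `score` fields, and the `previewThenApply` helper are assumptions; the real `MemoryClient` may differ.

```typescript
// Field names follow the PR description; the exact types are assumed.
interface PruneRequest {
  dryRun?: boolean;
  aggressiveness?: number;
  limit?: number;
}

interface PruneCandidate {
  id: string; // assumed field
  score: number; // assumed field
  reasons: string[];
}

interface PruneResponse {
  appliedThreshold: number;
  scannedCount: number;
  candidateCount: number;
  deletedCount: number;
  hasMore: boolean;
  candidates: PruneCandidate[];
}

// Hypothetical minimal client surface used only for this sketch.
interface PruneClient {
  pruneStaleNodes(req: PruneRequest): Promise<PruneResponse>;
}

// Runs a dry-run preview, lets the caller inspect it, and only then deletes.
// Returns the number of nodes deleted (0 if the preview was rejected or empty).
async function previewThenApply(
  client: PruneClient,
  approve: (preview: PruneResponse) => boolean,
  req: PruneRequest = { aggressiveness: 0.3, limit: 100 },
): Promise<number> {
  const preview = await client.pruneStaleNodes({ ...req, dryRun: true });
  if (preview.candidateCount === 0 || !approve(preview)) return 0;

  // Same parameters as the preview, so the deterministic scorer
  // selects the same candidates that were just reviewed.
  const applied = await client.pruneStaleNodes({ ...req, dryRun: false });
  return applied.deletedCount;
}
```

Passing the identical request to both calls is the point of the design: because scoring is deterministic and LLM-free, the second call should act on the set the first call displayed, provided the graph has not changed in between.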

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a deterministic, preview-then-apply memory garbage collection sweep via the new POST /maintenance/prune-stale-nodes endpoint and MemoryClient.pruneStaleNodes SDK method. It scores entity and task nodes based on staleness, isolation, weak provenance, and claim decay to prune low-value nodes. The review feedback highlights a potential performance bottleneck in the database query within scoreNodeRows due to an OR condition in the join with the claims table, suggesting a refactor to use a UNION ALL subquery to leverage indexes more efficiently.


Comment on lines +92 to +139
  const rows = await db
    .select({
      id: nodes.id,
      nodeType: nodes.nodeType,
      label: nodeMetadata.label,
      createdAt: nodes.createdAt,
      lastClaimAt: sql<string | null>`max(${claims.statedAt})`.as(
        "last_claim_at",
      ),
      totalClaims:
        sql<number>`cast(count(distinct ${claims.id}) as integer)`.as(
          "total_claims",
        ),
      activeClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active'
      ) as integer)`.as("active_claims"),
      supersededClaims: sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} <> 'active'
      ) as integer)`.as("superseded_claims"),
      groundedActiveClaims:
        sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active'
          and ${claims.assertedByKind} not in ('assistant_inferred', 'system')
      ) as integer)`.as("grounded_active_claims"),
      activeReferenceClaims:
        sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active' and ${claims.scope} = 'reference'
      ) as integer)`.as("active_reference_claims"),
      activePersonalClaims:
        sql<number>`cast(count(distinct ${claims.id}) filter (
        where ${claims.status} = 'active' and ${claims.scope} = 'personal'
      ) as integer)`.as("active_personal_claims"),
      hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
        "has_alias",
      ),
      hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
        "has_source_link",
      ),
    })
    .from(nodes)
    .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
    .leftJoin(
      claims,
      and(
        eq(claims.userId, params.userId),
        sql`(${claims.subjectNodeId} = ${nodes.id} or ${claims.objectNodeId} = ${nodes.id})`,
      ),
    )


high

Avoid using OR conditions in database joins (e.g., in PostgreSQL) as it can severely degrade query performance by preventing efficient index usage. Since we are scanning all nodes for the user to score them, we cannot resolve candidate IDs in-memory first. Instead, we can use a UNION ALL subquery to aggregate subject and object claims, allowing PostgreSQL to use indexes on subject_node_id and object_node_id perfectly.

  const unionClaims = db
    .select({
      nodeId: sql<string>`node_id`.as("node_id"),
      claimId: sql<string>`claim_id`.as("claim_id"),
      status: sql<string>`status`.as("status"),
      assertedByKind: sql<string>`asserted_by_kind`.as("asserted_by_kind"),
      scope: sql<string>`scope`.as("scope"),
      statedAt: sql<string>`stated_at`.as("stated_at"),
    })
    .from(
      sql`(
        SELECT subject_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
        FROM claims
        WHERE user_id = ${params.userId}
        UNION ALL
        SELECT object_node_id AS node_id, id AS claim_id, status, asserted_by_kind, scope, stated_at
        FROM claims
        WHERE user_id = ${params.userId} AND object_node_id IS NOT NULL
      ) as union_claims` 
    )
    .as("union_claims");

  const rows = await db
    .select({
      id: nodes.id,
      nodeType: nodes.nodeType,
      label: nodeMetadata.label,
      createdAt: nodes.createdAt,
      lastClaimAt: sql<string | null>`max(${unionClaims.statedAt})`.as(
        "last_claim_at",
      ),
      totalClaims:
        sql<number>`cast(count(distinct ${unionClaims.claimId}) as integer)`.as(
          "total_claims",
        ),
      activeClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active'
      ) as integer)`.as("active_claims"),
      supersededClaims: sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} <> 'active'
      ) as integer)`.as("superseded_claims"),
      groundedActiveClaims:
        sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active'
          and ${unionClaims.assertedByKind} not in ('assistant_inferred', 'system')
      ) as integer)`.as("grounded_active_claims"),
      activeReferenceClaims:
        sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'reference'
      ) as integer)`.as("active_reference_claims"),
      activePersonalClaims:
        sql<number>`cast(count(distinct ${unionClaims.claimId}) filter (
        where ${unionClaims.status} = 'active' and ${unionClaims.scope} = 'personal'
      ) as integer)`.as("active_personal_claims"),
      hasAlias: sql<boolean>`bool_or(${aliases.id} is not null)`.as(
        "has_alias",
      ),
      hasSourceLink: sql<boolean>`bool_or(${sourceLinks.id} is not null)`.as(
        "has_source_link",
      ),
    })
    .from(nodes)
    .leftJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
    .leftJoin(unionClaims, eq(unionClaims.nodeId, nodes.id))
References
  1. Avoid using OR conditions in database joins (e.g., in PostgreSQL) as it can severely degrade query performance by preventing efficient index usage. If the dataset is small or capped by a limit, resolve candidate IDs in-memory and perform a primary key lookup using an in-array query instead.

Owner Author

Re: the OR-in-join flag on scoreNodeRows — keeping it as-is, for these reasons:

  • It's a full-scan aggregate, not a selective lookup. This query deliberately scores every node for the user, so it reads all of that user's nodes and all their claims regardless of plan. The "OR prevents index usage" warning targets selective queries where an index would prune rows — there's nothing to prune here, so a hash/bitmap-OR join over the userId-filtered claim set is already about optimal. Both join columns are indexed (claims_user_id_subject_node_id_idx, claims_user_id_object_node_id_idx), so the planner can bitmap-OR them if it prefers a nested loop.
  • Cold maintenance path. This is the occasional preview-then-apply GC sweep, not a request-time hot path — sub-second over ~8K nodes either way.
  • The UNION ALL rewrite is functionally equivalent but costlier to maintain. count(distinct claimId) dedupes the subject==object case identically, so there's no correctness gain; the rewrite mainly trades the type-safe query builder for a raw-SQL subquery that's harder to test.

If a power user's claim count ever grows enough to make this measurable, the explicit UNION ALL plan is the right tool — noting it here so it's easy to revisit then.
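
The equivalence claim in the third bullet can be checked with a small in-memory model. This is only an illustration, not either query: the `Claim` shape and both helper names are made up, the OR join is modeled as one row per matching claim, and the UNION ALL subquery as one row per subject reference plus one per non-null object reference.

```typescript
// Hypothetical in-memory claim; only the columns relevant to the join.
interface Claim {
  id: string;
  subjectNodeId: string;
  objectNodeId: string | null;
}

// Models the OR join: a claim matches a node once, even if the node
// is both its subject and its object.
function countViaOrJoin(nodeId: string, claims: Claim[]): number {
  return claims.filter(
    (c) => c.subjectNodeId === nodeId || c.objectNodeId === nodeId,
  ).length;
}

// Models the UNION ALL subquery: one row per subject reference plus one
// row per non-null object reference, so a self-referential claim appears twice.
function unionAllRows(claims: Claim[]): { nodeId: string; claimId: string }[] {
  const rows: { nodeId: string; claimId: string }[] = [];
  for (const c of claims) rows.push({ nodeId: c.subjectNodeId, claimId: c.id });
  for (const c of claims) {
    if (c.objectNodeId !== null) rows.push({ nodeId: c.objectNodeId, claimId: c.id });
  }
  return rows;
}

// Models count(distinct claim_id) over the UNION ALL rows for one node.
function countViaUnionAll(nodeId: string, claims: Claim[]): number {
  const ids = unionAllRows(claims)
    .filter((r) => r.nodeId === nodeId)
    .map((r) => r.claimId);
  return new Set(ids).size;
}

const sample: Claim[] = [
  { id: "c1", subjectNodeId: "A", objectNodeId: "B" },
  { id: "c2", subjectNodeId: "A", objectNodeId: "A" }, // subject == object
  { id: "c3", subjectNodeId: "B", objectNodeId: null },
  { id: "c4", subjectNodeId: "C", objectNodeId: "A" },
];

console.log(countViaOrJoin("A", sample), countViaUnionAll("A", sample));
```

In this model both counts agree for every node, including the self-referential claim `c2`. The agreement depends on the distinct count: the UNION ALL rows for node `A` number four, not three, so any aggregate over that subquery that is not deduplicated by claim ID would diverge from the OR-join version.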


Generated by Claude Code

@marcelsamyn marcelsamyn merged commit f9c946a into main Jun 9, 2026
1 check passed