feat(kb-enrich): pluggable collector architecture #31

Draft
stephengolub wants to merge 2 commits into athal7:main from stephengolub:feat/kb-enrich-pluggable-collectors
Conversation

@stephengolub

Contributor

Summary

Refactors /kb-enrich into a two-layer design: a thin orchestrator command plus a directory of self-contained collector files, one per data source.

Changes

commands/kb-enrich.md (rewritten as 5-step orchestrator)

The command no longer hard-codes any sources. It now:

  1. Resolves KB_ROOT and a re-run guard
  2. Resolves the date range (same gap-healing logic as before)
  3. Loads collectors from ~/.config/opencode/kb-collectors/*.md — reads frontmatter to get name, enabled, priority; sorts by priority
  4. Runs collectors in order, applying each one's embedded query recipe and triage rules
  5. Writes journal, profiles, decisions, and action items

kb-collectors/ (new directory, 7 collector files)

| File | Priority | Source |
|---|---|---|
| openspec.md | 0 | OpenSpec durable store; builds session exclusion set first |
| granola.md | 1 | Meeting notes, decisions, people-contact |
| slack.md | 2 | Informal decisions, action items |
| linear.md | 3 | Tickets, completed work |
| gh.md | 4 | Shipped PRs, reviews |
| opencode.md | 5 | Coding sessions (exclusion applied) |
| google-chat.md | 6 | Disabled by default; opt-in for Chat |

Each file is self-contained: YAML frontmatter (name, enabled, priority, authoritative_for) + a body with query recipe, triage rules, and extraction rules.
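
As a rough illustration of the loading step described above (not the PR's implementation, since the orchestrator is a markdown recipe followed by an agent), the following Python sketch reads flat `key: value` frontmatter, skips collectors marked `enabled: false`, and sorts the rest by `priority`. The helper names `parse_frontmatter` and `load_collectors`, the default priority of 100, and the flat-frontmatter assumption are all hypothetical.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs between the leading `---` markers.

    Deliberately minimal: no nesting, and list values stay raw strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if line.lstrip().startswith("#") or ":" not in line:
            continue  # skip comments and anything that is not key: value
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def load_collectors(directory: Path) -> list[dict]:
    """Return enabled collectors, lowest priority number first."""
    collectors = []
    for path in sorted(directory.glob("*.md")):
        meta = parse_frontmatter(path.read_text(encoding="utf-8"))
        if meta.get("enabled", "true").lower() == "false":
            continue  # disabled in frontmatter
        meta.setdefault("name", path.stem)
        meta["priority"] = int(meta.get("priority", 100))  # assumed default
        collectors.append(meta)
    return sorted(collectors, key=lambda c: c["priority"])
```

With the seven files listed above, `load_collectors` would return `openspec` first and omit `google-chat` while it stays disabled.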

Why

  • Adding a new source = drop a .md file in kb-collectors/. Zero changes to the orchestrator.
  • Per-machine config (org lists, workspace slugs, token paths) lives in each collector's frontmatter — no need to touch the orchestrator.
  • Disabling a source = enabled: false in frontmatter.
  • The orchestrator stays stable; collectors evolve independently.

The session exclusion (OpenSpec dedup) logic is preserved and works the same way — the openspec collector runs at priority 0 and builds the exclusion set before opencode runs at priority 5.

Refactors /kb-enrich into a two-layer design: a thin orchestrator command
plus a directory of self-contained collector files, one per data source.

Changes:
- commands/kb-enrich.md: rewritten as a 5-step orchestrator (resolve config,
  resolve date range, load collectors, run collectors, enrichment). Sources
  are no longer hard-coded; the orchestrator reads whatever collectors are
  present in kb-collectors/.
- kb-collectors/ (new): one .md file per source, each with YAML frontmatter
  (name, enabled, priority, authoritative_for) and a body with query recipe,
  triage rules, and extraction rules.
  - openspec.md  (priority 0) — builds session exclusion set first
  - granola.md   (priority 1) — meetings, decisions, people-contact
  - slack.md     (priority 2) — informal decisions, action items
  - linear.md    (priority 3) — tickets, completed work
  - gh.md        (priority 4) — shipped PRs, reviews
  - opencode.md  (priority 5) — coding sessions (with exclusion applied)
  - google-chat.md (priority 6, disabled by default) — opt-in for Chat

To add a new source: drop a .md file into kb-collectors/. No changes to the
orchestrator needed. To disable a source or configure per-machine values
(org lists, workspace slugs, token paths), edit the collector's frontmatter.

@athal7 left a comment

Owner

i like the overall shape, a few things i would probably do differently, and a few things that aren't relevant to my dotfiles

1. **Collect excluded worktrees.** For each date being enriched, read every `~/.local/share/kb/openspec/*/changes/archive/<date>-*/kb-meta.yaml` and collect its `worktree:` value (the absolute repo/worktree root, stamped at archive). That set is the exclusion list.
2. **Skip those sessions.** When scanning **opencode** sessions, a session is identified by its `directory` column in the `session` table of `~/.local/share/opencode/opencode.db`. SKIP any session whose `directory` is in the exclusion set — for those, narrate from the change's `design.md`/specs, not the transcript. Only sessions NOT covered by an archived change get a transcript read.
3. **Filter at query time.** Pass the collected worktrees as the `NOT IN (...)` list and bound by the date window (`time_updated` is epoch-ms):
Enrich the gap since the last run, not a hard single day. The most recent `$KB_ROOT/journal/YYYY-MM-DD.md` is the last-run marker: enrich each date from (last journal date + 1) through today, inclusive. This makes a Monday run sweep the trailing weekend and lets a skipped run self-heal on the next run. If no prior journal exists, default to today. An explicit date or range in `$ARGUMENTS` overrides this.
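
The gap-healing rule quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `dates_to_enrich` and its parameters are hypothetical, and the journal dates are assumed to have already been parsed from the `YYYY-MM-DD.md` filenames.

```python
from datetime import date, timedelta
from typing import Optional


def dates_to_enrich(
    journal_dates: list[date],
    today: date,
    explicit: Optional[list[date]] = None,
) -> list[date]:
    """Return the dates to enrich, oldest first."""
    if explicit:
        return sorted(explicit)  # an explicit date or range overrides the gap logic
    if not journal_dates:
        return [today]  # no prior journal: default to today
    start = max(journal_dates) + timedelta(days=1)  # last journal date + 1
    span = (today - start).days + 1  # inclusive of today; <= 0 if already enriched
    return [start + timedelta(days=i) for i in range(span)]
```

With a last journal dated Friday 2024-05-03 and a run on Monday 2024-05-06, this yields the three dates 2024-05-04 through 2024-05-06. If today's journal already exists, the result is empty, so nothing is re-run.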

Owner

this line conflicts with the re-run guard above. i like this version better

Contributor Author

Removed — the date-range logic in Step 1 already covers this (existing journal files become the last-run marker, so a date that's already enriched is simply last date + 1 and won't re-run).


## Enrichment Steps
1. Parse the YAML frontmatter to get `name`, `enabled`, `priority`, `authoritative_for`.
2. Skip any collector with `enabled: false`.

Owner

i think external config may be better, so as to not dirty chezmoi state

Contributor Author

how do you mean?

Contributor Author

Dropped `enabled` from frontmatter entirely; presence of the file is now the toggle. Disabling a collector means removing its file (or never adding it), which never touches a chezmoi-tracked source file.

Comment thread dot_config/opencode/kb-collectors/gh.md Outdated
# orgs: GitHub orgs to scope searches to. When set, each org is added as
# `org:NAME` to the search query. Leave empty for no org filter (searches
# across all repos you have access to).
orgs: []

Owner

there is a separate github org config in local vars, would be great to reuse that

Contributor Author

oh yea, you have that set up for chezmoi, I don't. I probably SHOULD.

Contributor Author

Updated: now reads from `chezmoi data --format json | jq '[.orgs | keys[]]'`. Falls back to no org filter if `orgs` is empty.
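
For illustration, here is a hedged Python equivalent of that lookup. It assumes `orgs` in the chezmoi data is a mapping keyed by org name (which the `keys[]` filter implies). The function name `org_qualifiers` is hypothetical, and the JSON text is passed in rather than fetched, so the sketch does not depend on chezmoi being installed.

```python
import json


def org_qualifiers(chezmoi_json: str) -> list[str]:
    """Turn the `orgs` keys from chezmoi data into `org:NAME` search qualifiers.

    Returns an empty list (no org filter) when `orgs` is missing or empty.
    """
    data = json.loads(chezmoi_json)
    orgs = data.get("orgs") or {}
    # jq's `keys` returns sorted keys, so sort here to match that order.
    return [f"org:{name}" for name in sorted(orgs)]
```

The resulting qualifiers would be appended to whatever search terms the collector already uses; an empty list leaves the search unscoped.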

name: gh
enabled: true
priority: 4
authoritative_for: [shipped-code, reviews]

Owner

not sure i understand how this is used

Contributor Author

It's metadata the orchestrator doesn't act on — mainly documentation of what each collector is the source of truth for (to help a reader understand dedup decisions, like why openspec sessions take precedence over opencode session transcripts). Happy to drop it if it feels like overhead without benefit.

---
name: gh
enabled: true
priority: 4

Owner

not sure we're getting a ton from priority?

Contributor Author

The main place it matters is openspec (0) needing to run before opencode (5) so the session exclusion set is built first. The middle-ground ones (slack, linear, gh) don't have hard ordering requirements. Happy to simplify — could drop priority and just document the openspec-before-opencode dependency explicitly in the orchestrator instead.

@@ -0,0 +1,38 @@
---

Owner

i think collectors should be user specific, and i don't use this (or granola)

Contributor Author

Oh, yea, those shouldn't have gotten shipped to you, I'll remove them.

Contributor Author

Removed both. Agreed — collectors should be user-specific files you add to your own dotfiles, not shipped in the base set.

If `workspace` is set in frontmatter, pass it with `--workspace`:

```bash
linear issue mine --updated-after YYYY-MM-DD --all-states --no-pager
```

Owner

i don't use linear cli :)

Contributor Author

Switched to the GraphQL API via the linear skill — same approach as your other Linear integrations.

- Remove re-run guard (redundant with Step 1 date-range logic)
- Drop 'enabled' frontmatter field; presence/absence of file is the toggle
- gh.md: read orgs from chezmoi data instead of hardcoded frontmatter
- linear.md: switch from linear CLI to GraphQL API via 'linear' skill
- Remove google-chat.md and granola.md (user-specific collectors)

Copilot AI left a comment


Pull request overview

Refactors the /kb-enrich OpenCode command into an orchestrator-style recipe that discovers and runs per-source “collector” markdown files under ~/.config/opencode/kb-collectors/, enabling source-specific query/extraction rules and preserving the OpenSpec-first session dedup flow.

Changes:

  • Rewrites kb-enrich into stepwise orchestration: resolve config, resolve date window, load collectors, apply OpenSpec-based session exclusion, run collectors, then write KB outputs.
  • Adds collector recipe files for OpenSpec, OpenCode sessions, Slack, Linear, and GitHub.
  • Moves source-specific querying/triage/extraction guidance into self-contained collector documents.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| dot_config/opencode/commands/kb-enrich.md | Orchestrator-style KB enrichment flow, including collector loading + OpenSpec dedup sequencing. |
| dot_config/opencode/kb-collectors/openspec.md | Priority-0 collector defining OpenSpec archive reads and building the session exclusion set. |
| dot_config/opencode/kb-collectors/opencode.md | Collector for querying the local OpenCode session DB with exclusion applied upstream. |
| dot_config/opencode/kb-collectors/slack.md | Collector recipe for Slack search + thread reads and extraction/skip guidance. |
| dot_config/opencode/kb-collectors/linear.md | Collector recipe for Linear issue activity via the Linear skill/GraphQL. |
| dot_config/opencode/kb-collectors/gh.md | Collector recipe for GitHub PR/review activity via gh search prs. |

Comment on lines +24 to +28
Collectors live in `~/.config/opencode/kb-collectors/`. Each file is a self-contained markdown recipe that describes one data source. Read all `*.md` files in that directory and sort them by the `priority` field in their YAML frontmatter (lower number = runs first). Read each file's body now so you can apply its query recipe and extraction rules during collection.

**Benign failure modes** (neither loses correctness): a missed match (stale/absent `kb-meta.yaml`) just wastes one transcript read; an over-match (a session in an excluded worktree that wasn't really part of the change) just relies on the better, distilled artifact instead of the transcript.
Some collectors perform a **runtime enabled check** at the start of their body (e.g. verifying a token or config value exists before proceeding). Honor those checks: if a collector's body says to skip, log the reason and move on.

## Enrichment Steps
> **To add a new data source:** drop a new `.md` file into `~/.config/opencode/kb-collectors/`. No changes to this orchestrator needed. To disable a source, remove or don't add its collector file.
Comment on lines +22 to +24
## Step 2 — Load collectors

```sql
SELECT id, directory, title, time_updated
FROM session
WHERE time_updated BETWEEN :start_ms AND :end_ms
AND directory NOT IN ('/abs/worktree/a', '/abs/worktree/b');
-- returned sessions are the ONLY ones that need a transcript read;
-- excluded directories are covered by the durable change artifacts instead.
```
Collectors live in `~/.config/opencode/kb-collectors/`. Each file is a self-contained markdown recipe that describes one data source. Read all `*.md` files in that directory and sort them by the `priority` field in their YAML frontmatter (lower number = runs first). Read each file's body now so you can apply its query recipe and extraction rules during collection.
Comment on lines +18 to +20
for meta in $KB_ROOT/openspec/*/changes/archive/<date>-*/kb-meta.yaml; do
  grep '^worktree:' "$meta" | awk '{print $2}'
done

Before running the `opencode` collector, build the session exclusion set from the `openspec` collector output (priority 0, runs first):

For each date being enriched, read `$KB_ROOT/openspec/*/changes/archive/<date>-*/kb-meta.yaml` and collect every `worktree:` value. This set is passed into the `opencode` collector as the `NOT IN (...)` list. Sessions in excluded worktrees are already covered by the durable OpenSpec change artifacts (`design.md`/specs); they do not need a transcript read.
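
The two pieces described above (collecting `worktree:` values, then filtering sessions with `NOT IN`) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, `kb-meta.yaml` is read line by line rather than with a YAML parser, and the `session` columns are the ones shown in the quoted SQL.

```python
import sqlite3
from pathlib import Path


def excluded_worktrees(kb_root: Path, day: str) -> set[str]:
    """Collect `worktree:` values from changes archived on `day` (YYYY-MM-DD)."""
    found = set()
    pattern = f"openspec/*/changes/archive/{day}-*/kb-meta.yaml"
    for meta in kb_root.glob(pattern):
        for line in meta.read_text(encoding="utf-8").splitlines():
            if line.startswith("worktree:"):
                value = line.partition(":")[2].strip().strip("'\"")
                if value:
                    found.add(value)
    return found


def sessions_needing_transcripts(
    conn: sqlite3.Connection, start_ms: int, end_ms: int, excluded: set[str]
) -> list[tuple]:
    """Return sessions in the window whose directory is not excluded."""
    sql = (
        "SELECT id, directory, title, time_updated FROM session "
        "WHERE time_updated BETWEEN ? AND ?"
    )
    params: list = [start_ms, end_ms]
    if excluded:
        # One placeholder per worktree keeps the paths out of the SQL text.
        placeholders = ", ".join("?" for _ in excluded)
        sql += f" AND directory NOT IN ({placeholders})"
        params.extend(sorted(excluded))
    return conn.execute(sql, params).fetchall()
```

Binding the worktrees as parameters avoids quoting problems with paths that contain apostrophes, and an empty exclusion set simply drops the `NOT IN` clause.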
Comment on lines +10 to +14
Token is in `~/.config/team-context-mcp/.env` as `SLACK_USER_TOKEN`. Set `SLACK_USER_ID` to your Slack user ID (find it in your Slack profile → "Copy member ID").
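
As an aside, the token lookup just described can also be sketched in Python. The helper name `read_env_value` is hypothetical, and the sketch assumes plain `KEY=value` lines, optionally quoted. Unlike a `cut -d= -f2` approach, it keeps everything after the first `=`, so a value that itself contains `=` is not truncated, and commented-out lines are ignored.

```python
from typing import Optional


def read_env_value(env_text: str, key: str) -> Optional[str]:
    """Return the value for `key` from .env-style text, or None if absent."""
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, sep, value = line.partition("=")  # split on the first '=' only
        if sep and name.strip() == key:
            return value.strip().strip("'\"")
    return None
```

A collector's runtime check could treat a `None` result as the signal to skip the source and log why.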

```bash
SLACK_TOKEN=$(grep SLACK_USER_TOKEN ~/.config/team-context-mcp/.env | cut -d= -f2)
SLACK_USER_ID="<your-slack-user-id>" # e.g. U01ABC23DEF
```
Comment on lines +6 to +8
# skip_bots: commit authors / PR actors to ignore
skip_bots: [dependabot]
---