feat(kb-enrich): pluggable collector architecture #31
Refactors `/kb-enrich` into a two-layer design: a thin orchestrator command plus a directory of self-contained collector files, one per data source.

Changes:
- `commands/kb-enrich.md`: rewritten as a 5-step orchestrator (resolve config, resolve date range, load collectors, run collectors, enrichment). Sources are no longer hard-coded; the orchestrator reads whatever collectors are present in `kb-collectors/`.
- `kb-collectors/` (new): one `.md` file per source, each with YAML frontmatter (`name`, `enabled`, `priority`, `authoritative_for`) and a body with query recipe, triage rules, and extraction rules.
  - `openspec.md` (priority 0): builds session exclusion set first
  - `granola.md` (priority 1): meetings, decisions, people-contact
  - `slack.md` (priority 2): informal decisions, action items
  - `linear.md` (priority 3): tickets, completed work
  - `gh.md` (priority 4): shipped PRs, reviews
  - `opencode.md` (priority 5): coding sessions (with exclusion applied)
  - `google-chat.md` (priority 6, disabled by default): opt-in for Chat

To add a new source: drop a `.md` file into `kb-collectors/`. No changes to the orchestrator needed. To disable a source or configure per-machine values (org lists, workspace slugs, token paths), edit the collector's frontmatter.
athal7
left a comment
i like the overall shape, a few things i would probably do differently, and a few things that aren't relevant to my dotfiles
| 1. **Collect excluded worktrees.** For each date being enriched, read every `~/.local/share/kb/openspec/*/changes/archive/<date>-*/kb-meta.yaml` and collect its `worktree:` value (the absolute repo/worktree root, stamped at archive). That set is the exclusion list. | ||
| 2. **Skip those sessions.** When scanning **opencode** sessions, a session is identified by its `directory` column in the `session` table of `~/.local/share/opencode/opencode.db`. SKIP any session whose `directory` is in the exclusion set — for those, narrate from the change's `design.md`/specs, not the transcript. Only sessions NOT covered by an archived change get a transcript read. | ||
| 3. **Filter at query time.** Pass the collected worktrees as the `NOT IN (...)` list and bound by the date window (`time_updated` is epoch-ms): | ||
| Enrich the gap since the last run, not a hard single day. The most recent `$KB_ROOT/journal/YYYY-MM-DD.md` is the last-run marker: enrich each date from (last journal date + 1) through today, inclusive. This makes a Monday run sweep the trailing weekend and lets a skipped run self-heal on the next run. If no prior journal exists, default to today. An explicit date or range in `$ARGUMENTS` overrides this. |
this line conflicts with the re-run guard above. i like this version better
Removed. The date-range logic in Step 1 already covers this: the most recent journal file is the last-run marker and the window starts at last date + 1, so a date that's already enriched falls outside the window and won't re-run.
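For reference, the date-window rule quoted in this thread (enrich from last journal date + 1 through today, defaulting to today when no journal exists) can be sketched as below. This is only an illustration under assumptions: the function name `enrichment_dates` and its signature are hypothetical, the journal directory is assumed to hold files named `YYYY-MM-DD.md`, and the explicit `$ARGUMENTS` override is not modelled.

```python
import re
from datetime import date, timedelta
from pathlib import Path

# Journal files are assumed to be named YYYY-MM-DD.md, as in the quoted recipe.
JOURNAL_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.md$")


def enrichment_dates(journal_dir: Path, today: date) -> list[date]:
    """Hypothetical helper: dates from (last journal date + 1) through today."""
    stamped = []
    if journal_dir.is_dir():
        for entry in journal_dir.iterdir():
            match = JOURNAL_NAME.match(entry.name)
            if not match:
                continue  # ignore files that are not dated journals
            try:
                stamped.append(date(*map(int, match.groups())))
            except ValueError:
                continue  # e.g. 2024-13-40.md is not a real date
    if not stamped:
        return [today]  # no prior journal: default to today
    start = max(stamped) + timedelta(days=1)
    span = (today - start).days
    # When the last journal is already today's, span is -1 and the list is
    # empty, which is what makes a separate re-run guard unnecessary.
    return [start + timedelta(days=offset) for offset in range(span + 1)]
```

With a last journal dated Friday and a run on Monday, the sketch returns Saturday, Sunday and Monday, which matches the "sweep the trailing weekend" behaviour described in the recipe.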
| ## Enrichment Steps | ||
| 1. Parse the YAML frontmatter to get `name`, `enabled`, `priority`, `authoritative_for`. | ||
| 2. Skip any collector with `enabled: false`. |
i think external config may be better, so as to not dirty chezmoi state
how do you mean?
Dropped `enabled` from frontmatter entirely; presence of the file is now the toggle. Disabling a collector means removing its file (or never adding it), which never touches a chezmoi-tracked source file.
| # orgs: GitHub orgs to scope searches to. When set, each org is added as | ||
| # `org:NAME` to the search query. Leave empty for no org filter (searches | ||
| # across all repos you have access to). | ||
| orgs: [] |
there is a separate github org config in local vars, would be great to reuse that
oh yea, you have that set up for chezmoi, I don't. I probably SHOULD.
Updated: now reads from `chezmoi data --format json | jq '[.orgs | keys[]]'`. Falls back to no org filter if `orgs` is empty.
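A rough sketch of that lookup, for anyone following along: it takes the JSON text that `chezmoi data --format json` prints and turns the keys of `orgs` into `org:NAME` qualifiers. The helper name `org_qualifiers` is hypothetical, and the sketch assumes `orgs` is a map keyed by org name (as the `keys[]` filter implies); the actual collector may differ.

```python
import json


def org_qualifiers(chezmoi_data_json: str) -> list[str]:
    """Hypothetical helper: build `org:NAME` search qualifiers from chezmoi data."""
    data = json.loads(chezmoi_data_json)
    # Assumption: `orgs` is a mapping keyed by org name. A missing or empty
    # value means "no org filter", mirroring the fallback described above.
    orgs = data.get("orgs") or {}
    # jq's `keys` returns keys in sorted order, so sort here as well.
    return [f"org:{name}" for name in sorted(orgs)]
```

An empty result means the search runs unscoped, across every repo the user can access.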
| name: gh | ||
| enabled: true | ||
| priority: 4 | ||
| authoritative_for: [shipped-code, reviews] |
not sure i understand how this is used
It's metadata the orchestrator doesn't act on — mainly documentation of what each collector is the source of truth for (to help a reader understand dedup decisions, like why openspec sessions take precedence over opencode session transcripts). Happy to drop it if it feels like overhead without benefit.
| --- | ||
| name: gh | ||
| enabled: true | ||
| priority: 4 |
not sure we're getting a ton from priority?
The main place it matters is openspec (0) needing to run before opencode (5) so the session exclusion set is built first. The middle-ground ones (slack, linear, gh) don't have hard ordering requirements. Happy to simplify — could drop priority and just document the openspec-before-opencode dependency explicitly in the orchestrator instead.
| @@ -0,0 +1,38 @@ | |||
| --- | |||
i think collectors should be user specific, and i don't use this (or granola)
Oh, yea, those shouldn't have gotten shipped to you, I'll remove them.
Removed both. Agreed — collectors should be user-specific files you add to your own dotfiles, not shipped in the base set.
| If `workspace` is set in frontmatter, pass it with `--workspace`: | ||
| ```bash | ||
| linear issue mine --updated-after YYYY-MM-DD --all-states --no-pager |
Switched to the GraphQL API via the linear skill — same approach as your other Linear integrations.
- Remove re-run guard (redundant with Step 1 date-range logic)
- Drop 'enabled' frontmatter field; presence/absence of file is the toggle
- gh.md: read orgs from chezmoi data instead of hardcoded frontmatter
- linear.md: switch from linear CLI to GraphQL API via 'linear' skill
- Remove google-chat.md and granola.md (user-specific collectors)
Pull request overview
Refactors the /kb-enrich OpenCode command into an orchestrator-style recipe that discovers and runs per-source “collector” markdown files under ~/.config/opencode/kb-collectors/, enabling source-specific query/extraction rules and preserving the OpenSpec-first session dedup flow.
Changes:
- Rewrites `kb-enrich` into stepwise orchestration: resolve config, resolve date window, load collectors, apply OpenSpec-based session exclusion, run collectors, then write KB outputs.
- Adds collector recipe files for OpenSpec, OpenCode sessions, Slack, Linear, and GitHub.
- Moves source-specific querying/triage/extraction guidance into self-contained collector documents.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| dot_config/opencode/commands/kb-enrich.md | Orchestrator-style KB enrichment flow, including collector loading + OpenSpec dedup sequencing. |
| dot_config/opencode/kb-collectors/openspec.md | Priority-0 collector defining OpenSpec archive reads and building the session exclusion set. |
| dot_config/opencode/kb-collectors/opencode.md | Collector for querying the local OpenCode session DB with exclusion applied upstream. |
| dot_config/opencode/kb-collectors/slack.md | Collector recipe for Slack search + thread reads and extraction/skip guidance. |
| dot_config/opencode/kb-collectors/linear.md | Collector recipe for Linear issue activity via the Linear skill/GraphQL. |
| dot_config/opencode/kb-collectors/gh.md | Collector recipe for GitHub PR/review activity via gh search prs. |
| Collectors live in `~/.config/opencode/kb-collectors/`. Each file is a self-contained markdown recipe that describes one data source. Read all `*.md` files in that directory and sort them by the `priority` field in their YAML frontmatter (lower number = runs first). Read each file's body now so you can apply its query recipe and extraction rules during collection. | ||
| **Benign failure modes** (neither loses correctness): a missed match (stale/absent `kb-meta.yaml`) just wastes one transcript read; an over-match (a session in an excluded worktree that wasn't really part of the change) just relies on the better, distilled artifact instead of the transcript. | ||
| Some collectors perform a **runtime enabled check** at the start of their body (e.g. verifying a token or config value exists before proceeding). Honor those checks: if a collector's body says to skip, log the reason and move on. | ||
| ## Enrichment Steps | ||
| > **To add a new data source:** drop a new `.md` file into `~/.config/opencode/kb-collectors/`. No changes to this orchestrator needed. To disable a source, remove or don't add its collector file. |
| ## Step 2 — Load collectors | ||
| ```sql | ||
| SELECT id, directory, title, time_updated | ||
| FROM session | ||
| WHERE time_updated BETWEEN :start_ms AND :end_ms | ||
| AND directory NOT IN ('/abs/worktree/a', '/abs/worktree/b'); | ||
| -- returned sessions are the ONLY ones that need a transcript read; | ||
| -- excluded directories are covered by the durable change artifacts instead. | ||
| ``` | ||
| Collectors live in `~/.config/opencode/kb-collectors/`. Each file is a self-contained markdown recipe that describes one data source. Read all `*.md` files in that directory and sort them by the `priority` field in their YAML frontmatter (lower number = runs first). Read each file's body now so you can apply its query recipe and extraction rules during collection. |
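The load step quoted above (read every `*.md` in the collectors directory, then sort by `priority`, lower first) might look roughly like this. It is a sketch under assumptions, not the orchestrator's actual behaviour: `read_frontmatter` and `load_collectors` are hypothetical names, the parser only understands flat `key: value` lines rather than full YAML, and the default priority of 99 for files without one is invented for illustration.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` markers.

    Deliberately not a YAML parser: nested values and lists stay raw strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        if line.lstrip().startswith("#") or ":" not in line:
            continue  # skip comment lines and anything that is not a pair
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def load_collectors(collector_dir: Path) -> list[tuple[int, str, Path]]:
    """Return (priority, name, path) for each collector, lowest priority first."""
    found = []
    for path in collector_dir.glob("*.md"):
        meta = read_frontmatter(path.read_text(encoding="utf-8"))
        priority = int(meta.get("priority", 99))  # 99 is an assumed default
        found.append((priority, meta.get("name", path.stem), path))
    # Ties are broken by name so the order is deterministic.
    return sorted(found, key=lambda item: (item[0], item[1]))
```

Because discovery is just a directory glob, adding or removing a file is the only toggle needed, which is the property the review thread settles on.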
| for meta in $KB_ROOT/openspec/*/changes/archive/<date>-*/kb-meta.yaml; do | ||
| grep '^worktree:' "$meta" | awk '{print $2}' | ||
| done |
| Before running the `opencode` collector, build the session exclusion set from the `openspec` collector output (priority 0, runs first): | ||
| For each date being enriched, read `$KB_ROOT/openspec/*/changes/archive/<date>-*/kb-meta.yaml` and collect every `worktree:` value. This set is passed into the `opencode` collector as the `NOT IN (...)` list. Sessions in excluded worktrees are already covered by the durable OpenSpec change artifacts (`design.md`/specs); they do not need a transcript read. |
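The two halves of the dedup flow described above (collect `worktree:` values from archived changes, then pass them as the `NOT IN (...)` list) can be sketched together. The assumptions are loud ones: both function names are hypothetical, `kb-meta.yaml` is read line by line rather than with a YAML parser, and the query touches only the four `session` columns named in the quoted SQL.

```python
import sqlite3
from pathlib import Path


def excluded_worktrees(kb_root: Path, day: str) -> set[str]:
    """Collect `worktree:` values from changes archived on `day` (YYYY-MM-DD)."""
    roots: set[str] = set()
    pattern = f"openspec/*/changes/archive/{day}-*/kb-meta.yaml"
    for meta in kb_root.glob(pattern):
        for line in meta.read_text(encoding="utf-8").splitlines():
            if line.startswith("worktree:"):
                # Keep the whole value so paths containing spaces survive.
                value = line.partition(":")[2].strip().strip("\"'")
                if value:
                    roots.add(value)
    return roots


def sessions_needing_transcripts(
    conn: sqlite3.Connection, start_ms: int, end_ms: int, excluded: set[str]
) -> list[tuple]:
    """Return sessions in the window whose directory is not excluded."""
    sql = (
        "SELECT id, directory, title, time_updated FROM session "
        "WHERE time_updated BETWEEN ? AND ?"
    )
    params: list = [start_ms, end_ms]
    if excluded:
        # One placeholder per worktree keeps the paths out of the SQL text.
        marks = ", ".join("?" for _ in excluded)
        sql += f" AND directory NOT IN ({marks})"
        params.extend(sorted(excluded))
    return conn.execute(sql, params).fetchall()
```

Binding the worktree paths as parameters, rather than interpolating them into the query string, avoids quoting problems with unusual directory names.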
| Token is in `~/.config/team-context-mcp/.env` as `SLACK_USER_TOKEN`. Set `SLACK_USER_ID` to your Slack user ID (find it in your Slack profile → "Copy member ID"). | ||
| ```bash | ||
| SLACK_TOKEN=$(grep SLACK_USER_TOKEN ~/.config/team-context-mcp/.env | cut -d= -f2) | ||
| SLACK_USER_ID="<your-slack-user-id>" # e.g. U01ABC23DEF |
| # skip_bots: commit authors / PR actors to ignore | ||
| skip_bots: [dependabot] | ||
| --- |
Summary

Refactors `/kb-enrich` into a two-layer design: a thin orchestrator command plus a directory of self-contained collector files, one per data source.

Changes

`commands/kb-enrich.md` (rewritten as 5-step orchestrator)

The command no longer hard-codes any sources. It now:
- Resolves `KB_ROOT` and a re-run guard
- Loads `~/.config/opencode/kb-collectors/*.md`: reads frontmatter to get `name`, `enabled`, `priority`; sorts by priority

`kb-collectors/` (new directory, 7 collector files)

- `openspec.md`
- `granola.md`
- `slack.md`
- `linear.md`
- `gh.md`
- `opencode.md`
- `google-chat.md`

Each file is self-contained: YAML frontmatter (`name`, `enabled`, `priority`, `authoritative_for`) + a body with query recipe, triage rules, and extraction rules.

Why

- To add a source, drop a `.md` file in `kb-collectors/`. Zero changes to the orchestrator.
- To disable a source, set `enabled: false` in frontmatter.

The session exclusion (OpenSpec dedup) logic is preserved and works the same way: the `openspec` collector runs at priority 0 and builds the exclusion set before `opencode` runs at priority 5.