Problem
When drafting an article for a specific topic/query, the system does not consistently surface all relevant analyzed articles. This reduces factual coverage and makes the output feel under-grounded in the Telegram-derived corpus.
Today we have:
- Topic discovery via embedding clustering
- Query-driven writing via nearest-neighbor retrieval over article embeddings
However, users report that many related articles are not being included for a given topic.
Goals
- Increase recall for topic/query-driven writing so the retrieved source set is more complete and representative.
- Preserve precision (avoid flooding with unrelated sources).
- Keep the system transparent: users should be able to see why an article was selected.
Proposed approach
- Add a two-stage retrieval strategy:
  - Broad candidate fetch (larger K, e.g. 50-200) using pgvector cosine distance
  - Re-rank + diversify (MMR-style) and downselect to N sources (e.g. 6-12)
- Incorporate additional signals:
  - Topic linkage graph (topic_articles)
  - Recency / popularity (views/forwards where available)
  - Source/domain quality heuristics (optional)
- Add debug output / telemetry:
  - Print top candidates with distance/similarity
  - Persist retrieval diagnostics in DB for later audit
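The re-rank and downselect stage could be sketched roughly as below. This is a minimal illustration, not the project's implementation: it assumes the broad candidate fetch has already returned a mapping of article id to embedding, and the names `mmr_select`, `lambda_mult`, and the example ids are hypothetical. It uses the standard MMR scoring (relevance to the query minus similarity to already-selected items) and returns each selected id with its query similarity so the result can be printed for debugging.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Vector,
    candidates: Dict[str, Vector],  # hypothetical: output of the broad fetch
    n: int,
    lambda_mult: float = 0.7,  # assumed default; 1.0 means pure relevance
) -> List[Tuple[str, float]]:
    """Greedy MMR: pick n candidates, trading relevance against redundancy."""
    relevance = {aid: cosine_similarity(query, vec) for aid, vec in candidates.items()}
    remaining = sorted(candidates)  # sorted ids keep tie-breaking deterministic
    selected: List[Tuple[str, float]] = []
    while remaining and len(selected) < n:
        best_id, best_score = None, -math.inf
        for aid in remaining:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine_similarity(candidates[aid], candidates[sid]) for sid, _ in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[aid] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_id, best_score = aid, score
        selected.append((best_id, relevance[best_id]))
        remaining.remove(best_id)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = {
        "a": [0.9, 0.1],
        "b": [0.9, 0.12],  # near-duplicate of "a"
        "c": [0.8, -0.6],  # less similar to the query, but adds diversity
    }
    for article_id, similarity in mmr_select(query, candidates, n=2, lambda_mult=0.5):
        print(f"{article_id}\tsimilarity={similarity:.4f}")
```

With `lambda_mult=0.5` the near-duplicate "b" is skipped in favour of "c"; with `lambda_mult=1.0` the function degrades to plain nearest-neighbor ordering. The right trade-off for this corpus would need to be tuned against real queries.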
Acceptance criteria
- Given a stable analyzed DB snapshot, writing for a query returns a source set with visibly higher topical coverage than the current single-stage nearest-neighbor retrieval.
- Retrieval prints (or can be enabled to print) the selected sources and their similarity scores.
- Unit/integration test added for retrieval selection logic (deterministic with fixed embeddings).
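A deterministic selection test with fixed embeddings might look like the sketch below. It is an assumption-laden illustration: `select_sources` is a hypothetical stand-in for the real selection logic, and the article ids and vectors are made up. The point it shows is that hard-coded embeddings plus an explicit tie-break on article id make the expected result independent of input order.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_sources(query: Vector, embeddings: Dict[str, Vector], n: int) -> List[str]:
    """Hypothetical stand-in: the n closest ids, ties broken by article id."""
    ranked = sorted(
        embeddings,
        key=lambda aid: (cosine_distance(query, embeddings[aid]), aid),
    )
    return ranked[:n]


# Fixed embeddings, so the test never depends on a live embedding model.
FIXED_EMBEDDINGS = {
    "article-1": [1.0, 0.0, 0.0],
    "article-2": [0.0, 1.0, 0.0],
    "article-3": [1.0, 1.0, 0.0],
    "article-4": [1.0, 0.0, 0.0],  # same vector as article-1, exercises the tie-break
}
QUERY = [1.0, 0.0, 0.0]


def test_selection_is_deterministic() -> None:
    expected = ["article-1", "article-4", "article-3"]
    assert select_sources(QUERY, FIXED_EMBEDDINGS, n=3) == expected

    # The result must not change with the order rows come back from the DB.
    rng = random.Random(0)
    ids = list(FIXED_EMBEDDINGS)
    for _ in range(5):
        rng.shuffle(ids)
        shuffled = {aid: FIXED_EMBEDDINGS[aid] for aid in ids}
        assert select_sources(QUERY, shuffled, n=3) == expected
```

The function is written so a pytest-style runner would collect it by name; the same fixtures could be reused for the re-ranking stage once it exists.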
Notes
- Keep TLS verification enabled; no changes to security posture.
- Avoid calling external web search: this should rely on our analyzed corpus.
Tasks