Improve topic-focused retrieval coverage for writing (increase recall, reduce missed related articles) #1

Description

@szverev

Problem

When drafting an article for a specific topic/query, the system does not consistently surface all relevant analyzed articles. This reduces factual coverage and leaves the output weakly grounded in the Telegram-derived corpus.

Today we have:

  • Topic discovery via embedding clustering
  • Query-driven writing via nearest-neighbor retrieval over article embeddings

However, users report that many related articles are not being included for a given topic.

Goals

  • Increase recall for topic/query-driven writing so the retrieved source set is more complete and representative.
  • Preserve precision (avoid flooding with unrelated sources).
  • Keep the system transparent: users should be able to see why an article was selected.

Proposed approach

  • Add a two-stage retrieval strategy:
    1. Broad candidate fetch (larger K, e.g. 50-200) using pgvector cosine distance
    2. Re-rank + diversify (MMR-style) and downselect to N sources (e.g. 6-12)
  • Incorporate additional signals:
    • Topic linkage graph (topic_articles)
    • Recency / popularity (views/forwards where available)
    • Source/domain quality heuristics (optional)
  • Add debug output / telemetry:
    • Print top candidates with distance/similarity
    • Persist retrieval diagnostics in DB for later audit
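
The two-stage strategy above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `articles` table, its `id` and `embedding` columns, the `mmr_select` helper, and the `lam` trade-off parameter are all assumptions made for the example. Stage 1 is shown only as a query string (pgvector's `<=>` operator returns cosine distance); stage 2 is a plain-Python MMR pass over the fetched candidates.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical stage-1 query: table and column names are assumptions.
# `<=>` is pgvector's cosine distance operator; k would be 50-200.
CANDIDATE_SQL = """
SELECT id, embedding, embedding <=> %(q)s AS distance
FROM articles
ORDER BY embedding <=> %(q)s
LIMIT %(k)s
"""


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def mmr_select(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    n: int,
    lam: float = 0.7,
) -> List[Tuple[str, float, float]]:
    """Stage 2: greedy MMR over the broad candidate set.

    Each step picks the candidate maximizing
        lam * sim(query, c) - (1 - lam) * max sim(c, already_selected).
    Returns (article_id, relevance, mmr_score) in selection order, so the
    scores can be printed or persisted as diagnostics.
    """
    relevance = {cid: cosine(query, emb) for cid, emb in candidates.items()}
    remaining = sorted(candidates)  # sorted ids keep tie-breaking deterministic
    selected: List[Tuple[str, float, float]] = []
    while remaining and len(selected) < n:
        best_id, best_score = None, -math.inf
        for cid in remaining:
            redundancy = max(
                (cosine(candidates[cid], candidates[sid]) for sid, _, _ in selected),
                default=0.0,
            )
            score = lam * relevance[cid] - (1.0 - lam) * redundancy
            if score > best_score:
                best_id, best_score = cid, score
        selected.append((best_id, relevance[best_id], best_score))
        remaining.remove(best_id)
    return selected
```

With `lam=1.0` this reduces to plain top-N by similarity; lowering `lam` trades some relevance for diversity, which is where near-duplicate articles get pushed out in favor of related but distinct ones. Additional signals such as recency or topic linkage could be folded into the relevance term, but that weighting is left open here.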

Acceptance criteria

  • Given a stable analyzed DB snapshot, writing for a query returns a source set with visibly higher topical coverage than the current single-stage nearest-neighbor retrieval.
  • Retrieval prints (or can be enabled to print) the selected sources and their similarity scores.
  • Unit/integration test added for retrieval selection logic (deterministic with fixed embeddings).
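
As one possible shape for the diagnostics criterion, the sketch below formats candidates with both distance and similarity. The `format_diagnostics` name and the `(article_id, cosine_distance)` row layout are assumptions for illustration; the only fact relied on is that similarity is `1 - cosine distance`. Sorting by distance and then by id keeps the output stable, which helps when asserting on it in a test with fixed embeddings.

```python
from typing import Iterable, List, Set, Tuple


def format_diagnostics(
    rows: Iterable[Tuple[int, float]],
    selected_ids: Set[int],
) -> List[str]:
    """Render one line per candidate, closest first.

    rows: (article_id, cosine_distance) pairs, e.g. from the candidate fetch.
    selected_ids: ids that survived re-ranking; these are marked with '*'.
    """
    # Secondary sort on id makes ordering deterministic when distances tie.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    lines = []
    for rank, (article_id, distance) in enumerate(ordered, start=1):
        similarity = 1.0 - distance  # cosine similarity from cosine distance
        mark = "*" if article_id in selected_ids else " "
        lines.append(
            f"{mark} {rank:>3}  id={article_id}  "
            f"distance={distance:.4f}  similarity={similarity:.4f}"
        )
    return lines


if __name__ == "__main__":
    for line in format_diagnostics([(7, 0.25), (3, 0.1), (5, 0.25)], {3, 5}):
        print(line)
```

The same rows could be written to a diagnostics table instead of printed; the schema for that is not specified in this issue.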

Notes

  • Keep TLS verification enabled; no changes to security posture.
  • Avoid calling external web search; retrieval should rely only on our analyzed corpus.

Tasks

  • Implement broad candidate retrieval + reranking
  • Add MMR/diversification (or similar)
  • Add diagnostics (console + optional persisted metadata)
  • Add tests

Metadata

Labels: enhancement, retrieval