Integrate embedding & semantic search pipeline #33

@justinmadison

Summary

Compute vector embeddings for each normalized article and provide a semantic search interface to find similar articles by content.

Motivation

  • Enables “More like this” recommendations and clustering of related stories.
  • Powers a search experience that goes beyond keyword matching, surfacing semantically relevant results.

Scope

In scope: implementation, tests, documentation

Acceptance Criteria

  • compute_embedding(text) returns a fixed-length float vector.
  • embed_task(article_id) stores the embedding for the article in the vector index.
  • search_similar(query, top_k) returns the top K most semantically similar articles.
  • CLI commands embed and search run without errors and print expected output.
  • All tests pass in CI, and the README documents the embedding and search workflows.

Additional Context

  1. Add dependencies
    • Add sentence-transformers and faiss-cpu (or equivalent) to /nlp/requirements.txt.
  2. Core function signatures (/nlp/core.py)
    • def compute_embedding(text: str) -> List[float]
    • def index_article(article_id: str, embedding: List[float]) -> None
    • def search_similar(query: str, top_k: int = 5) -> List[Dict]
  3. Celery task hook (/nlp/tasks.py)
    • Register:
      @app.task
      def embed_task(article_id: str) -> List[float]
    • The task should call compute_embedding on the article text, then pass the result to index_article.
  4. CLI entrypoints (/nlp/cli.py)
    • python -m nlp.cli embed --article-id=<id>
    • python -m nlp.cli search --query="..." --top-k=5
  5. Tests & documentation
    • Create /nlp/tests/test_core_embedding.py to:
      • Assert compute_embedding() returns a vector of the expected dimension.
      • Assert that search_similar() returns a non-empty list for a sample query.
    • Create /nlp/tests/test_embed_task.py to:
      • Mock the DB and vector store, and verify that embed_task() calls both core functions.
    • Update /nlp/README.md with:
      • Installation steps
      • How to run embed_task via Celery
      • CLI usage examples for embed and search
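
As an illustration of the test_embed_task.py item above, the sketch below shows one way to verify the call sequence with `unittest.mock`. It assumes the Celery task delegates to a plain function that receives its collaborators as arguments; `run_embed_task` and `fetch_text` are hypothetical names introduced here, not part of the issue, and the real task may wire its dependencies differently (for example via module-level imports patched with `mock.patch`).

```python
from typing import Callable, List
from unittest.mock import Mock


def run_embed_task(
    article_id: str,
    fetch_text: Callable[[str], str],
    compute_embedding: Callable[[str], List[float]],
    index_article: Callable[[str, List[float]], None],
) -> List[float]:
    """Hypothetical body for embed_task: fetch the text, embed it, index it."""
    text = fetch_text(article_id)
    embedding = compute_embedding(text)
    index_article(article_id, embedding)
    return embedding


def test_embed_task_calls_both_core_functions() -> None:
    # Mocks stand in for the DB lookup, the embedding model, and the vector store.
    fetch_text = Mock(return_value="normalized article body")
    compute_embedding = Mock(return_value=[0.1, 0.2, 0.3])
    index_article = Mock(return_value=None)

    result = run_embed_task(
        "article-123", fetch_text, compute_embedding, index_article
    )

    # Each collaborator is called exactly once with the expected arguments.
    fetch_text.assert_called_once_with("article-123")
    compute_embedding.assert_called_once_with("normalized article body")
    index_article.assert_called_once_with("article-123", [0.1, 0.2, 0.3])
    assert result == [0.1, 0.2, 0.3]
```

Keeping the task body free of Celery-specific code means the test needs neither a broker nor a worker; the `@app.task` wrapper can stay a thin shim around this function.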
