
Feature/5 embedding and indexing#6

Merged
DivyenduDutta merged 21 commits into master from feature/5-embedding-and-indexing
Jan 3, 2026


@DivyenduDutta

Owner

#5

Add embedding and indexing module

Added

  • Embedding module that uses a sentence transformer (an encoder-only transformer model) to generate embeddings of the chunks
  • Indexing module that builds a vector index using FAISS. This speeds up lookup during semantic search for the chunks relevant to an input query

Added
- Python commands to run the chunker and embedding modules
- other relevant info about these modules in their sections in the README
Modified package path from `atlas.core.ingest` to `atlas.core.ingester`
Added a configuration file for the encoder. This allows changing the encoder used and its settings.
Added associated dataclass and configuration loading function.
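A minimal sketch of what a configuration dataclass and its loading function could look like. The JSON format, the field names (`model_name`, `device`, `batch_size`, `normalize_embeddings`), and the function name `load_encoder_config` are assumptions for illustration and are not taken from this PR; only the model name `all-MiniLM-L6-v2` appears in the commits.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical encoder settings; every field name here is an assumption."""

    model_name: str = "all-MiniLM-L6-v2"
    device: str = "cpu"
    batch_size: int = 32
    normalize_embeddings: bool = True


def load_encoder_config(text: str) -> EncoderConfig:
    """Parse a JSON document into an EncoderConfig, rejecting unknown keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(EncoderConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown encoder config keys: {sorted(unknown)}")
    # Keys missing from the file fall back to the dataclass defaults.
    return EncoderConfig(**raw)


config = load_encoder_config('{"model_name": "all-MiniLM-L6-v2", "batch_size": 16}')
print(config)
```

Rejecting unknown keys is one possible design choice: a misspelled setting fails loudly instead of being silently ignored.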
Converted certain abstract methods to implementation methods in abstract base class and removed those methods from implementation class.
Exploring SentenceTransformer's `all-MiniLM-L6-v2` encoder model
Introduced a BaseEmbedder interface to standardize the embedding workflow and added a concrete SentenceTransformerEmbedder implementation. This provides a clean abstraction for embedding generation and simplifies future embedder extensions.
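The split between an abstract interface and a concrete embedder can be sketched as follows. This is an illustration of the pattern only: the method names `encode` and `embed_chunks`, the chunk dictionary layout, and the `ToyEmbedder` class are hypothetical, and the toy class stands in for a real sentence-transformer model so the example runs without any dependencies.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseEmbedder(ABC):
    """Sketch of an embedder interface; method names are assumptions."""

    @abstractmethod
    def encode(self, texts: List[str]) -> List[List[float]]:
        """Return one vector per input text."""

    def embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """Shared workflow, implemented once in the base class."""
        vectors = self.encode([chunk["text"] for chunk in chunks])
        return [{**chunk, "embedding": vec} for chunk, vec in zip(chunks, vectors)]


class ToyEmbedder(BaseEmbedder):
    """Stand-in for a real encoder: [character count, vowel count]."""

    def encode(self, texts: List[str]) -> List[List[float]]:
        return [
            [float(len(t)), float(sum(ch in "aeiou" for ch in t))]
            for t in texts
        ]


embedded = ToyEmbedder().embed_chunks([{"id": 0, "text": "vector index"}])
print(embedded)
```

Only `encode` is abstract here, so a new embedder needs to supply just the model-specific step while the chunk handling stays in the base class.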
…ests

Added conftest.py to hold shared pytest fixtures
Updated existing unit tests appropriately
Added unit tests for newly added embedding module
Set the device to CUDA or CPU based on the presence of PyTorch. Fixes a CUDA-related error on GitHub CI
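One way such a device fallback is commonly written is shown below. It is a sketch, not the code from this PR: the function name `select_device` is hypothetical, and the assumption is that the check combines "is PyTorch importable" with `torch.cuda.is_available()`.

```python
def select_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch  # optional dependency; may be absent
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = select_device()
print(device)
```

On a CI runner without a GPU this resolves to `"cpu"`, which avoids asking the encoder to load onto a CUDA device that does not exist.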
Made minor updates to the architecture diagram
Added FAISS library as dependency to be installed via conda
Changed NumPy to be installed via conda instead of pip, because the pip build of NumPy does not play well with the conda build of FAISS
Added missing docstrings and logging statements
FAISS vector store class
- build index
- search the index for an input query
- save and load the index and metadata file
This just reads the JSON with all the chunk embeddings, builds the index, and saves it
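The build, search, save, and load responsibilities can be illustrated with a dependency-free stand-in. To be clear, this is not FAISS and not the class from this PR: it performs exact cosine-similarity search in pure Python, and the class name `InMemoryVectorStore`, the record layout, and the single-file JSON persistence are all assumptions made to keep the sketch runnable.

```python
import json
import math
import os
import tempfile
from typing import Dict, List, Tuple


class InMemoryVectorStore:
    """Exact cosine search in pure Python; a stand-in, not FAISS."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []
        self.metadata: List[Dict] = []

    @staticmethod
    def _normalize(vec: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def build_index(self, records: List[Dict]) -> None:
        """Store unit-length vectors and keep the remaining fields as metadata."""
        self.vectors = [self._normalize(r["embedding"]) for r in records]
        self.metadata = [
            {k: v for k, v in r.items() if k != "embedding"} for r in records
        ]

    def search(self, query: List[float], k: int = 1) -> List[Tuple[float, Dict]]:
        """Return the k (score, metadata) pairs with the highest cosine similarity."""
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), meta)
            for vec, meta in zip(self.vectors, self.metadata)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"vectors": self.vectors, "metadata": self.metadata}, fh)

    @classmethod
    def load(cls, path: str) -> "InMemoryVectorStore":
        store = cls()
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        store.vectors, store.metadata = data["vectors"], data["metadata"]
        return store


records = [
    {"chunk_id": "a", "embedding": [1.0, 0.0]},
    {"chunk_id": "b", "embedding": [0.0, 1.0]},
]
store = InMemoryVectorStore()
store.build_index(records)

# Round-trip through disk, then query the restored store.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")
    store.save(path)
    restored = InMemoryVectorStore.load(path)

top_score, top_meta = restored.search([0.9, 0.1], k=1)[0]
print(top_meta, round(top_score, 3))
```

The sketch scans every vector per query, which is the cost a dedicated vector index is meant to reduce. It also writes one JSON file for brevity, whereas the PR reports separate `index.faiss` and `metadata.json` outputs.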
Made the architecture diagram path branch invariant
@DivyenduDutta DivyenduDutta added this to the First Iteration milestone Jan 3, 2026
@DivyenduDutta DivyenduDutta self-assigned this Jan 3, 2026
@DivyenduDutta DivyenduDutta linked an issue Jan 3, 2026 that may be closed by this pull request
@DivyenduDutta

Owner Author

Results in index.faiss and metadata.json files

@codecov-commenter


⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 92.44604% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.14%. Comparing base (408b4c2) to head (8f32ad1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ore/embedder/sentence_transformer/impl_embedder.py | 73.52% | 7 Missing and 2 partials ⚠️ |
| atlas/core/indexer/base_vector_store.py | 76.47% | 4 Missing ⚠️ |
| atlas/core/indexer/faiss_vector_store.py | 95.00% | 1 Missing and 2 partials ⚠️ |
| atlas/core/embedder/base/base_embedder.py | 95.45% | 2 Missing ⚠️ |
| atlas/core/embedder/base/base_encoder.py | 84.61% | 2 Missing ⚠️ |
| ...core/embedder/sentence_transformer/impl_encoder.py | 96.29% | 0 Missing and 1 partial ⚠️ |
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #6      +/-   ##
==========================================
+ Coverage   82.14%   89.14%   +7.00%     
==========================================
  Files           7       15       +8     
  Lines         280      525     +245     
  Branches       36       49      +13     
==========================================
+ Hits          230      468     +238     
- Misses         41       43       +2     
- Partials        9       14       +5     


@DivyenduDutta DivyenduDutta merged commit 3cbba4e into master Jan 3, 2026
2 checks passed
@DivyenduDutta DivyenduDutta deleted the feature/5-embedding-and-indexing branch January 3, 2026 13:42

Development

Successfully merging this pull request may close these issues.

Add Embedding and Indexing Module

2 participants