DivyenduDutta · DivyenduDutta · Jan 3, 2026 · Dec 30, 2025 · Dec 30, 2025 · Dec 30, 2025
diff --git a/.gitignore b/.gitignore
@@ -215,3 +215,4 @@ logs/
 
 # Resources not needed to checkin
 Resources/*.json
+Resources/*.faiss
diff --git a/README.md b/README.md
@@ -39,15 +39,17 @@ RAG or Retrieval Augmented Generation is a technique used to retrieve external k
 
 RAG is good because:
 - Reduces hallucinations
-
 - Enables citations
-
 - Keeps answers faithful to source material
 
 #### Chunking
 
 [Chunker Module](atlas/core/chunker/README.md)
 
+#### Embedding and Indexing
+
+[Embedding Module](atlas/core/embedder/README.md)
+
 #### Obsidian
 
 [Obsidian](https://obsidian.md/) is a light weight application used to take notes and create knowledge bases. It saves all the notes as markdown making it easy to load, process and render a huge amount of notes.
@@ -76,7 +78,7 @@ So it follows the scaling law that even a small LLM when trained on enough quali
 
 ## Architecture
 
-Initial high level [architecture diagram](https://github.com/DivyenduDutta/Atlas/tree/master/Resources/Atlas_Architecture.png)
+High level [architecture diagram](Resources/Atlas_Architecture.png)
 
 A sample of the `obsidian_index.json` is as below:
 
@@ -120,10 +122,56 @@ Before committing changes run `pre-commit run --all-files` or `pre-commit run --
 
 Run `python .\atlas\core\ingest\obsidian_vault_processor.py`
 
-This will generate the `obsidian_index.json` in `/Resources` folder. This json file contains the processed data after ingesting and processing the notes from the obsidian vault.
+In the above script, modify
+- `obsidian_vault_path` to point to your Obsidian vault's root folder ie, the folder containing the `.obsidian` folder
+- `obsidian_index_path` to specify where the `obsidian_index.json` will be saved. This json file contains the processed data after ingesting and processing the notes from the obsidian vault. See [architecture](#architecture) section for the structure of this json.
+
+### Structural Chunker Module
+
+Run `python .\atlas\core\chunker\structural_chunker.py`
+
+In the above script, modify
+- `processed_data_path` to specify where the `obsidian_index.json` is present
+- `output_path` to specify where the `chunked_data.json` will be saved. This json file contains the chunks generated from the notes processed by the "Obsidian Vault Processor" module. See [`README` in `atlas/core/chunker`](atlas/core/chunker/README.md) for the structure of this json.
+- `max_words` to set the maximum number of words per chunk. Choose it primarily based on the token limit of the encoder model and the context size of the LLM used in the later modules.
+
+### Embedder Module
+
+Run `python .\atlas\core\embedder\sentence_transformer\impl_embedder.py`
+
+In the above script, modify
+- `chunk_data_path` to specify where the `chunked_data.json` is present
+- `output_path` to specify where `embedded_chunks.json` will be saved. This json has the same structure as
+`chunked_data.json` with an added `embedding` for each chunk. See [`README` in `atlas/core/embedder`](atlas/core/embedder/README.md) for the structure of this json.
+- `encoder_config_path` to specify your own configuration settings for the encoder model used to generate the chunk embeddings. The default configuration is [`atlas/core/configs/sentence_transformer_config.yaml`](atlas/core/configs/sentence_transformer_config.yaml); edit it to change the encoder model and its settings. The following can be changed:
 
-See architecture section for structure of this json.
+```yaml
+model_name: sentence-transformers/all-MiniLM-L6-v2
+batch_size: 32
+normalize_embeddings: true
+device: cuda
+```
+
+### Indexer Module
+
+Run `python .\atlas\core\indexer\run_indexer.py`
+
+In the above script, modify
+- `results_save_path` to specify where the index and metadata file will be saved
+- `embedded_chunks_json_file` to specify where the `embedded_chunks.json` is present
 
 ### Tests
 
-Run unit tests via VS Code or `python -m unittest` to run all unit tests
+Run unit tests via VS Code
+
+or
+
+Run only unit tests - `pytest -m unittest`
+
+Run only integration tests - `pytest -m integration`
+
+Run only tests that can be run on CI - `pytest -m runonci`
+
+Run ALL tests - `pytest`
+
+Note: Whenever a new pytest marker is added to a test, register it in `pytest.ini`; otherwise pytest will warn about an unknown marker (or fail if `--strict-markers` is set)
diff --git a/Resources/Atlas_Architecture.drawio b/Resources/Atlas_Architecture.drawio
@@ -80,8 +80,16 @@
         <mxCell id="TXnst_0s-ZQ6Iyvg5Zda-28" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-27" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-25">
           <mxGeometry relative="1" as="geometry" />
         </mxCell>
-        <mxCell id="TXnst_0s-ZQ6Iyvg5Zda-27" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=LightGreen;align=center;verticalAlign=middle;rounded=0;" value="Prompt" vertex="1">
-          <mxGeometry height="60" width="100" x="-75" y="310" as="geometry" />
+        <mxCell id="bvvUrWFopLgPZ60e--PS-1" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-27" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-20" value="">
+          <mxGeometry relative="1" as="geometry">
+            <Array as="points">
+              <mxPoint x="110" y="380" />
+              <mxPoint x="295" y="380" />
+            </Array>
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="TXnst_0s-ZQ6Iyvg5Zda-27" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=LightGreen;align=center;verticalAlign=middle;rounded=0;" value="Prompt/Query" vertex="1">
+          <mxGeometry height="40" width="100" x="60" y="320" as="geometry" />
         </mxCell>
         <mxCell id="TXnst_0s-ZQ6Iyvg5Zda-31" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-29" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-25">
           <mxGeometry relative="1" as="geometry" />
@@ -95,7 +103,7 @@
         <mxCell id="TXnst_0s-ZQ6Iyvg5Zda-34" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;base=71;size=35;position=0.33;position2=0.8;" value="This is done via an LLM" vertex="1">
           <mxGeometry height="80" width="150" x="-260" y="355" as="geometry" />
         </mxCell>
-        <mxCell id="ylUvblSnaKISxEn222B1-1" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;" value="Saved in a Vector DB" vertex="1">
+        <mxCell id="ylUvblSnaKISxEn222B1-1" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;" value="Saved in a Vector Index" vertex="1">
           <mxGeometry height="80" width="120" x="420" y="355" as="geometry" />
         </mxCell>
       </root>

diff --git a/Resources/Atlas_Architecture.png b/Resources/Atlas_Architecture.png
diff --git a/atlas/core/chunker/base_chunker.py b/atlas/core/chunker/base_chunker.py
@@ -1,6 +1,8 @@
 from abc import ABC
 from abc import abstractmethod
 from typing import List, Dict
+from pathlib import Path
+import json
 
 from atlas.utils.logger import LoggerConfig
 
@@ -10,15 +12,35 @@
 class BaseChunker(ABC):
     """
     Abstract base class for chunkers that split processed data into smaller "retrieval units.
+
+    Args:
+        processed_data_path (str): Path to the processed data file.
+        output_path (str): Path to save the chunked data.
     """
 
-    @abstractmethod
+    def __init__(self, processed_data_path: str, output_path: str) -> None:
+        LOGGER.info("-" * 20)
+        LOGGER.info(f"{type(self).__name__} initialized.")
+        LOGGER.info(f"Chunking processed data at {processed_data_path}")
+        self.processed_data_path = Path(processed_data_path)
+        self.output_path = Path(output_path)
+
     def read_processed_data(self) -> List[Dict] | None:
         """
         Read the processed data which is the output of the previous module
-        ie,`KnowledgeBaseProcessor`.
+        ie, `KnowledgeBaseProcessor`.
+
+        Returns:
+            List[Dict] | None: The processed data as a list of dictionaries or None if an error occurs.
         """
-        pass
+        try:
+            with open(self.processed_data_path, "r", encoding="utf-8") as file:
+                data = json.load(file)
+            LOGGER.info("Processed data successfully read.")
+            return data
+        except Exception as e:
+            LOGGER.error(f"Error reading processed data: {e}")
+            return None
 
     @abstractmethod
     def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:
@@ -33,15 +55,23 @@ def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:
         """
         pass
 
-    @abstractmethod
     def save_chunked_data(self, chunked_data: List[Dict]) -> None:
         """
-        Save the chunked data to a format suitable for later use.
+        Save the chunked data to the output path in JSON format.
+        This method writes to a temporary file first and then renames it to ensure atomicity.
+        This prevents data corruption in case of interruptions during the write process.
 
         Args:
-            chunked_data (list[dict]): The list of chunked data.
+            chunked_data (List[Dict]): The chunked data to be saved.
         """
-        pass
+        self.output_path.parent.mkdir(parents=True, exist_ok=True)
+        tmp_path = self.output_path.with_suffix(".tmp")
+
+        with tmp_path.open("w", encoding="utf-8") as f:
+            json.dump(chunked_data, f, indent=2, ensure_ascii=False)
+
+        tmp_path.replace(self.output_path)
+        LOGGER.info(f"Chunks saved successfully to {str(self.output_path)}")
 
     def chunk(self) -> None:
         """

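The write-then-rename pattern that `save_chunked_data` now uses in the base class can be sketched in isolation as below. This is a minimal illustration, not the project's code: `save_json_atomically` and the sample chunk are hypothetical names introduced here, and the atomicity guarantee assumes the temporary file and the target live on the same filesystem.

```python
import json
import tempfile
from pathlib import Path


def save_json_atomically(data, output_path: Path) -> None:
    """Hypothetical helper mirroring the temp-file-then-rename approach."""
    # Create missing parent folders so the write cannot fail on a new path.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # with_suffix swaps the extension: chunked_data.json -> chunked_data.tmp
    tmp_path = output_path.with_suffix(".tmp")

    with tmp_path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    # Path.replace overwrites the destination in a single rename step,
    # so readers see either the old file or the complete new one.
    tmp_path.replace(output_path)


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "nested" / "chunked_data.json"
    sample = [{"chunk_id": "note.md::Heading 1::0", "text": "lorem ipsum"}]
    save_json_atomically(sample, target)

    reloaded = json.loads(target.read_text(encoding="utf-8"))
    leftover_tmp = target.with_suffix(".tmp").exists()
```

Because `with_suffix(".tmp")` replaces the extension rather than appending to it, two outputs that share a stem in the same folder (for example `data.json` and `data.yaml`) would use the same temporary file name.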
diff --git a/atlas/core/chunker/structural_chunker.py b/atlas/core/chunker/structural_chunker.py
@@ -24,31 +24,9 @@ class StructuralChunker(BaseChunker):
     def __init__(
         self, processed_data_path: str, output_path: str, max_words: int
     ) -> None:
-        LOGGER.info("-" * 20)
-        LOGGER.info("StructuralChunker initialized.")
-        LOGGER.info(f"Chunking processed data at {processed_data_path}")
-        self.processed_data_path = Path(processed_data_path)
-        self.output_path = Path(output_path)
+        super().__init__(processed_data_path, output_path)
         self.max_words = max_words
 
-    def read_processed_data(self) -> List[Dict] | None:
-        """
-        Read the obsidian indexed data which is the output of the previous module
-        ie, `ObsidianVaultProcessor`.
-
-        Returns:
-            List[Dict] | None: The obsidian indexed data as a list of dictionaries or None if an error occurs.
-        """
-
-        try:
-            with open(self.processed_data_path, "r", encoding="utf-8") as file:
-                data = json.load(file)
-            LOGGER.info("Obsidian indexed data successfully read.")
-            return data
-        except Exception as e:
-            LOGGER.error(f"Error reading processed data: {e}")
-            return None
-
     def _split_by_word_limit(self, text: str, max_words: int) -> list[str]:
         """
         Split text into chunks based on a maximum word limit.
@@ -213,22 +191,6 @@ def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:
 
         return chunks
 
-    def save_chunked_data(self, chunked_data: List[Dict]) -> None:
-        """
-        Save the chunked data to the output path in JSON format.
-        This method writes to a temporary file first and then renames it to ensure atomicity.
-        This prevents data corruption in case of interruptions during the write process.
-
-        Args:
-            chunked_data (List[Dict]): The chunked data to be saved.
-        """
-        tmp_path = self.output_path.with_suffix(".tmp")
-
-        with tmp_path.open("w", encoding="utf-8") as f:
-            json.dump(chunked_data, f, indent=2, ensure_ascii=False)
-
-        tmp_path.replace(self.output_path)
-
 
 if __name__ == "__main__":
     processed_data_path = r"D:\\Deep learning\\Atlas\\Resources\\obsidian_index.json"

diff --git a/atlas/core/configs/sentence_transformer_config.yaml b/atlas/core/configs/sentence_transformer_config.yaml
@@ -0,0 +1,4 @@
+model_name: sentence-transformers/all-MiniLM-L6-v2
+batch_size: 32
+normalize_embeddings: true
+device: cuda
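As a rough illustration of why `normalize_embeddings: true` matters for retrieval: once vectors are scaled to unit length, the plain dot product equals cosine similarity. The sketch below uses only the standard library and toy 2-dimensional vectors; the helper names are hypothetical and it does not call the encoder itself.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# With unit-length vectors the two measures agree.
similarity_via_dot = dot(a_unit, b_unit)
similarity_via_cosine = cosine(a, b)
```

This is why an inner-product index can serve cosine-style search when the stored embeddings are normalized.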
diff --git a/atlas/core/embedder/README.md b/atlas/core/embedder/README.md
@@ -0,0 +1,61 @@
+## Embedder Module
+
+LLMs don't operate on raw text directly. Hence, the text needs to be converted to a numeric representation, more specifically a vector called an embedding. This is a dense numeric representation in a comparatively low-dimensional space. Two vectors close to each other in this space represent two texts that are semantically close to each other.
+
+### Encoder Model Choice
+
+`sentence-transformers/all-MiniLM-L6-v2` from [Sentence Transformers](https://www.sbert.net/) was chosen because it
+- is fast and lightweight (super important for latency)
+- provides really good [semantic search](https://www.sbert.net/examples/sentence_transformer/applications/semantic-search/README.html#background) performance
+
+
+The encoder model is ultimately used for semantic search.
+
+#### What is Semantic Search?
+
+1. Take chunks → embed into vector space
+2. Take query → embed into same space
+3. Find nearest neighbors (cosine / dot / L2)
+4. Return top-k chunks
+
+#### Why not use TinyLlama's encoder?
+
+- There are three types of Transformer models
+    - Encoder only models
+        - eg, BERT, RoBERTa, MiniLM
+    - Decoder only models
+        - eg, Llama, TinyLlama, GPT-2
+        - they don't have an explicit encoder in their architecture, but they do encode text internally
+    - Encoder - Decoder models
+        - eg, BART, T5, FLAN
+
+- TinyLlama, being a decoder only model, is specifically trained for next token prediction (encoding still happens, but it's not the main focus and the model does not have an encoder in the architectural sense).
+- Encoder only models, by contrast, are specifically trained to generate embeddings and to support the downstream uses of embeddings (like retrieval and semantic search).
+
+#### Structure of embedding chunks json
+
+```json
+[
+  {
+    "chunk_id": "folder/sample note.md::Heading 1::0",
+    "note_id": "folder/sample note.md",
+    "title": "sample note",
+    "relative_path": "folder/sample note.md",
+    "heading": "Heading 1",
+    "chunk_index": 0,
+    "text": "lorem ipsum",
+    "word_count": 2,
+    "tags": [],
+    "frontmatter": {},
+    "embedding": [
+      0.017203988507390022,
+      0.06233978644013405,
+      -0.011157829314470291,
+      -0.012113398872315884,
+      ...
+    ]
+  },
+  ...
+]
+```
+- This is the same as the json output of the chunker module with the added `embedding` key, which holds the vector representation of the `text` as produced by the chosen encoder model.
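The semantic search steps listed in this README can be sketched over records shaped like `embedded_chunks.json`. This is a toy, standard-library illustration under stated assumptions: the 3-dimensional vectors, the chunk ids, and the `top_k` helper are made up for the example, and a real run would take the embeddings from the encoder and the neighbours from the vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, embedded_chunks, k=2):
    """Brute-force nearest neighbours: score every chunk, keep the best k."""
    scored = [
        (cosine_similarity(query_embedding, chunk["embedding"]), chunk["chunk_id"])
        for chunk in embedded_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for entries of embedded_chunks.json (hypothetical values).
chunks = [
    {"chunk_id": "a.md::Intro::0", "embedding": [1.0, 0.0, 0.0]},
    {"chunk_id": "b.md::Intro::0", "embedding": [0.0, 1.0, 0.0]},
    {"chunk_id": "c.md::Intro::0", "embedding": [0.9, 0.1, 0.0]},
]

# The query must be embedded into the same space as the chunks.
query_embedding = [1.0, 0.0, 0.0]

results = top_k(query_embedding, chunks, k=2)
```

Scoring every chunk is linear in the number of chunks, which is fine for a sketch; a vector index exists to make the same lookup fast on a large vault.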
diff --git a/atlas/core/ingest/__init__.py → atlas/core/embedder/__init__.py b/atlas/core/ingest/__init__.py → atlas/core/embedder/__init__.py
diff --git a/atlas/core/embedder/base/__init__.py b/atlas/core/embedder/base/__init__.py