Merged
21 commits
d573a93
[FEAT] Add sentence-transformers dependency to env yaml file
DivyenduDutta Dec 30, 2025
a35ddb6
[DOC] Add information about chunker and embedding modules
DivyenduDutta Dec 30, 2025
e372e8c
[CHORE] Modify package path for all files in the Ingest module
DivyenduDutta Dec 30, 2025
799ff89
[FEAT] Add encoder config file and associated code
DivyenduDutta Dec 30, 2025
38c2dd9
[CHORE] Refactor abstract base class and impl class for chunker module
DivyenduDutta Dec 30, 2025
10b6531
[DOC] Add README for embedding module
DivyenduDutta Dec 30, 2025
06bf513
[FEAT] Add jupyter notebook to explore encoder model
DivyenduDutta Dec 30, 2025
cb3ab75
[FEAT] Implement base and sentence transformer encoder wrapper classes
DivyenduDutta Dec 30, 2025
3ed1651
[FEAT] Implement BaseEmbedder and SentenceTransformerEmbedder classes
DivyenduDutta Dec 30, 2025
769f14b
[TEST] Add unit tests for embedding module and modify existing unit t…
DivyenduDutta Dec 30, 2025
fe15b16
[FIX] Add check for pytorch before setting device
DivyenduDutta Dec 30, 2025
0d0a436
[CHORE] Update architecture diagram
DivyenduDutta Jan 3, 2026
55c1407
[CHORE] Add FAISS index file to gitignore
DivyenduDutta Jan 3, 2026
cf2ef10
[FEAT] Add FAISS dependency
DivyenduDutta Jan 3, 2026
21b92bb
[CHORE] Minor changes
DivyenduDutta Jan 3, 2026
4a0bf03
[DOC] Update main readme and add readme for indexer module
DivyenduDutta Jan 3, 2026
adc16ad
[FEAT] Implement base and FAISS vector store classes for vector indexing
DivyenduDutta Jan 3, 2026
6fb63ce
[FEAT] Add script for embedding and vector indexing
DivyenduDutta Jan 3, 2026
a43873f
[FEAT] Add utility functions for loading embedded chunks and generati…
DivyenduDutta Jan 3, 2026
982dc34
[FEAT]] Add unit tests for FAISS vector store and embedder utility fu…
DivyenduDutta Jan 3, 2026
8f32ad1
[DOC] Fix architecture diagram path
DivyenduDutta Jan 3, 2026
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -215,3 +215,4 @@ logs/

# Resources not needed to checkin
Resources/*.json
Resources/*.faiss
60 changes: 54 additions & 6 deletions README.md
@@ -39,15 +39,17 @@ RAG or Retrieval Augmented Generation is a technique used to retrieve external k

RAG is good because:
- Reduces hallucinations

- Enables citations

- Keeps answers faithful to source material

#### Chunking

[Chunker Module](atlas/core/chunker/README.md)

#### Embedding and Indexing

[Embedding Module](atlas/core/embedder/README.md)

#### Obsidian

[Obsidian](https://obsidian.md/) is a lightweight application used to take notes and create knowledge bases. It saves all notes as Markdown, making it easy to load, process and render a large number of notes.
@@ -76,7 +78,7 @@ So it follows the scaling law that even a small LLM when trained on enough quali

## Architecture

Initial high level [architecture diagram](https://github.com/DivyenduDutta/Atlas/tree/master/Resources/Atlas_Architecture.png)
High level [architecture diagram](Resources/Atlas_Architecture.png)

A sample of the `obsidian_index.json` is as below:

@@ -120,10 +122,56 @@ Before committing changes run `pre-commit run --all-files` or `pre-commit run --

Run `python .\atlas\core\ingest\obsidian_vault_processor.py`

This will generate the `obsidian_index.json` in `/Resources` folder. This json file contains the processed data after ingesting and processing the notes from the obsidian vault.
In the above script, modify
- `obsidian_vault_path` to point to your Obsidian vault's root folder, i.e., the folder containing the `.obsidian` folder
- `obsidian_index_path` to specify where the `obsidian_index.json` will be saved. This json file contains the processed data after ingesting and processing the notes from the obsidian vault. See [architecture](#architecture) section for the structure of this json.

### Structural Chunker Module

Run `python .\atlas\core\chunker\structural_chunker.py`

In the above script, modify
- `processed_data_path` to specify where the `obsidian_index.json` is present
- `output_path` to specify where the `chunked_data.json` will be saved. This json file contains the chunks generated from the notes processed by the "Obsidian Vault Processor" module. See [`README` in `atlas/core/chunker`](atlas/core/chunker/README.md) for structure of this json.
- `max_words` to set the maximum number of words per chunk. Choose this primarily based on the token limit of the encoding model and the context size of the LLM used in the later modules.
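
As a rough illustration of what a word limit means for chunk size, here is a minimal sketch of splitting text on whitespace into pieces of at most `max_words` words. The function name `split_by_word_limit` is hypothetical and this is not necessarily how `StructuralChunker` implements it (the real chunker also takes headings into account).

```python
def split_by_word_limit(text: str, max_words: int) -> list[str]:
    """Split text into pieces of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be a positive integer")
    words = text.split()
    # Step through the word list in windows of size max_words.
    return [
        " ".join(words[start : start + max_words])
        for start in range(0, len(words), max_words)
    ]


pieces = split_by_word_limit("one two three four five", max_words=2)
print(pieces)  # ['one two', 'three four', 'five']
```

Note that words are only a proxy for tokens: an encoder's tokenizer usually produces more tokens than words, so it may be safer to leave some headroom below the model's token limit.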

### Embedder Module

Run `python .\atlas\core\embedder\sentence_transformer\impl_embedder.py`

In the above script, modify
- `chunk_data_path` to specify where the `chunked_data.json` is present
- `output_path` to specify where `embedded_chunks.json` will be saved. This json is identical to `chunked_data.json` except for the added `embedding` for each chunk. See [`README` in `atlas/core/embedder`](atlas/core/embedder/README.md) for structure of this json.
- `encoder_config_path` to specify your own configuration settings for the encoder model used to generate the chunk embeddings. By default, see [`atlas/core/configs/sentence_transformer_config.yaml`](atlas/core/configs/sentence_transformer_config.yaml) for changing the encoder model used and its configuration. The following can be changed:

See architecture section for structure of this json.
```yaml
model_name: sentence-transformers/all-MiniLM-L6-v2
batch_size: 32
normalize_embeddings: true
device: cuda
```
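
The config above requests `device: cuda`, which is only usable when PyTorch is installed with CUDA support. Below is a minimal sketch of how a requested device could be resolved with a fallback to CPU. The helper `resolve_device` is hypothetical and only illustrates the idea behind the "check for pytorch before setting device" commit; the actual check in this PR is not shown on this page. The config is written as a plain dict here to keep the sketch free of a YAML dependency.

```python
import importlib.util


def resolve_device(requested: str) -> str:
    """Return the requested device if it looks usable, otherwise fall back to 'cpu'."""
    if requested == "cpu":
        return "cpu"
    # Only import torch when it is actually installed.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if requested.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return requested


# Values mirror sentence_transformer_config.yaml.
config = {
    "model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "batch_size": 32,
    "normalize_embeddings": True,
    "device": "cuda",
}
print(resolve_device(config["device"]))
```

On a machine without a CUDA-capable PyTorch install this prints `cpu`, so the same config file can be used on both GPU and CPU-only machines.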

### Indexer Module

Run `python .\atlas\core\indexer\run_indexer.py`

In the above script, modify
- `results_save_path` to specify where the index and metadata file will be saved
- `embedded_chunks_json_file` to specify where the `embedded_chunks.json` is present

### Tests

Run unit tests via VS Code or `python -m unittest` to run all unit tests
Run unit tests via VS Code

or

Run only unit tests - `pytest -m unittest`

Run only integration tests - `pytest -m integration`

Run only tests that can be run on CI - `pytest -m runonci`

Run ALL tests - `pytest`

Note: Any time a pytest marker is added to a test, ensure it is registered in `pytest.ini`, otherwise pytest will complain
14 changes: 11 additions & 3 deletions Resources/Atlas_Architecture.drawio
@@ -80,8 +80,16 @@
<mxCell id="TXnst_0s-ZQ6Iyvg5Zda-28" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-27" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-25">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="TXnst_0s-ZQ6Iyvg5Zda-27" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=LightGreen;align=center;verticalAlign=middle;rounded=0;" value="Prompt" vertex="1">
<mxGeometry height="60" width="100" x="-75" y="310" as="geometry" />
<mxCell id="bvvUrWFopLgPZ60e--PS-1" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-27" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-20" value="">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="110" y="380" />
<mxPoint x="295" y="380" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="TXnst_0s-ZQ6Iyvg5Zda-27" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=LightGreen;align=center;verticalAlign=middle;rounded=0;" value="Prompt/Query" vertex="1">
<mxGeometry height="40" width="100" x="60" y="320" as="geometry" />
</mxCell>
<mxCell id="TXnst_0s-ZQ6Iyvg5Zda-31" edge="1" parent="1" source="TXnst_0s-ZQ6Iyvg5Zda-29" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0.5;entryDx=0;entryDy=0;" target="TXnst_0s-ZQ6Iyvg5Zda-25">
<mxGeometry relative="1" as="geometry" />
@@ -95,7 +103,7 @@
<mxCell id="TXnst_0s-ZQ6Iyvg5Zda-34" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;base=71;size=35;position=0.33;position2=0.8;" value="This is done via an LLM" vertex="1">
<mxGeometry height="80" width="150" x="-260" y="355" as="geometry" />
</mxCell>
<mxCell id="ylUvblSnaKISxEn222B1-1" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;" value="Saved in a Vector DB" vertex="1">
<mxCell id="ylUvblSnaKISxEn222B1-1" parent="1" style="shape=callout;whiteSpace=wrap;html=1;perimeter=calloutPerimeter;" value="Saved in a Vector Index" vertex="1">
<mxGeometry height="80" width="120" x="420" y="355" as="geometry" />
</mxCell>
</root>
Binary file modified Resources/Atlas_Architecture.png
44 changes: 37 additions & 7 deletions atlas/core/chunker/base_chunker.py
@@ -1,6 +1,8 @@
from abc import ABC
from abc import abstractmethod
from typing import List, Dict
from pathlib import Path
import json

from atlas.utils.logger import LoggerConfig

@@ -10,15 +12,35 @@
class BaseChunker(ABC):
"""
Abstract base class for chunkers that split processed data into smaller "retrieval units".

Args:
processed_data_path (str): Path to the processed data file.
output_path (str): Path to save the chunked data.
"""

@abstractmethod
def __init__(self, processed_data_path: str, output_path: str) -> None:
LOGGER.info("-" * 20)
LOGGER.info("StructuralChunker initialized.")
LOGGER.info(f"Chunking processed data at {processed_data_path}")
self.processed_data_path = Path(processed_data_path)
self.output_path = Path(output_path)

def read_processed_data(self) -> List[Dict] | None:
"""
Read the processed data which is the output of the previous module
ie,`KnowledgeBaseProcessor`.
ie, `KnowledgeBaseProcessor`.

Returns:
List[Dict] | None: The processed data as a list of dictionaries or None if an error occurs.
"""
pass
try:
with open(self.processed_data_path, "r", encoding="utf-8") as file:
data = json.load(file)
LOGGER.info("Processed data successfully read.")
return data
except Exception as e:
LOGGER.error(f"Error reading processed data: {e}")
return None

@abstractmethod
def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:
@@ -33,15 +55,23 @@ def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:
"""
pass

@abstractmethod
def save_chunked_data(self, chunked_data: List[Dict]) -> None:
"""
Save the chunked data to a format suitable for later use.
Save the chunked data to the output path in JSON format.
This method writes to a temporary file first and then renames it to ensure atomicity.
This prevents data corruption in case of interruptions during the write process.

Args:
chunked_data (list[dict]): The list of chunked data.
chunked_data (List[Dict]): The chunked data to be saved.
"""
pass
self.output_path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = self.output_path.with_suffix(".tmp")

with tmp_path.open("w", encoding="utf-8") as f:
json.dump(chunked_data, f, indent=2, ensure_ascii=False)

tmp_path.replace(self.output_path)
LOGGER.info(f"Chunks saved successfully to {str(self.output_path)}")

def chunk(self) -> None:
"""
40 changes: 1 addition & 39 deletions atlas/core/chunker/structural_chunker.py
@@ -24,31 +24,9 @@ class StructuralChunker(BaseChunker):
def __init__(
self, processed_data_path: str, output_path: str, max_words: int
) -> None:
LOGGER.info("-" * 20)
LOGGER.info("StructuralChunker initialized.")
LOGGER.info(f"Chunking processed data at {processed_data_path}")
self.processed_data_path = Path(processed_data_path)
self.output_path = Path(output_path)
super().__init__(processed_data_path, output_path)
self.max_words = max_words

def read_processed_data(self) -> List[Dict] | None:
"""
Read the obsidian indexed data which is the output of the previous module
ie, `ObsidianVaultProcessor`.

Returns:
List[Dict] | None: The obsidian indexed data as a list of dictionaries or None if an error occurs.
"""

try:
with open(self.processed_data_path, "r", encoding="utf-8") as file:
data = json.load(file)
LOGGER.info("Obsidian indexed data successfully read.")
return data
except Exception as e:
LOGGER.error(f"Error reading processed data: {e}")
return None

def _split_by_word_limit(self, text: str, max_words: int) -> list[str]:
"""
Split text into chunks based on a maximum word limit.
@@ -213,22 +191,6 @@ def create_chunks(self, processed_data: List[Dict]) -> List[Dict]:

return chunks

def save_chunked_data(self, chunked_data: List[Dict]) -> None:
"""
Save the chunked data to the output path in JSON format.
This method writes to a temporary file first and then renames it to ensure atomicity.
This prevents data corruption in case of interruptions during the write process.

Args:
chunked_data (List[Dict]): The chunked data to be saved.
"""
tmp_path = self.output_path.with_suffix(".tmp")

with tmp_path.open("w", encoding="utf-8") as f:
json.dump(chunked_data, f, indent=2, ensure_ascii=False)

tmp_path.replace(self.output_path)


if __name__ == "__main__":
processed_data_path = r"D:\\Deep learning\\Atlas\\Resources\\obsidian_index.json"
4 changes: 4 additions & 0 deletions atlas/core/configs/sentence_transformer_config.yaml
@@ -0,0 +1,4 @@
model_name: sentence-transformers/all-MiniLM-L6-v2
batch_size: 32
normalize_embeddings: true
device: cuda
61 changes: 61 additions & 0 deletions atlas/core/embedder/README.md
@@ -0,0 +1,61 @@
## Embedder Module

LLMs don't operate on raw text directly. The text first needs to be converted to a numeric representation, more specifically a vector called an embedding. An embedding is a point in a relatively low-dimensional vector space, and two vectors that are close to each other in this space represent two texts that are close to each other semantically.

### Encoder Model Choice

`sentence-transformers/all-MiniLM-L6-v2` from [Sentence Transformers](https://www.sbert.net/) was chosen because it
- is fast and lightweight (important for latency)
- provides good [semantic search](https://www.sbert.net/examples/sentence_transformer/applications/semantic-search/README.html#background) performance


The encoder model is ultimately used for semantic search.

#### What is Semantic Search?

1. Take chunks → embed into vector space
2. Take query → embed into same space
3. Find nearest neighbors (cosine / dot / L2)
4. Return top-k chunks
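
The four steps above can be sketched in plain Python as follows. This is an illustration only: the tiny 2-dimensional vectors stand in for real encoder outputs, and the helper names `cosine_similarity` and `top_k` are hypothetical. In this project the nearest-neighbor step is handled by a FAISS index rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the k best (id, score) pairs."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: chunk "a" points almost the same way as the query.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
results = top_k([1.0, 0.1], chunks, k=2)
print([chunk_id for chunk_id, _ in results])  # ['a', 'c']
```

When embeddings are normalized to unit length (as with `normalize_embeddings: true`), cosine similarity and dot product give the same ranking.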

#### Why not use TinyLlama's encoder?

- There are three types of Transformer models
  - Encoder-only models
    - e.g., BERT, RoBERTa, MiniLM
  - Decoder-only models
    - e.g., Llama, TinyLlama, GPT-2
    - they don't have an explicit encoder in their architecture, but they do encode text internally
  - Encoder-decoder models
    - e.g., BART, T5, FLAN-T5

- TinyLlama, being a decoder-only model, is trained specifically for next-token prediction (it still encodes text internally, but that is not its main focus, and it has no encoder in the architectural sense).
- Encoder-only sentence-embedding models, by contrast, are trained specifically to generate embeddings for downstream uses such as retrieval and semantic search.

#### Structure of embedding chunks json

```json
[
  {
    "chunk_id": "folder/sample note.md::Heading 1::0",
    "note_id": "folder/sample note.md",
    "title": "sample note",
    "relative_path": "folder/sample note.md",
    "heading": "Heading 1",
    "chunk_index": 0,
    "text": "lorem ipsum",
    "word_count": 2,
    "tags": [],
    "frontmatter": {},
    "embedding": [
      0.017203988507390022,
      0.06233978644013405,
      -0.011157829314470291,
      -0.012113398872315884,
      ...
    ]
  },
  ...
]
```
- This is the same as the json output of the chunker module, with the added `embedding` key. It holds the vector representation of the `text` as produced by the chosen encoder model.
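
Before indexing, it can be useful to verify that every chunk in this json carries an embedding of the same length. The following is a minimal sketch under that assumption; `check_embedded_chunks` is a hypothetical helper, not part of the module.

```python
import json


def check_embedded_chunks(chunks: list[dict]) -> int:
    """Return the shared embedding dimension, raising ValueError on inconsistencies."""
    if not chunks:
        raise ValueError("no chunks to check")
    dim = None
    for chunk in chunks:
        embedding = chunk.get("embedding")
        if not isinstance(embedding, list) or not embedding:
            raise ValueError(f"chunk {chunk.get('chunk_id')!r} has no embedding")
        if dim is None:
            dim = len(embedding)
        elif len(embedding) != dim:
            raise ValueError(
                f"chunk {chunk.get('chunk_id')!r} has dimension "
                f"{len(embedding)}, expected {dim}"
            )
    return dim


# A shortened stand-in for embedded_chunks.json (real embeddings are much longer).
sample = json.loads(
    '[{"chunk_id": "folder/sample note.md::Heading 1::0",'
    ' "embedding": [0.1, 0.2, 0.3]}]'
)
print(check_embedded_chunks(sample))  # 3
```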
File renamed without changes.
Empty file.