A production-style RAG (Retrieval-Augmented Generation) application that answers natural language questions about Indian Large Cap mutual funds using real AMFI data, semantic search, and LLaMA 3 via Groq.
Ask questions like "Which large cap fund gave the best 3-year returns?" or "Compare SBI and HDFC large cap funds" — and get structured, data-backed answers instantly.
```
AMFI API ───────────────┐
(Daily NAV)             ▼
                 fetch_schemes.py ──► scheme_list.csv

mfapi.in ───────────────┐
(5Y Historical NAV)     ▼
                 fetch_nav.py ──────► nav_history.parquet
                        │
                        ▼
                 compute_metrics.py ──► fund_metrics.csv
                 (CAGR · Sharpe · Sortino)
                        │
                        ▼
                 build_chunks.py ─────► chunks.jsonl
                 (1 text doc per fund)
                        │
                        ▼
                 embed_chunks.py ─────► ChromaDB (local)
                 (HuggingFace all-MiniLM-L6-v2)
                        │
        User Query ─────┤
                        ▼
                 retriever.py ────────► Top-5 similar funds
                        │
                        ▼
                 llm.py ─────────────► Structured Answer
                 (LLaMA 3.3 via Groq)
                        │
                        ▼
                 app.py (Streamlit UI)
```
- Real financial data — pulls live NAV from AMFI and 5 years of history from mfapi.in
- Quantitative metrics — computes CAGR (1Y/3Y/5Y), Sharpe Ratio, and Sortino Ratio per fund
- Semantic search — HuggingFace embeddings find relevant funds even for vague queries
- Structured answers — LLM outputs comparison tables, not paragraphs
- Fast inference — Groq free tier delivers answers in ~2 seconds
- Fully local vector DB — ChromaDB stores embeddings on disk, no external service needed
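The quantitative metrics above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas, not the project's actual `compute_metrics.py`; the function names are hypothetical, and the 6.5% annualised risk-free rate follows the assumption stated under the pipeline description.

```python
import numpy as np

RISK_FREE_DAILY = 0.065 / 252  # 6.5% annualised risk-free rate, per trading day

def cagr(start_nav: float, end_nav: float, years: float) -> float:
    """Compound annual growth rate from two NAV observations."""
    return (end_nav / start_nav) ** (1 / years) - 1

def sharpe(daily_returns: np.ndarray) -> float:
    """Annualised Sharpe ratio of daily excess returns."""
    excess = daily_returns - RISK_FREE_DAILY
    return np.sqrt(252) * excess.mean() / excess.std()

def sortino(daily_returns: np.ndarray) -> float:
    """Annualised Sortino ratio: penalises downside deviation only."""
    excess = daily_returns - RISK_FREE_DAILY
    downside = excess[excess < 0]
    return np.sqrt(252) * excess.mean() / downside.std()
```

For example, `cagr(100.0, 200.0, 5)` gives roughly 0.1487, i.e. a fund that doubles in five years compounds at about 14.87% per year.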
| Layer | Technology |
|---|---|
| Data ingestion | Python · Requests · Pandas |
| Metrics computation | NumPy · Pandas |
| Embeddings | HuggingFace all-MiniLM-L6-v2 |
| Vector store | ChromaDB (local, persistent) |
| LLM | LLaMA 3.3 70B via Groq API (free) |
| UI | Streamlit |
| Storage format | Parquet · JSONL · CSV |
```
mf-assistant/
├── data/
│   ├── fetch_schemes.py      # Downloads AMFI fund list
│   ├── fetch_nav.py          # Downloads 5Y NAV history from mfapi.in
│   ├── scheme_list.csv       # ~36 Large Cap funds
│   ├── nav_history.parquet   # ~32,000 rows of daily NAV
│   ├── fund_metrics.csv      # Computed CAGR, Sharpe, Sortino
│   ├── chunks.jsonl          # One text document per fund
│   └── chroma_db/            # Local vector store (auto-generated)
├── processing/
│   ├── compute_metrics.py    # Financial metric calculations
│   └── build_chunks.py       # Converts metrics to RAG-ready text
├── rag/
│   ├── embed_chunks.py       # Embeds chunks → stores in ChromaDB
│   ├── retriever.py          # Semantic search over ChromaDB
│   └── llm.py                # Groq LLM call + prompt engineering
├── app.py                    # Streamlit UI
├── requirements.txt
├── .env.example
└── .gitignore
```
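`build_chunks.py` turns each fund's metrics row into the natural-language document the LLM reads. A hypothetical sketch of what one chunk might look like; the field names and wording here are illustrative, not the project's actual schema:

```python
def build_chunk(fund: dict) -> str:
    """Render one fund's metrics as a retrieval-friendly text document."""
    return (
        f"{fund['name']} is a Large Cap mutual fund. "
        f"1-year return: {fund['cagr_1y']:.2%}. "
        f"3-year CAGR: {fund['cagr_3y']:.2%}. "
        f"5-year CAGR: {fund['cagr_5y']:.2%}. "
        f"Sharpe ratio: {fund['sharpe']:.3f}. "
        f"Sortino ratio: {fund['sortino']:.3f}."
    )

doc = build_chunk({
    "name": "DSP Large Cap Fund",
    "cagr_1y": 0.034, "cagr_3y": 0.1591, "cagr_5y": 0.137,
    "sharpe": 0.821, "sortino": 1.130,
})
```

Writing metrics out as prose like this is what lets a vague query ("safest large cap fund") match on semantics rather than exact column names.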
```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/mf-assistant.git
cd mf-assistant

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate     # Mac/Linux
venv\Scripts\activate        # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure your Groq API key
cp .env.example .env
# Edit .env and add your key
# Get a free key at https://console.groq.com

# 5. Run the data pipeline
python data/fetch_schemes.py            # ~5 seconds
python data/fetch_nav.py                # ~3-5 minutes
python processing/compute_metrics.py
python processing/build_chunks.py
python rag/embed_chunks.py              # Downloads model on first run (~90MB)

# 6. Launch the app
streamlit run app.py
```

Open http://localhost:8501 in your browser.
Query: "Which large cap fund gave the best 3 year returns?"
| Fund Name | 1Y Return | 3Y Return | 5Y Return | Sharpe | Sortino | Assessment |
|---|---|---|---|---|---|---|
| BANDHAN Large Cap Fund | 2.1% | 16.28% | 14.9% | 0.767 | 1.023 | Strong performer |
| DSP Large Cap Fund | 3.4% | 15.91% | 13.7% | 0.821 | 1.130 | Strong performer |
| Kotak Large Cap Fund | 1.8% | 14.21% | 12.8% | 0.743 | 0.991 | Good performer |
Key Takeaway: DSP Large Cap Fund edges out on risk-adjusted returns (higher Sharpe + Sortino), while BANDHAN leads on raw 3Y CAGR. For risk-conscious investors, DSP is the stronger pick.
⚠️ Past performance does not guarantee future returns. This is not financial advice.
1. **Data collection** — AMFI provides the master list of all mutual fund schemes; mfapi.in provides 5 years of historical daily NAV prices.
2. **Metric computation** — For each fund, CAGR is computed as `(End NAV / Start NAV)^(1/years) - 1`. Sharpe and Sortino ratios are computed from daily returns against a 6.5% annualised risk-free rate (an approximation of the Indian 10Y government bond yield).
3. **Chunk building** — Each fund's metrics are converted into a structured natural-language document. This is what the LLM reads — not raw numbers.
4. **Embedding** — Each document is embedded with `sentence-transformers/all-MiniLM-L6-v2`, a 384-dimensional model optimised for semantic similarity, and stored in ChromaDB with cosine similarity.
5. **Retrieval** — The user query is embedded with the same model, and the top-5 most similar fund documents are retrieved from ChromaDB.
6. **Generation** — The retrieved chunks plus the user query are sent to LLaMA 3.3 70B (via Groq) with a strict prompt that enforces table-based structured output.
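At their core, the embedding and retrieval steps reduce to cosine similarity between the query vector and each fund document's vector. A toy NumPy illustration of the idea (in the real pipeline, encoding is done by the MiniLM model and the search by ChromaDB, not by this hypothetical helper):

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k documents most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = d @ q                   # cosine similarity per document
    return np.argsort(-sims)[:k]   # highest similarity first
```

Because cosine similarity ignores vector magnitude, a query embedding that points in the same direction as a fund document's embedding ranks first regardless of scale.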
```bash
# .env.example
GROQ_API_KEY=your_groq_api_key_here
```

- Currently covers Large Cap equity funds only (~36 funds)
- Data freshness depends on when you last ran `fetch_schemes.py` and `fetch_nav.py`
- Hindi/multilingual queries work, but retrieval quality is lower (the embedding model is English-first)
- Not financial advice — for educational and portfolio demonstration purposes only
- Add Mid Cap and Small Cap fund categories
- Add fund AUM and expense ratio to chunks
- Add benchmark comparison (Nifty 50 vs fund returns)
- Deploy on Streamlit Cloud
- Add date-aware queries ("best fund of 2023")
MIT License — free to use, modify, and distribute.
