Requires Python ≥ 3.14. Dependencies are managed with uv
(locked in uv.lock).
uv syncThis project can work with any OpenAI-compatible API.
- Modify
bss/config.pyto adjust relevant fields (such asopenai_base_urlandllm_model) as per your requirements. - Write your API key into a
.envfile in the project root:
echo 'OPENAI_API_KEY=sk-or-...' > .envThe embedder uses BAAI/bge-small-en-v1.5. Before first use, pull it into your local HuggingFace cache:
uv run python -m bss download# One-time: download the embedding model into cache
uv run python -m bss download
# Index a file or a directory (recursively).
# Supports pdf / txt / csv
uv run python -m bss ingest docs
# One-shot question (no memory)
uv run python -m bss query "What was the sample size in the second study?"
# One-shot, with the agentic tool-calling loop
uv run python -m bss query --agent "List the documents you have indexed."
# Interactive REPL with conversation memory
uv run python -m bss chat
# REPL + agent loop
uv run python -m bss chat --agent--index-dir <path> overrides the default storage/ for any subcommand.
All answers are followed by a list of every
source (locator) score=... the model was shown. If retrieval finds nothing above the
similarity threshold (and recovery also fails in agent mode), the system will refuse to answer.
-
During Retrieval: Only the top-k entries with a cosine score of at least 0.30 are given to the model as context. If nothing clears the threshold, the non-agent path immediately refuses to answer without ever calling the LLM.
-
Strict Prompt: The system prompt forbids outside knowledge, forbids extrapolating or combining facts into new claims, requires an inline citation on every claim, and mandates the exact refusal string
I don't know based on the provided documents.when the question is unsupported. Openers such as "Based on the documents…", "According to…" are explicitly forbidden so that attribution is done by citations. -
Deterministic Decoding:
temperature=0.0on every LLM call.
| Module | Responsibility |
|---|---|
config.py |
Contains all customizable parameters (chunk size, top-k, threshold, models, paths) |
loaders.py |
Converts files to RecordsPDF => 1 record/page CSV => 1 record/row Text => 1 record/file |
chunker.py |
Fixed-size sliding window over each record's text with attached data about the source |
embedder.py |
wrapper for SentenceTransformer |
vectorstore.py |
FaissStore around IndexFlatIP |
retriever.py |
Embeds the query, retrieves the k nearest chunks from the FAISS index, then drop hits below specific threshold. |
prompt.py |
Contains the system prompt and a function that builds the complete prompt by prepending retrieval results |
llm.py |
OpenAI-compatible client |
memory.py |
Maintains a rolling window of recent turns and provides hints for retrieval |
tools.py |
Contains the four agent tools and handles their dispatch |
pipeline.py |
Orchestration: ingest (load → chunk → embed → index), answer (non-agent), and agent_answer (tool loop) |
__main__.py |
Parses CLI options and handles the user interface |
Memory keeps a fixed number of recent turns. The number of turns can be adjusted using CFG.memory_window_turns.
history_messages() is injected between the system and user messages so the LLM can resolve references to previous messages.
retrieval_hint() appends the last question and a snippet of the last LLM response to the embedding query, so that
follow-up questions can be answered by retrieval even if they lack standalone context.
agent_answer runs the same initial steps as answer (load, embed, initial retrieve,
build messages with memory hint), then enters a tool-calling loop
(with maximum number of steps determined by CFG.agent_max_steps).
| Function | Parameters | Purpose |
|---|---|---|
list_sources |
- | Returns a sorted list of distinct sources in index. |
retrieve |
query, source=None |
Embed-and-search. If source is None, returns results from all sources. |
keyword_search |
pattern, source=None |
Case-insensitive substring scan over chunk text. |
get_neighbors |
source, locator, before=1, after=1 |
Retrieves chunks adjacent to a (source, locator) within the same base locator (same page for PDFs, row for CSVs, file for TXT) - adjacency never crosses page/record boundaries. before and after are capped at 3 to avoid context from blowing up. |
The tool-calling loop ends when the LLM responds with no tool calls (that response is the final answer) or when it has run CFG.agent_max_steps iterations. After the loop:
- If a final answer was produced, it is returned.
- Otherwise, if any context was gathered (initial retrieval or tool calls), the LLM is called once more without tools to force a final answer over everything collected.
- If no context was gathered at all, the system returns the exact refusal string.
AI was used to assist with debugging and generating system prompts.