This AI agent uses a Retrieval-Augmented Generation (RAG) system to answer student questions about R programming and data science, drawing on content from the Info201 textbook at the University of Washington.
- Web Scraping & Content Extraction: Automatically crawls and extracts structured content from Info201 book chapters (https://faculty.washington.edu/otoomet/info201-book/)
- Intelligent Chunking: Splits content into 300-600 character chunks with sentence boundary awareness
- Semantic Search: Uses Ollama embeddings (nomic-embed-text) for content retrieval
- Context-Aware Responses: Generates answers with source citations, section references, and code examples
- Simple CLI: Easy-to-use command-line interface for querying the system
- Persistent Storage: ChromaDB for efficient vector storage and retrieval
- Python 3.8+
- Ollama installed and running on http://localhost:11434
- Required Ollama models:
    ollama pull nomic-embed-text  # For embeddings
    ollama pull qwen3:0.6b        # For chat responses

- Clone the repository:
    git clone https://github.com/Marc0Guo/TAI.git
    cd TAI
- Install Python dependencies:
    pip install -r requirements.txt
- Ensure Ollama is running:
    ollama serve

Extract content from the Info201 book website:
    python ingestion.py
This will:
- Discover all chapter URLs from the Info201 book
- Extract structured content (chapters, sections, text, code blocks)
- Save to info201_data.json
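
The chapter-discovery step could look roughly like the sketch below. It is not taken from `ingestion.py`: it uses the standard library's `html.parser` instead of `beautifulsoup4` so that it runs without extra dependencies, the names `LinkCollector` and `discover_chapter_urls` are hypothetical, and it assumes that chapter pages are `.html` files located under the book's base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_chapter_urls(index_html, base_url):
    """Return unique chapter URLs under base_url, in order of first appearance."""
    parser = LinkCollector()
    parser.feed(index_html)
    seen, urls = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#section" fragments.
        url, _fragment = urldefrag(urljoin(base_url, href))
        # Assumption: chapters are .html pages below the book's base URL.
        if url.startswith(base_url) and url.endswith(".html") and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Dropping fragments before deduplicating means that several table-of-contents links into the same chapter yield a single URL to fetch.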
Create embeddings and store in ChromaDB:
    python indexing.py
This will:
- Load data from info201_data.json
- Chunk content into 300-600 character pieces
- Generate embeddings using Ollama
- Store in ChromaDB collection info201_book (saved to ./chroma_db/)
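
As an illustration of sentence-aware chunking, here is a minimal sketch. It assumes sentences end with `.`, `!`, or `?` followed by whitespace, which is only a heuristic; the function name `chunk_text` and its parameters are hypothetical and may not match what `indexing.py` does.

```python
import re


def chunk_text(text, min_chars=300, max_chars=600):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Heuristic sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than max_chars is hard-split.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        # Merge a short tail into the previous chunk when it still fits.
        fits = chunks and len(chunks[-1]) + 1 + len(current) <= max_chars
        if fits and len(current) < min_chars:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks
```

In this sketch only `max_chars` is a hard limit. `min_chars` is used solely to decide whether a short final chunk should be merged into the one before it, so the last chunk of an entry can still be shorter than 300 characters.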
Ask questions about R and data science:
    python query.py "How to create a scatterplot in R?"
Ask a single question by providing it as an argument:
    python query.py "your question here"
Example:
# R syntax questions
python query.py "How to load CSV files in R?"
# Data visualization
python query.py "How to create a scatterplot with ggplot2?"
# Data manipulation
python query.py "How to filter data frames in R?"
Run without arguments to enter interactive mode, where you can ask multiple questions:
    python query.py
Then type your questions. Type quit, exit, or q to exit.
Example Session:
Info201 TA Agent - Interactive Mode
Type your question (or 'quit'/'exit' to exit):
============================================================
> How to load CSV files in R?
[Response appears here...]
> How to create a scatterplot?
[Response appears here...]
> quit
Goodbye!
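
An interactive loop that behaves like the session above might be sketched as follows. The function `run_interactive` and its `answer`, `read`, and `write` parameters are hypothetical; injecting them simply makes the loop testable without a terminal or a running model.

```python
def run_interactive(answer, read=input, write=print):
    """Read questions until the user types quit, exit, or q."""
    write("Info201 TA Agent - Interactive Mode")
    write("Type your question (or 'quit'/'exit' to exit):")
    write("=" * 60)
    while True:
        try:
            question = read("> ").strip()
        except (EOFError, KeyboardInterrupt):
            # Treat Ctrl-D / Ctrl-C like an explicit quit.
            break
        if question.lower() in {"quit", "exit", "q"}:
            break
        if not question:
            # Ignore empty lines instead of querying the model.
            continue
        write(answer(question))
    write("Goodbye!")
```

In a real script, `answer` would be whatever function runs the retrieval and generation steps for a single question.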
TAI/
├── ingestion.py # Web scraping and content extraction
├── indexing.py # Chunking, embedding generation, and ChromaDB indexing
├── query.py # Query interface and RAG pipeline
├── requirements.txt # Python dependencies
├── info201_data.json # Extracted content (generated)
└── chroma_db/ # ChromaDB persistent storage (generated)
- Ingestion (ingestion.py)
  - Fetches HTML pages from the Info201 book website
  - Extracts chapter titles, section headings, text content, and R code blocks
  - Structures data with metadata (URL, chapter, section)
  - Outputs a JSON file with structured entries
- Indexing (indexing.py)
  - Loads structured data from JSON
  - Chunks text into 300-600 character pieces (sentence-aware)
  - Generates embeddings using Ollama's nomic-embed-text model
  - Stores embeddings and metadata in ChromaDB
- Query (query.py)
  - Takes a student question as input
  - Computes an embedding for the question
  - Retrieves the top-k most similar chunks from ChromaDB
  - Builds context with source URLs and section information
  - Generates a response using Ollama's qwen3:0.6b model
  - Returns the answer with citations
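
To make the retrieval step concrete, the sketch below ranks chunks by similarity to a question embedding using brute force. It is a stand-in for what ChromaDB does internally, not the project's code: the helper names are hypothetical, and cosine similarity is chosen here for illustration, since the distance metric configured for the collection is not stated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=5):
    """Return the k best (score, chunk_text) pairs.

    indexed is a list of (chunk_text, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A vector database avoids scanning every chunk like this, but the returned result has the same shape: a short ranked list of chunks that becomes the context for the chat model.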
Info201 Book Website
↓
[ingestion.py] → info201_data.json
↓
[indexing.py] → ChromaDB (embeddings + metadata)
↓
[query.py] → Student Question → Answer with Citations
{
"chapter_title": "Data Frames",
"section_title": "Loading csv files",
"url": "https://faculty.washington.edu/otoomet/info201-book/data-frames.html",
"text_chunk": "read_delim() reads the given csv file...",
"code_block": "data <- read_delim('file.csv')"
}
Each chunk includes:
- Document: Chunked text content
- Embedding: Vector representation from Ollama
- Metadata:
  - chapter_title: Source chapter
  - section_title: Source section
  - url: Source URL
  - has_code: Boolean flag
  - code_block: R code (if present)
  - chunk_index: Position within entry
  - total_chunks: Total chunks for entry
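
The per-chunk records described above could be assembled along these lines. This is a sketch only: `build_records` is a hypothetical helper, and storing a missing `code_block` as an empty string is an assumption made so that every metadata value is a plain scalar.

```python
def build_records(entry, chunks):
    """Pair each chunk of one JSON entry with the metadata fields listed above."""
    # Assumption: a missing code block is stored as "" rather than None.
    code = entry.get("code_block") or ""
    records = []
    for index, chunk in enumerate(chunks):
        records.append({
            "document": chunk,
            "metadata": {
                "chapter_title": entry["chapter_title"],
                "section_title": entry["section_title"],
                "url": entry["url"],
                "has_code": bool(code),
                "code_block": code,
                "chunk_index": index,
                "total_chunks": len(chunks),
            },
        })
    return records
```

Keeping `chunk_index` and `total_chunks` on every record makes it possible to tell, at query time, whether a retrieved chunk is only part of a longer entry.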
Default settings (can be modified in code):
- Embedding Model: nomic-embed-text
- Chat Model: qwen3:0.6b
- Chunk Size: 300-600 characters
- Top-K Retrieval: 5 chunks
- Collection Name: info201_book
- DB Path: ./chroma_db
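
One way to keep these defaults in a single place is a small settings object, sketched below. The `Settings` class and its field names are hypothetical; the project may instead define these values as constants in each script.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Default values copied from the list above; field names are illustrative."""
    embedding_model: str = "nomic-embed-text"
    chat_model: str = "qwen3:0.6b"
    chunk_min_chars: int = 300
    chunk_max_chars: int = 600
    top_k: int = 5
    collection_name: str = "info201_book"
    db_path: str = "./chroma_db"


DEFAULTS = Settings()

# Derive a variant without mutating the shared defaults.
wider_search = replace(DEFAULTS, top_k=8)
```

Because the dataclass is frozen, an experiment such as retrieving more chunks creates a new object and leaves the defaults untouched.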
- Question Processing: User question is converted to an embedding vector
- Semantic Search: ChromaDB finds the most similar content chunks
- Context Building: Retrieved chunks are formatted with URLs, sections, and code examples
- Response Generation: Ollama chat model generates answer based on context
- Citation: Response includes source URL and section reference
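
The context-building step can be illustrated with the sketch below. The functions `build_context` and `build_prompt`, the layout of each source block, and the instruction wording are all assumptions made for illustration; the prompt actually used by `query.py` is not shown in this document.

```python
def build_context(hits):
    """Format retrieved chunks as numbered source blocks with URL and section."""
    blocks = []
    for number, hit in enumerate(hits, start=1):
        meta = hit["metadata"]
        lines = [
            f"[Source {number}] {meta['chapter_title']} / {meta['section_title']}",
            f"URL: {meta['url']}",
            hit["document"],
        ]
        # Include the textbook's R code only when the chunk carries some.
        if meta.get("has_code") and meta.get("code_block"):
            lines.append("R code:\n" + meta["code_block"])
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_prompt(question, hits):
    """Combine instructions, formatted context, and the student's question."""
    return (
        "Answer using only the context below and cite the source URL and section.\n\n"
        f"Context:\n{build_context(hits)}\n\nQuestion: {question}"
    )
```

Placing the URL and section title next to each chunk is what allows the model to cite its source in the final answer.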
The TA agent follows strict guidelines:
- ✅ Answers only based on Info201 book content
- ✅ Provides source citations with every answer
- ✅ Prefers code examples from the textbook
- ✅ Admits when it doesn't know: "Sorry, I don't know. Please ask a human TA."
- ✅ Only answers R and data science topics
- ❌ Does not answer questions on other topics
- ❌ Does not directly provide answers to homework questions
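
How these guidelines are enforced is not described here; they may live entirely in the model's system prompt. As one possible complement, the sketch below returns the fallback message whenever no retrieved chunk is close enough to the question. The function `answer_or_fallback`, the `distance` field, and the `max_distance` threshold of 0.8 are all hypothetical.

```python
FALLBACK = "Sorry, I don't know. Please ask a human TA."


def answer_or_fallback(hits, generate, max_distance=0.8):
    """Call generate() only when at least one hit is within max_distance."""
    # Smaller distance means the chunk is more similar to the question.
    relevant = [hit for hit in hits if hit["distance"] <= max_distance]
    if not relevant:
        return FALLBACK
    return generate(relevant)
```

A sensible threshold depends on the embedding model and distance metric, so it would need tuning against real questions.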
- beautifulsoup4 - HTML parsing
- requests - HTTP requests
- chromadb - Vector database
- ollama - Local LLM API client