Info201 TA Agent

This AI agent uses a Retrieval-Augmented Generation (RAG) system to answer student questions about R programming and data science, using content from the Info201 textbook at the University of Washington.

🚀 Features

Web Scraping & Content Extraction: Automatically crawls and extracts structured content from Info201 book chapters [https://faculty.washington.edu/otoomet/info201-book/]
Intelligent Chunking: Splits content into 300-600 character chunks with sentence boundary awareness
Semantic Search: Uses Ollama embeddings (nomic-embed-text) for content retrieval
Context-Aware Responses: Generates answers with source citations, section references, and code examples
Simple CLI: Easy-to-use command-line interface for querying the system
Persistent Storage: ChromaDB for efficient vector storage and retrieval

📋 Prerequisites

Python 3.8+
Ollama installed and running on http://localhost:11434

Required Ollama models:

ollama pull nomic-embed-text  # For embeddings
ollama pull qwen3:0.6b        # For chat responses

🔧 Installation

Clone the repository:

git clone https://github.com/Marc0Guo/TAI.git
cd TAI

Install Python dependencies:

pip install -r requirements.txt

Ensure Ollama is running:
```
ollama serve
```

🏃 Quick Start

Step 1: Ingest Content

Extract content from the Info201 book website:

python ingestion.py

This will:

Discover all chapter URLs from the Info201 book
Extract structured content (chapters, sections, text, code blocks)
Save to info201_data.json
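
As a rough illustration of the extraction step, the sketch below pulls section titles, paragraph text, and code blocks out of an HTML string. It uses Python's standard-library html.parser as a stand-in so that it runs without extra packages; the project itself lists beautifulsoup4 for parsing, and the tag names handled here (h1-h3, p, pre) are assumptions about the book's markup, not something confirmed by ingestion.py. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3")  # assumed heading tags for sections


class SectionExtractor(HTMLParser):
    """Collect one entry per heading, with the text and code that follow it."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._tag = None      # tag currently being captured, if any
        self._buffer = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag in ("p", "pre"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of a nested element such as <code> or <b>
        content = "".join(self._buffer).strip()
        self._tag = None
        if not content:
            return
        if tag in HEADINGS:
            self.entries.append(
                {"section_title": content, "text_chunk": "", "code_block": ""}
            )
        elif self.entries:
            key = "code_block" if tag == "pre" else "text_chunk"
            current = self.entries[-1]
            current[key] = (current[key] + "\n" + content).strip()


def extract_sections(html):
    """Return a list of entry dicts extracted from one HTML page."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.entries
```

Text that appears before the first heading is ignored in this sketch; a real crawler would also attach the chapter title and page URL to each entry.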

Step 2: Index Content

Create embeddings and store in ChromaDB:

python indexing.py

This will:

Load data from info201_data.json
Chunk content into 300-600 character pieces
Generate embeddings using Ollama
Store in ChromaDB collection info201_book (saved to ./chroma_db/)
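
The chunking step can be sketched as follows. This is a minimal illustration of sentence-aware chunking within the 300-600 character range, not the code in indexing.py: the regex-based sentence split, the hard split for overlong sentences, and the function name chunk_text are all assumptions.

```python
import re

MIN_CHARS = 300  # lower bound stated in this README
MAX_CHARS = 600  # upper bound stated in this README


def chunk_text(text, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than max_chars is hard-split as a last resort.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        # Merge a short tail into the previous chunk when it still fits.
        fits = chunks and len(chunks[-1]) + 1 + len(current) <= max_chars
        if fits and len(current) < min_chars:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks
```

Entries shorter than 300 characters come back as a single short chunk, so the lower bound is a target rather than a guarantee in this sketch.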

Step 3: Query the System

Ask questions about R and data science:

python query.py "How to create a scatterplot in R?"

💻 Usage

Single Question Mode

Ask a single question by providing it as an argument:

python query.py "your question here"

Examples:

# R syntax questions
python query.py "How to load CSV files in R?"

# Data visualization
python query.py "How to create a scatterplot with ggplot2?"

# Data manipulation
python query.py "How to filter data frames in R?"

Interactive Mode

Run without arguments to enter interactive mode, where you can ask multiple questions:

python query.py

Then type your questions. Type quit, exit, or q to exit.

Example Session:

Info201 TA Agent - Interactive Mode
Type your question (or 'quit'/'exit' to exit):
============================================================

> How to load CSV files in R?

[Response appears here...]

> How to create a scatterplot?

[Response appears here...]

> quit
Goodbye!
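
The two modes can be sketched as a single dispatch function. This is only an illustration of the behavior described above: answer() is a hypothetical placeholder for the real RAG pipeline, and the read_line/write parameters are an assumption added to make the loop easy to test.

```python
EXIT_WORDS = {"quit", "exit", "q"}  # exit commands listed in this README


def answer(question):
    """Hypothetical placeholder for the real RAG pipeline in query.py."""
    return f"[Response to: {question}]"


def run(argv, read_line=input, write=print):
    """Single-question mode if an argument is given, otherwise interactive mode."""
    if len(argv) > 1:
        write(answer(" ".join(argv[1:])))
        return
    write("Info201 TA Agent - Interactive Mode")
    while True:
        try:
            question = read_line("> ").strip()
        except EOFError:
            break  # treat end of input like an exit command
        if question.lower() in EXIT_WORDS:
            break
        if question:  # skip empty lines
            write(answer(question))
    write("Goodbye!")
```

In a script, this would be called as run(sys.argv) from the entry point.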

📁 Project Structure

TAI/
├── ingestion.py          # Web scraping and content extraction
├── indexing.py           # Chunking, embedding generation, and ChromaDB indexing
├── query.py              # Query interface and RAG pipeline
├── requirements.txt      # Python dependencies
├── info201_data.json     # Extracted content (generated)
└── chroma_db/            # ChromaDB persistent storage (generated)

🏗️ Architecture

Pipeline Overview

1. Ingestion (ingestion.py)
   - Fetches HTML pages from Info201 book website
   - Extracts chapter titles, section headings, text content, and R code blocks
   - Structures data with metadata (URL, chapter, section)
   - Outputs JSON file with structured entries

2. Indexing (indexing.py)
   - Loads structured data from JSON
   - Chunks text into 300-600 character pieces (sentence-aware)
   - Generates embeddings using Ollama's nomic-embed-text model
   - Stores embeddings and metadata in ChromaDB

3. Query (query.py)
   - Takes student question as input
   - Computes embedding for the question
   - Retrieves top-k most similar chunks from ChromaDB
   - Builds context with source URLs and section information
   - Generates response using Ollama's qwen3:0.6b model
   - Returns answer with citations
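
To make the retrieval step concrete, here is a small pure-Python stand-in for what the vector store does when asked for the top-k most similar chunks. In the project this work is delegated to ChromaDB; cosine similarity is an assumption made for the sketch, since this README does not state which distance metric the collection uses, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=5):
    """Return the k (score, chunk) pairs most similar to query_vec, best first.

    indexed is a list of (vector, chunk) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), chunk) for vec, chunk in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The default of k=5 mirrors the Top-K Retrieval setting listed under Configuration.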

Data Flow

Info201 Book Website
    ↓
[ingestion.py] → info201_data.json
    ↓
[indexing.py] → ChromaDB (embeddings + metadata)
    ↓
[query.py] → Student Question → Answer with Citations

📊 Data Format

Input (from ingestion)

{
  "chapter_title": "Data Frames",
  "section_title": "Loading csv files",
  "url": "https://faculty.washington.edu/otoomet/info201-book/data-frames.html",
  "text_chunk": "read_delim() reads the given csv file...",
  "code_block": "data <- read_delim('file.csv')"
}

ChromaDB Storage

Each chunk includes:

Document: Chunked text content
Embedding: Vector representation from Ollama
Metadata:
- chapter_title: Source chapter
- section_title: Source section
- url: Source URL
- has_code: Boolean flag
- code_block: R code (if present)
- chunk_index: Position within entry
- total_chunks: Total chunks for entry
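
A sketch of how one ingested entry and its chunks might be turned into records with the metadata fields listed above. The function name and the ID scheme are hypothetical. The empty string for a missing code block is a choice made here to keep every metadata value a plain string, integer, or boolean.

```python
def build_records(entry, chunks):
    """Turn one ingested entry plus its chunks into (id, document, metadata) tuples."""
    code = entry.get("code_block") or ""
    records = []
    for index, chunk in enumerate(chunks):
        metadata = {
            "chapter_title": entry["chapter_title"],
            "section_title": entry["section_title"],
            "url": entry["url"],
            "has_code": bool(code),
            "code_block": code,
            "chunk_index": index,
            "total_chunks": len(chunks),
        }
        # Hypothetical ID scheme (URL + section + index), chosen only to be unique.
        record_id = f'{entry["url"]}#{entry["section_title"]}#{index}'
        records.append((record_id, chunk, metadata))
    return records
```

Each tuple holds the three pieces stored per chunk, apart from the embedding, which would be computed separately.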

⚙️ Configuration

Default settings (can be modified in code):

Embedding Model: nomic-embed-text
Chat Model: qwen3:0.6b
Chunk Size: 300-600 characters
Top-K Retrieval: 5 chunks
Collection Name: info201_book
DB Path: ./chroma_db
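
One way to keep these defaults in a single place is a small settings object, sketched below. The Settings class and its field names are hypothetical and are not taken from the project code; only the values come from the list above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Default settings as listed in this README."""
    embedding_model: str = "nomic-embed-text"
    chat_model: str = "qwen3:0.6b"
    chunk_min_chars: int = 300
    chunk_max_chars: int = 600
    top_k: int = 5
    collection_name: str = "info201_book"
    db_path: str = "./chroma_db"


DEFAULTS = Settings()

# Override a single value without mutating the defaults.
small_k = replace(DEFAULTS, top_k=3)
```

Making the dataclass frozen prevents accidental changes at runtime; variations are created with replace() instead.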

🔍 How It Works

1. Question Processing: The user's question is converted to an embedding vector
2. Semantic Search: ChromaDB finds the most similar content chunks
3. Context Building: Retrieved chunks are formatted with URLs, sections, and code examples
4. Response Generation: The Ollama chat model generates an answer based on the context
5. Citation: The response includes the source URL and section reference
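
The context-building step can be illustrated with a short sketch. It assumes each retrieved hit is a (document, metadata) pair using the metadata fields from the ChromaDB Storage section; the exact layout of the context and the prompt wording are assumptions, not the format used in query.py.

```python
def build_context(hits):
    """Format retrieved chunks into one context string with numbered sources."""
    parts = []
    for number, (document, meta) in enumerate(hits, start=1):
        lines = [
            f"[{number}] {meta['chapter_title']} > {meta['section_title']}",
            f"Source: {meta['url']}",
            document,
        ]
        # Attach the textbook code example when the chunk has one.
        if meta.get("has_code") and meta.get("code_block"):
            lines.append("Code example:")
            lines.append(meta["code_block"])
        parts.append("\n".join(lines))
    return "\n\n".join(parts)


def build_prompt(question, context):
    """Combine the context and the student's question into one prompt string."""
    return (
        "Answer the question using only the context below and cite the source.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Numbering the sources gives the chat model something short to refer to when it cites a section.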

📝 System Rules

The TA agent follows strict guidelines:

✅ Answers only based on Info201 book content
✅ Provides source citations with every answer
✅ Prefers code examples from the textbook
✅ Admits when it doesn't know: "Sorry, I don't know. Please ask a human TA."
✅ Only answers R and data science topics
❌ Does not answer questions on other topics
❌ Does not directly provide answers to homework questions
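
These rules could be enforced in two places: in the system prompt sent to the chat model, and in a guard that returns the fallback message when nothing relevant was retrieved. The sketch below shows both under stated assumptions: the prompt wording, the 0.5 similarity threshold, and the function name are hypothetical, and only the fallback sentence is quoted from the rules above.

```python
FALLBACK = "Sorry, I don't know. Please ask a human TA."  # wording from the rules

# Hypothetical system prompt that restates the rules for the chat model.
SYSTEM_PROMPT = "\n".join([
    "You are a teaching assistant for Info201.",
    "Answer only from the provided Info201 book context.",
    "Cite the source URL and section for every answer.",
    "Prefer code examples taken from the textbook.",
    "Only answer questions about R and data science.",
    "Do not give direct answers to homework questions.",
    f'If the context does not contain the answer, reply exactly: "{FALLBACK}"',
])


def answer_or_fallback(hits, generate, min_score=0.5):
    """Return the fallback when no hit is relevant enough, else a generated answer.

    hits is a list of (score, chunk) pairs; generate is a callable that
    produces an answer from the relevant chunks.
    """
    relevant = [chunk for score, chunk in hits if score >= min_score]
    if not relevant:
        return FALLBACK
    return generate(relevant)
```

A guard like this avoids calling the chat model at all when retrieval finds nothing useful.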

🛠️ Dependencies

beautifulsoup4 - HTML parsing
requests - HTTP requests
chromadb - Vector database
ollama - Local LLM API client
