Skip to content

Marc0Guo/TAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Info201 TA Agent

This AI Agent used A Retrieval-Augmented Generation (RAG) system to answers student questions about R programming and data science using content from the Info201 textbook at the University of Washington.

🚀 Features

  • Web Scraping & Content Extraction: Automatically crawls and extracts structured content from Info201 book chapters [https://faculty.washington.edu/otoomet/info201-book/]
  • Intelligent Chunking: Splits content into 300-600 character chunks with sentence boundary awareness
  • Semantic Search: Uses Ollama embeddings (nomic-embed-text) for content retrieval
  • Context-Aware Responses: Generates answers with source citations, section references, and code examples
  • Simple CLI: Easy-to-use command-line interface for querying the system
  • Persistent Storage: ChromaDB for efficient vector storage and retrieval

📋 Prerequisites

  • Python 3.8+
  • Ollama installed and running on http://localhost:11434
  • Required Ollama models:
    ollama pull nomic-embed-text  # For embeddings
    ollama pull qwen3:0.6b        # For chat responses

🔧 Installation

  1. Clone the repository:

    git clone https://github.com/Marc0Guo/TAI.git
    cd TAI
  2. Install Python dependencies:

pip install -r requirements.txt
  1. Ensure Ollama is running:
    ollama serve

🏃 Quick Start

Step 1: Ingest Content

Extract content from the Info201 book website:

python ingestion.py

This will:

  • Discover all chapter URLs from the Info201 book
  • Extract structured content (chapters, sections, text, code blocks)
  • Save to info201_data.json

Step 2: Index Content

Create embeddings and store in ChromaDB:

python indexing.py

This will:

  • Load data from info201_data.json
  • Chunk content into 300-600 character pieces
  • Generate embeddings using Ollama
  • Store in ChromaDB collection info201_book (saved to ./chroma_db/)

Step 3: Query the System

Ask questions about R and data science:

python query.py "How to create a scatterplot in R?"

💻 Usage

Single Question Mode

Ask a single question by providing it as an argument:

python query.py "your question here"

Example:

# R syntax questions
python query.py "How to load CSV files in R?"

# Data visualization
python query.py "How to create a scatterplot with ggplot2?"

# Data manipulation
python query.py "How to filter data frames in R?"

Interactive Mode

Run without arguments to enter interactive mode, where you can ask multiple questions:

python query.py

Then type your questions. Type quit, exit, or q to exit.

Example Session:

Info201 TA Agent - Interactive Mode
Type your question (or 'quit'/'exit' to exit):
============================================================

> How to load CSV files in R?

[Response appears here...]

> How to create a scatterplot?

[Response appears here...]

> quit
Goodbye!

📁 Project Structure

TAI/
├── ingestion.py          # Web scraping and content extraction
├── indexing.py           # Chunking, embedding generation, and ChromaDB indexing
├── query.py              # Query interface and RAG pipeline
├── requirements.txt      # Python dependencies
├── info201_data.json     # Extracted content (generated)
└── chroma_db/            # ChromaDB persistent storage (generated)

🏗️ Architecture

Pipeline Overview

  1. Ingestion (ingestion.py)

    • Fetches HTML pages from Info201 book website
    • Extracts chapter titles, section headings, text content, and R code blocks
    • Structures data with metadata (URL, chapter, section)
    • Outputs JSON file with structured entries
  2. Indexing (indexing.py)

    • Loads structured data from JSON
    • Chunks text into 300-600 character pieces (sentence-aware)
    • Generates embeddings using Ollama's nomic-embed-text model
    • Stores embeddings and metadata in ChromaDB
  3. Query (query.py)

    • Takes student question as input
    • Computes embedding for the question
    • Retrieves top-k most similar chunks from ChromaDB
    • Builds context with source URLs and section information
    • Generates response using Ollama's qwen3:0.6b model
    • Returns answer with citations

Data Flow

Info201 Book Website
    ↓
[ingestion.py] → info201_data.json
    ↓
[indexing.py] → ChromaDB (embeddings + metadata)
    ↓
[query.py] → Student Question → Answer with Citations

📊 Data Format

Input (from ingestion)

{
  "chapter_title": "Data Frames",
  "section_title": "Loading csv files",
  "url": "https://faculty.washington.edu/otoomet/info201-book/data-frames.html",
  "text_chunk": "read_delim() reads the given csv file...",
  "code_block": "data <- read_delim('file.csv')"
}

ChromaDB Storage

Each chunk includes:

  • Document: Chunked text content
  • Embedding: Vector representation from Ollama
  • Metadata:
    • chapter_title: Source chapter
    • section_title: Source section
    • url: Source URL
    • has_code: Boolean flag
    • code_block: R code (if present)
    • chunk_index: Position within entry
    • total_chunks: Total chunks for entry

⚙️ Configuration

Default settings (can be modified in code):

  • Embedding Model: nomic-embed-text
  • Chat Model: qwen3:0.6b
  • Chunk Size: 300-600 characters
  • Top-K Retrieval: 5 chunks
  • Collection Name: info201_book
  • DB Path: ./chroma_db

🔍 How It Works

  1. Question Processing: User question is converted to an embedding vector
  2. Semantic Search: ChromaDB finds the most similar content chunks
  3. Context Building: Retrieved chunks are formatted with URLs, sections, and code examples
  4. Response Generation: Ollama chat model generates answer based on context
  5. Citation: Response includes source URL and section reference

📝 System Rules

The TA agent follows strict guidelines:

  • ✅ Answers only based on Info201 book content
  • ✅ Provides source citations with every answer
  • ✅ Prefers code examples from the textbook
  • ✅ Admits when it doesn't know: "Sorry, I don't know. Please ask a human TA."
  • ✅ Only answers R and data science topics
  • ❌ Cannot provide answers to other questions
  • ❌ Cannot directly provide answer to homework questions

🛠️ Dependencies

  • beautifulsoup4 - HTML parsing
  • requests - HTTP requests
  • chromadb - Vector database
  • ollama - Local LLM API client

About

A teaching AI Agent for INFO201

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages