An AI-powered academic advisor chatbot for San José State University students. Ask natural-language questions about course syllabi and receive grounded, citation-backed answers drawn from a PostgreSQL vector database.
- Overview
- Architecture
- Prerequisites
- Project Structure
- Environment Variables
- Database Setup
- Data Pipeline
- Python Backend
- Frontend
- Running the Application
- Testing
- Backup and Recovery
- API Reference
The Course Advisor answers student questions about SJSU course syllabi using a ReWOO-style agentic architecture:
- Planner — an LLM call that produces a structured JSON retrieval plan
- Worker — executes tool calls (vector similarity search against PostgreSQL/pgvector), running dependency-free steps concurrently via
asyncio.gather - Solver — synthesises evidence into a student-facing answer, citing the original syllabus URLs
Every answer is grounded strictly in official syllabus text; the agent is instructed never to hallucinate beyond what the documents contain.
┌──────────────────────────────────────────────────────────┐
│ Browser / Client │
│ React + Vite + Tailwind CSS │
└────────────────────────┬─────────────────────────────────┘
│ HTTP (proxied in dev)
┌────────────────────────▼─────────────────────────────────┐
│ FastAPI (course_advisor_api.py) │
│ /ask /feedback /admin – session management, CORS │
└────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────▼─────────────────────────────────┐
│ CourseAdvisor (course_advisor.py) │
│ LangGraph ReWOO: Planner → Worker → Solver │
│ LLM: Google Gemini via LangChain + google-genai │
└──────────┬─────────────────────────────┬─────────────────┘
│ asyncpg │ Gemini Embeddings API
┌──────────▼──────────┐ ┌──────────▼──────────┐
│ PostgreSQL + │ │ Google AI Platform │
│ pgvector │ │ (text-embedding-*) │
│ (syllabus chunks) │ └─────────────────────┘
└─────────────────────┘
Data pipeline (one-time setup):
scrape_syllabi.py → chunk_syllabi.py → embed_chunks.py → load_database.py
| Requirement | Version |
|---|---|
| Python | 3.11+ |
| Node.js | 18+ |
| PostgreSQL | 14+ with pgvector |
| Google AI API key | Get one here |
Install pgvector by following the instructions at https://github.com/pgvector/pgvector before proceeding.
SJSU_course_advisor/
├── course_advisor.py # ReWOO agent: Planner, Worker, Solver
├── course_advisor_api.py # FastAPI app: /ask, /feedback, /admin endpoints
├── course_advisor_db.sql # PostgreSQL schema (colleges → chunks + feedback)
├── requirements.txt # Python dependencies
├── .env # Environment variables (do not commit)
│
├── frontend/ # React + Vite + Tailwind UI
│ ├── src/
│ │ ├── App.tsx # Main chat interface
│ │ ├── AdminPage.tsx # Feedback review dashboard
│ │ └── components/ui/ # shadcn/ui component library
│ ├── package.json
│ └── vite.config.ts # Dev proxy: /ask, /health → localhost:8000
│
├── scripts/
│ ├── scrape_syllabi.py # Playwright scraper for campusconcourse.com
│ ├── chunk_syllabi.py # Splits syllabus HTML into semantic chunks
│ ├── embed_chunks.py # Generates 768-dim Gemini embeddings
│ ├── embed_cmpe180b.py # Targeted embedder for a specific course
│ ├── load_database.py # Loads chunks_with_embeddings.csv → PostgreSQL
│ ├── course_advisor_db_indexing.sql # IVFFlat / HNSW vector index definitions
│ ├── feedback_test_data.sql # Seed data for feedback table tests
│ ├── backup.sh # pg_dump wrapper
│ └── restore.sh # Drop-recreate-restore wrapper
│
└── tests/
├── course_advisor_test.py # End-to-end agent smoke test
├── course_advisor_test_db_validation.py # pytest suite: schema, data, embeddings
├── concurrency_test.py # Async load test: burst, ramp, sustain
├── indexing_benchmark.py # EXPLAIN-based index speedup benchmark
├── indexing_tests.sql # SQL queries used by the benchmark
├── key_queries_test.py # pytest suite: vector search & feedback SQL queries
└── transactions_test.py # pytest suite: transaction isolation & rollback behaviour
Copy the template and fill in your values:
cp .env .env.local # optional: keep a local override.env contains:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=course_advisor
DB_USER=postgres
DB_PASSWORD=YOUR_PASSWORD_HERE
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY_HEREGOOGLE_API_KEY is used both by the backend agent (Gemini LLM + embeddings) and by the data-pipeline scripts. Never commit this file.
pip install -r requirements.txtKey packages:
| Package | Purpose |
|---|---|
fastapi, uvicorn |
ASGI web framework and server |
asyncpg, pgvector |
Async PostgreSQL driver with vector type support |
google-genai |
Google Gemini API (embeddings + LLM) |
langchain-google-genai, langgraph |
LangChain/LangGraph for the ReWOO agent graph |
pydantic |
Request/response schema validation |
python-dotenv |
.env file loading |
beautifulsoup4, playwright, pandas |
Data pipeline only |
psql -U postgres -c "CREATE DATABASE course_advisor;"
psql -U postgres -d course_advisor -f course_advisor_db.sql
# Also create the test database used by the validation suite
psql -U postgres -c "CREATE DATABASE course_advisor_test;"
psql -U postgres -d course_advisor_test -f course_advisor_db.sqlcourse_advisor_db.sql creates the following schema (all tables are dropped and recreated on each run):
college → department → subject → course → section → syllabus_chunk
↕
instructor (via instructor_department)
↕
feedback
syllabus_chunk stores a vector(768) column backed by pgvector.
Run these steps once to populate the database from scratch. If you have already downloaded the pre-built CSV, skip to step 4.
Crawls SJSU Campus Concourse for syllabus metadata and URLs (requires Playwright):
# First-time Playwright setup
playwright install chromium
python scripts/scrape_syllabi.py
# Output: scripts/syllabi_db.csvDownloads each syllabus HTML page and splits it into semantically coherent text chunks:
python scripts/chunk_syllabi.py
# Output: scripts/chunks.csvCalls the Google Gemini embedding API to produce 768-dimensional vectors for each chunk:
python scripts/embed_chunks.py
# Output: scripts/chunks_with_embeddings.csvThis step makes many API calls and may take a while depending on the dataset size. The script includes rate-limit handling.
Download the pre-built CSV (recommended): chunks_with_embeddings.csv on Google Drive
Place the file at scripts/chunks_with_embeddings.csv, then run:
python scripts/load_database.pyThis populates all tables (college, department, subject, course, section, instructor, syllabus_chunk) and inserts optional feedback seed data.
uvicorn course_advisor_api:app --reload --env-file .envThe API starts at http://localhost:8000. In production, remove --reload and set appropriate CORS origins in course_advisor_api.py.
Session behaviour: Each conversation is assigned a UUID session ID. The server retains up to 5 prior user questions per session (no answers) for follow-up context resolution. Sessions expire after 6 hours of inactivity.
cd frontend
npm install
npm run devThe Vite dev server starts at http://localhost:5173 and proxies /ask and /health to the FastAPI backend at http://localhost:8000 (configured in vite.config.ts).
cd frontend
npm run buildThe compiled static assets land in frontend/dist/. FastAPI serves them directly — navigate back to the project root and start the server as described above; visiting http://localhost:8000 loads the built frontend.
| Package | Purpose |
|---|---|
| React 19 + TypeScript | UI framework |
| Vite 7 | Build tool and dev server |
| Tailwind CSS 4 | Utility-first styling |
| shadcn/ui + Radix UI | Accessible component primitives |
| react-markdown | Renders markdown in chat responses |
| recharts | Charts on the admin feedback dashboard |
| lucide-react | Icons |
After completing setup, the typical development workflow is:
# Terminal 1 — backend
uvicorn course_advisor_api:app --reload --env-file .env
# Terminal 2 — frontend dev server (optional; skip if using the built frontend)
cd frontend && npm run devThen open http://localhost:5173 (dev) or http://localhost:8000 (production build).
Verifies the full Planner → Worker → Solver pipeline against the live database:
python tests/course_advisor_test.pyA pytest suite with 8 phases covering connection, schema, row counts, uniqueness, referential integrity, data quality, embedding dimensions, and a full DB ↔ CSV comparison:
pytest tests/course_advisor_test_db_validation.py -v -sRequires scripts/chunks_with_embeddings.csv to be present (used as the ground-truth source).
| Phase | Tests |
|---|---|
| 1 · Connection & Schema | DB reachable (SELECT 1), all 7 expected tables present |
| 2 · Count validations | Row counts match CSV for colleges, departments, subjects, courses, instructors, sections, and chunks |
| 3 · Set comparisons | Exact subject-code and instructor-name sets match CSV (no extras or missing in either direction) |
| 4 · Unique constraints | No duplicate subject codes, instructor names, or section composite keys |
| 5 · Referential integrity | No orphaned subjects (missing department), sections (missing course or instructor), or chunks (missing section) |
| 6 · Data quality | No null/empty course titles or instructor names; units positive; year in 2000–2100 range; delivery values in allowed set |
| 7 · Embedding quality | No null embeddings; all vectors share the same dimensionality; no sections with zero chunks |
| 8 · Grand finale | Full row-by-row, column-by-column DataFrame comparison of DB against source CSV |
Stress-tests the live API with three scenarios — burst, ramp, and sustain — and prints a latency/error summary table:
# Install extra deps if needed
pip install httpx rich
python tests/concurrency_test.pyConfiguration constants at the top of the file control concurrency levels (BURST_N, RAMP_MAX, SUSTAIN_RPS, SUSTAIN_DUR) and which scenarios to skip (SKIP_BURST, SKIP_RAMP, SKIP_SUSTAIN). The API server must be running before executing this test.
Measures query execution time and I/O block counts with and without vector indexes using EXPLAIN (ANALYZE, BUFFERS):
python tests/indexing_benchmark.pyPrints a Rich table comparing base vs. indexed query performance across multiple scenarios defined in tests/indexing_tests.sql.
A pytest suite that validates every SQL query used by the course advisor against the live course_advisor_test database — vector search (pure and CTE-filtered), feedback INSERT, stats/count/items queries, and the chart aggregation. Tests seed their own data via seed_chunk and use a per-test rollback fixture so the shared database is never permanently modified.
Requires pgvector and psycopg2-binary (both in requirements.txt). The test database must have the schema applied and the syllabus data loaded before running.
pytest tests/key_queries_test.py -vCovers:
| Group | Tests |
|---|---|
| Pure vector search | closest result, LIMIT respected, all metadata columns present |
| Filtered vector search (CTE) | subject filter, subject with no match, instructor ILIKE, year + session |
| Feedback INSERT | all fields persisted correctly |
| Feedback stats | total / positive / negative counts, empty-table zeros |
| Feedback items | newest-first ordering, ILIKE search, pagination |
| Feedback chart | daily grouping, oldest-first ordering, empty-table no rows |
A pytest suite that verifies the REPEATABLE READ transaction used by get_feedback, ensuring that stats, count, and item queries all observe a consistent snapshot even when concurrent writes are committed mid-transaction.
pytest tests/transactions_test.py -vEach test runs against the course_advisor_test database. The fixture commits a DELETE FROM feedback before each test (using autocommit=True) so that concurrent connections opened inside the tests also see a clean baseline — an uncommitted delete would be invisible to those connections and leave the 100 k seed rows in view.
Covers:
| Test | What it verifies |
|---|---|
test_repeatable_read_isolates_concurrent_insert |
A committed insert on a second connection is invisible inside an open REPEATABLE READ transaction |
test_stats_reflect_full_table_not_filter |
Stats query counts all rows; a filtered sub-query operates independently in the same snapshot |
test_rollback_on_error |
An exception mid-transaction rolls back all writes in that block |
test_committed_insert_is_visible |
A committed insert is immediately visible to a subsequent query on the same connection |
test_pagination_within_transaction |
LIMIT/OFFSET pages are non-overlapping and stable within a single REPEATABLE READ transaction |
./scripts/backup.sh
# Saves a timestamped .sql dump to ./backups/Download the latest backup if you don't have one: course_advisor_db backup on Google Drive
./scripts/restore.sh backups/course_advisor_db_2026-04-26_03-21-28.sqlThe restore script will ask for confirmation before dropping and recreating the database.
All endpoints are served by course_advisor_api.py at http://localhost:8000.
Submit a question and receive an AI-generated answer grounded in syllabus data.
Request body:
{
"question": "What topics are covered in CMPE 180B?",
"session_id": "optional-uuid-string"
}Response:
{
"answer": "CMPE 180B covers...",
"session_id": "uuid-assigned-or-echoed"
}Record a thumbs-up or thumbs-down on a response.
Request body:
{
"session_id": "uuid",
"question": "What topics are covered in CMPE 180B?",
"answer": "CMPE 180B covers...",
"is_positive": true
}Returns {"status": "ok"} when the server is running.
Returns feedback records with optional filters. Query parameters:
| Parameter | Type | Description |
|---|---|---|
is_positive |
bool | Filter by thumbs-up (true) or thumbs-down (false) |
limit |
int | Maximum number of records to return |
offset |
int | Pagination offset |
