CS4485-Team-10 · AdvayChandramouli · May 8, 2026 · May 7, 2026 · May 8, 2026
diff --git a/README.md b/README.md
@@ -1,180 +1,204 @@
 # YouTube Intelligence Platform — Backend
 
-Backend services and pipelines for ingesting YouTube content, extracting health-related claims, matching them into narratives, and serving data via a FastAPI API.
+Backend for the **YouTube Intelligence Platform**: it ingests YouTube data, extracts health-related claims with an LLM, groups claims into narratives, and exposes the results through a versioned REST API.
 
-## Overview
+## What This Backend Does
 
-This repository contains:
+- Pulls YouTube metadata and transcripts into Supabase.
+- Filters and prioritizes content for public-health relevance (including semantic filtering and impact-style signals in batch flows).
+- Uses an LLM to extract generalizable **health-related claims** from transcripts.
+- Matches claims to **narratives** using embeddings and creates new narratives when needed.
+- Serves **versioned API endpoints** (`/api/v1/...`) so clients can query health, overview, claims, narratives, ingestion, and related resources.
 
-- **FastAPI service** under `app/` with versioned REST endpoints (`/api/v1/...`).
-- **Database models + migrations** (SQLModel/SQLAlchemy + Alembic) for the core analytics schema.
-- **Pipelines** for:
-  - ingesting YouTube metadata + transcripts into Supabase
-  - extracting claims via an LLM provider (local Ollama by default)
-  - matching claims to narratives using embeddings (AWS Bedrock Titan Text Embeddings V2)
-- **Scripts** in `scripts/` for running pipeline utilities locally.
+## Architecture Overview
+
+Requests hit a **FastAPI** service, which reads from and writes to **Supabase** via SQLModel/SQLAlchemy. **Ingestion pipelines** fetch YouTube data (Data API + transcripts) and upsert it into the core tables. A separate **LLM pipeline** extracts claims and runs **narrative matching** (embedding similarity, backed by Amazon Bedrock Titan embeddings in the default setup). **Deployment** can be configured via `render.yaml`.
 
 ## Tech Stack
 
-- **API**: FastAPI + Uvicorn
-- **DB/ORM & Migrations**: Supabase, SQLAlchemy 2.x, SQLModel, Alembic (`alembic/`)
-- **YouTube Data Ingestion**: YouTube Data API v3 + `youtube-transcript-api`
-- **LLM inference**: Switchable Ollama/Bedrock provider abstraction (env-driven)
-- **Embeddings**: AWS Bedrock Titan Text Embeddings V2 (`amazon.titan-embed-text-v2:0`) via `bedrock-runtime` `invoke_model`
-- **Tooling**: Ruff (format + lint), `python-dotenv`
 
-## Setup
+| Area                         | Technologies                                                                                                    |
+| ---------------------------- | --------------------------------------------------------------------------------------------------------------- |
+| API & runtime                | FastAPI, Uvicorn                                                                                                |
+| DB & Schema                  | Supabase (Postgres), SQLAlchemy 2.x, SQLModel                                                                   |
+| Migrations                   | Alembic (`alembic/`)                                                                                            |
+| YouTube Data Ingestion       | YouTube Data API v3, `youtube-transcript-api`                                                                   |
+| Large Language Models (LLMs) | Switchable **Ollama / Amazon Bedrock** (environment-driven)                                                     |
+| Narrative Embeddings         | Amazon Bedrock - Titan Text Embeddings V2 (`amazon.titan-embed-text-v2:0`) via `bedrock-runtime` `invoke_model` |
+| Quality of life              | Ruff (format + lint), `python-dotenv`, Pydantic settings                                                        |
+
+
+## Repository Structure
+
+- `app/` — The live API: routers under `/api/v1`, configuration, database sessions, SQLModel tables, and request/response schemas. Pipeline helpers used by HTTP endpoints (for example, single-video ingest) live here under `app/pipelines/`.
+- `pipelines/` — Batch jobs that you run locally or configure to run asynchronously via AWS Lambda: YouTube search + filtering + persistence, LLM claim extraction and narrative linking, embedding-based matching helpers, and shared pipeline utilities.
+- `alembic/` — Database migrations (`versions/`) and Alembic runtime configuration (`env.py`).
+- `scripts/` — Developer utilities (for example, git hook setup and local orchestration scripts).
 
-### Python environment (venv)
+## Installation & Setup
 
-From `backend/`:
+From `backend/`, create a virtual environment and install dependencies:
 
 ```bash
 python -m venv .venv
 source .venv/bin/activate
 pip install -r requirements.txt
 ```
 
-Optional dev tooling (Ruff + hooks):
+Optional formatting, linting, and pre-commit hooks:
 
 ```bash
 pip install -r requirements-dev.txt
 bash scripts/setup-hooks.sh
 ```
 
-### Ruff (format + lint)
+Format and lint with Ruff:
 
 ```bash
 ruff format .
 ruff check .
 ```
 
-If you ran `scripts/setup-hooks.sh`, Ruff also runs automatically on staged files via the pre-commit hook.
+If you ran `scripts/setup-hooks.sh`, Ruff also runs on staged files via the pre-commit hook.
 
-### Environment variables (`.env`)
+## Running the Project
 
-Create a local `.env` (do not commit). This repo loads it via Pydantic settings and `python-dotenv`.
+**API server** — Starts the FastAPI app with auto-reload for local development:
 
-- **Required (local dev)**
-  - `DATABASE_URL`
-  - `SUPABASE_URL`
-  - `SUPABASE_SERVICE_ROLE_KEY`
-  - `YOUTUBE_DATA_API_KEY` (or `YOUTUBE_API_KEY`)
-- **Required (narrative matching / embeddings)**
-  - `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (or `AWS_DEFAULT_REGION`)
-- **Common optional**
-  - `FRONTEND_URL`, `PORT`, `ENV`
-  - `LLM_PROVIDER` (default `ollama`), `LLM_MODEL`, `OLLAMA_BASE_URL`
-  - `YT_QUOTA_DAILY_BUDGET_UNITS`
-  - `NARR_EMBEDDING_BACKEND=bedrock` (default), `NARR_EMBEDDING_MODEL=amazon.titan-embed-text-v2:0`, `NARR_EMBEDDING_DIMENSIONS=512`, `NARR_EMBEDDING_NORMALIZE=true`
-  - `NARR_EMBEDDING_TIMEOUT`, `NARR_STRONG_MATCH`, `NARR_MULTI_LINK`, `NARR_NEW_MIN`, `NARR_MAX_PER_CLAIM`
+```bash
+uvicorn app.main:app --reload
+```
 
-Provider notes:
+Health check: `GET /api/v1/health`
 
-- `pipelines/yt_data_ingestion.py` semantic filtering now uses `LLM_PROVIDER` + `LLM_MODEL` as its switch mechanism (`ollama` or `bedrock`), with ingestion default model `gemma2` when `LLM_MODEL` is unset.
-- `pipelines/llm_insight_generation.py` uses the same provider variables, with default model `qwen3` when `LLM_MODEL` is unset.
+---
 
-Google Console:
+**Ingest a single video (via API)** — Fetches video, channel, and transcript data and upserts into Supabase (`channels`, `videos`, `transcripts`):
 
-- **YouTube**: enable *YouTube Data API v3* → create API key → set `YOUTUBE_DATA_API_KEY`
+- `POST /api/v1/ingest/video` with JSON `{ "video_id": "<id>" }`  
+(Implementation uses `app/pipelines/yt_ingest.py`.)
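
As an illustration of the request shape above, here is a minimal sketch that builds the ingest call with the standard library. It assumes the API is running locally on Uvicorn's default port (8000); `build_ingest_request` and `send_ingest_request` are hypothetical helper names, not part of this repository.

```python
import json
import urllib.request

# Assumption: the API runs locally on Uvicorn's default port.
BASE_URL = "http://127.0.0.1:8000"


def build_ingest_request(video_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for single-video ingestion."""
    payload = json.dumps({"video_id": video_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest/video",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_ingest_request(video_id: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_ingest_request(video_id)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `send_ingest_request("<id>")` against a running server would trigger the ingest; the response schema is not documented here, so the sketch returns the body unparsed.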
 
-AWS Console:
+---
 
-- **Bedrock embeddings**: in your `AWS_REGION`, request access to `amazon.titan-embed-text-v2:0` in the Bedrock model catalog and ensure your IAM principal has `bedrock:InvokeModel` for that model
+**Batch ingest + filter** — Searches YouTube, applies an LLM semantic filter for public-health relevance, applies impact-style filtering, and persists videos and transcripts to Supabase:
 
-### Database migrations (Alembic)
+```bash
+python -m pipelines.yt_data_ingestion
+```
 
-Migrations live in `alembic/versions/`. Alembic reads the DB URL from `DATABASE_URL` (wired in `alembic/env.py`; `alembic.ini` is intentionally URL-less).
+Example provider overrides:
 
 ```bash
-alembic upgrade head
-alembic current
-alembic revision --autogenerate -m "describe your change"
+# local test
+LLM_PROVIDER=ollama LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion
+
+# cloud runtime
+LLM_PROVIDER=bedrock LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion
 ```
 
-## Running
+---
 
-### Run the API server
+**Claims + narratives (LLM insight generation)** — Reads transcripts without claims yet, extracts health-related claims, matches them to narratives with embeddings (default: Amazon Bedrock Titan Text Embeddings V2), opens new narratives when appropriate, and writes `claims`, `narratives`, and `claim_narratives`:
 
 ```bash
-uvicorn app.main:app --reload
+python -m pipelines.llm_insight_generation
 ```
 
-Health check:
+## Environment Variables
 
-- `GET /api/v1/health`
+Create a local `.env` (never commit it). The app loads it through Pydantic settings and `python-dotenv`.
 
-### Run pipelines locally
+**Required for typical local development**
 
-#### Ingest a single video (API route)
 
-- `POST /api/v1/ingest/video` with JSON `{ "video_id": "<id>" }`
-  - Uses `app/pipelines/yt_ingest.py` to fetch video/channel/transcript and upserts to Supabase tables (`channels`, `videos`, `transcripts`).
+| Variable                                    | Purpose                                  |
+| ------------------------------------------- | ---------------------------------------- |
+| `DATABASE_URL`                              | Postgres connection for SQLModel/Alembic |
+| `SUPABASE_URL`                              | Supabase project URL                     |
+| `SUPABASE_SERVICE_ROLE_KEY`                 | Service role access for Supabase         |
+| `YOUTUBE_DATA_API_KEY` or `YOUTUBE_API_KEY` | YouTube Data API v3                      |
 
-#### Batch ingest + filter (script entrypoint)
 
-```bash
-python -m pipelines.yt_data_ingestion
-```
+**Required for narrative matching / embeddings (default Bedrock path)**
 
-This pipeline searches YouTube, applies an LLM semantic filter (public-health relevance), filters by impact metrics, and persists `videos` + `transcripts` to Supabase.
 
-Example provider switching:
+| Variable                                     | Purpose                  |
+| -------------------------------------------- | ------------------------ |
+| `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` | AWS credentials          |
+| `AWS_REGION` or `AWS_DEFAULT_REGION`         | Region for Bedrock calls |
 
-```bash
-# local test
-LLM_PROVIDER=ollama LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion
 
-# cloud runtime
-LLM_PROVIDER=bedrock LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion
-```
+**Common optional**
+
+
+| Variable                                                                                               | Notes                                  |
+| ------------------------------------------------------------------------------------------------------ | -------------------------------------- |
+| `FRONTEND_URL`, `PORT`, `ENV`                                                                          | App/runtime tuning                     |
+| `LLM_PROVIDER`                                                                                         | Default `ollama`; switch with env      |
+| `LLM_MODEL`, `OLLAMA_BASE_URL`                                                                         | Model and Ollama endpoint              |
+| `YT_QUOTA_DAILY_BUDGET_UNITS`                                                                          | Caps YouTube quota usage               |
+| `NARR_EMBEDDING_BACKEND`                                                                               | Default `bedrock`                      |
+| `NARR_EMBEDDING_MODEL`                                                                                 | Default `amazon.titan-embed-text-v2:0` |
+| `NARR_EMBEDDING_DIMENSIONS`                                                                            | Default `512`                          |
+| `NARR_EMBEDDING_NORMALIZE`                                                                             | Default `true`                         |
+| `NARR_EMBEDDING_TIMEOUT`, `NARR_STRONG_MATCH`, `NARR_MULTI_LINK`, `NARR_NEW_MIN`, `NARR_MAX_PER_CLAIM` | Matching tuning knobs                  |
+
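To show how the `NARR_EMBEDDING_*` defaults in the table above could map onto a Bedrock call, here is a minimal sketch using `boto3`'s `bedrock-runtime` `invoke_model`. The function names are hypothetical and the code is not taken from `pipelines/narrative_matching.py`; it only assumes the Titan Text Embeddings V2 request fields `inputText`, `dimensions`, and `normalize`.

```python
import json
from typing import List

# Defaults mirror the environment-variable table above.
MODEL_ID = "amazon.titan-embed-text-v2:0"
DEFAULT_DIMENSIONS = 512
DEFAULT_NORMALIZE = True


def build_titan_request_body(
    text: str,
    dimensions: int = DEFAULT_DIMENSIONS,
    normalize: bool = DEFAULT_NORMALIZE,
) -> str:
    """Serialize the JSON body for a Titan Text Embeddings V2 request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )


def embed_text(text: str, region_name: str) -> List[float]:
    """Call Bedrock and return the embedding vector (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("bedrock-runtime", region_name=region_name)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_titan_request_body(text),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing a JSON document.
    return json.loads(response["body"].read())["embedding"]
```

`embed_text` only succeeds when the AWS credentials and model access described in this README are in place; `build_titan_request_body` can be exercised offline.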
+
+
+
+## Deployment Notes
+
+- `render.yaml` defines a basic Render deployment for the FastAPI service.
+
 
-#### LLM insight generation (claims + narratives)
+
+## Developer Notes
+
+### LLM Provider Mechanisms
+
+- `pipelines/yt_data_ingestion.py` uses `LLM_PROVIDER` + `LLM_MODEL` (`ollama` or `bedrock`). If `LLM_MODEL` is unset, ingestion defaults to `gemma2`.
+- `pipelines/llm_insight_generation.py` uses the same variables; if `LLM_MODEL` is unset, it defaults to `qwen3`.
+
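The fallback rules above can be sketched as a small resolver. This is an illustrative assumption about how the switch could be expressed, not the code in either pipeline; `resolve_llm_settings` and the dictionary keys are hypothetical names.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_PROVIDER = "ollama"
SUPPORTED_PROVIDERS = {"ollama", "bedrock"}

# Per-pipeline model defaults as stated in the notes above.
PIPELINE_DEFAULT_MODELS = {
    "yt_data_ingestion": "gemma2",
    "llm_insight_generation": "qwen3",
}


def resolve_llm_settings(
    pipeline: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, str]:
    """Return (provider, model) for a pipeline from environment variables."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", DEFAULT_PROVIDER).strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    # An unset or empty LLM_MODEL falls back to the pipeline's default.
    model = env.get("LLM_MODEL") or PIPELINE_DEFAULT_MODELS[pipeline]
    return provider, model
```

For example, `resolve_llm_settings("yt_data_ingestion", {})` falls back to the ingestion defaults, while setting `LLM_MODEL` overrides the model for both pipelines.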
+### Cloud Console Credentials
+
+- **Google Cloud**: enable *YouTube Data API v3*, create an API key, set `YOUTUBE_DATA_API_KEY`.
+- **AWS**: in your chosen region, request access to `amazon.titan-embed-text-v2:0` in the Bedrock model catalog and ensure the IAM principal can `bedrock:InvokeModel` for that model.
+
+### DB Migrations
+
+- Migrations live in `alembic/versions/`. Alembic reads the database URL from the `DATABASE_URL` environment variable (wired in `alembic/env.py`); `alembic.ini` intentionally omits the URL.
 
 ```bash
-python -m pipelines.llm_insight_generation
+alembic upgrade head
+alembic current
+alembic revision --autogenerate -m "describe your change"
 ```
 
-This reads Supabase transcripts that don’t yet have claims, extracts generalizable health-related claims, matches them to existing narratives via embeddings (AWS Bedrock Titan Text Embeddings V2), creates new narratives when needed, and writes back to Supabase (`claims`, `narratives`, `claim_narratives`).
-
-## Architecture
 
-### `app/` (FastAPI application)
 
-- `**app/main.py**`: FastAPI app, CORS, router mounting at `/api/v1`
-- `**app/api/**`: versioned API routes
-  - `app/api/v1/endpoints/`: endpoints such as `health`, `overview`, `claims`, `narratives`, `ingest`, etc.
-- `**app/core/**`: application config and infrastructure
-  - `app/core/config.py`: Pydantic settings; loads `.env`
-  - `app/core/database.py`: DB session wiring (used by API endpoints)
-- `**app/models/**`: SQLModel models representing DB tables (videos, claims, narratives, join tables, etc.)
-- `**app/schemas/**`: Pydantic response/request schemas for API responses
-- `**app/pipelines/**`: API-facing pipeline helpers (e.g. `yt_ingest.py` used by `/ingest/video`)
 
-### `pipelines/` (batch + ML/LLM pipelines)
 
-Standalone pipeline modules (typically run via `python -m ...`):
+### API Layout (`app/`)
 
-- `**pipelines/yt_data_ingestion.py**`: YouTube search + semantic filter + impact filter + persist to Supabase
-- `**pipelines/llm_insight_generation.py**`: claim extraction + narrative creation/linking + persist to Supabase
-- `**pipelines/narrative_matching.py**`: embedding + cosine similarity matching logic (AWS Bedrock Titan Text Embeddings V2 via `invoke_model`)
-- `**pipelines/shared/**`: shared pipeline utilities / interfaces
+For contributors who want a quick map of the API package:
 
-### `alembic/` (migrations)
+- `app/main.py` — FastAPI app, CORS, routers mounted at `/api/v1`
+- `app/api/v1/endpoints/` — Endpoints such as health, overview, claims, narratives, ingest, etc.
+- `app/core/config.py` — Settings + `.env` loading
+- `app/core/database.py` — Database sessions for routes
+- `app/models/` — SQLModel models (videos, claims, narratives, joins, etc.)
+- `app/schemas/` — Pydantic request/response shapes
 
-- `**alembic/env.py**`: migration runtime config (loads `DATABASE_URL`)
-- `**alembic/versions/**`: migration revisions
+### Data Pipeline Layout (`pipelines/`)
 
-### `scripts/` (developer utilities)
+- `pipelines/yt_data_ingestion.py` — Search + semantic filter + impact filter + transcript extraction
+- `pipelines/llm_insight_generation.py` — Claim extraction + narrative creation/linking
+- `pipelines/narrative_matching.py` — Embedding + cosine similarity (Bedrock Titan Text Embeddings V2 via `invoke_model`)
+- `pipelines/shared/` — Shared helpers, interfaces and dataclasses
 
-Convenience scripts for running parts of the pipeline locally:
 
-- `scripts/run_pipeline.py`: runs selected pipeline scripts
-- `scripts/setup-hooks.sh`: installs dev deps + enables git hooks
-- Other one-off analysis scripts (`sentiment_analysis.py`, `misinfo_checker.py`, etc.)
 
-## Deployment notes
+### Related Repositories
 
-- `render.yaml` contains a basic Render configuration for running the FastAPI service.
-- Ensure all required secrets are configured as environment variables in the deployment environment (do not rely on a checked-in `.env`).
+- [YouTube Intelligence Platform — Frontend](https://github.com/CS4485-Team-10/frontend)