From bee5853ced7cc638deefadc2e252ff6afa130747 Mon Sep 17 00:00:00 2001 From: svonava Date: Thu, 21 May 2026 21:04:22 -0700 Subject: [PATCH 1/3] feat(examples): vision-first document RAG with ColQwen2.5 + Florence-2-DocVQA A multi-tenant retrieval + QA example that keeps OCR out of the score path. Pages are encoded as images with ColQwen2.5, MaxSim ranks them via late interaction, and Florence-2-FT-DocVQA reads the top page to produce a textual answer. An optional Qwen3-VL-Reranker-2B second stage stays in the visual modality so layout cues survive both ranking stages. Exercises encode + extract (and score when enabled). Includes a synthetic 3-tenant corpus, a PIL renderer that turns each entry into a PNG, a FastAPI server, and a minimal UI that shows the page image alongside the answer. --- examples/README.md | 1 + examples/vision-doc-rag/.gitignore | 6 + examples/vision-doc-rag/README.md | 209 +++++++++++++++ examples/vision-doc-rag/config.yaml | 43 ++++ examples/vision-doc-rag/data/fetch_dataset.py | 211 +++++++++++++++ examples/vision-doc-rag/data/render_pages.py | 106 ++++++++ examples/vision-doc-rag/python/ingest.py | 119 +++++++++ .../vision-doc-rag/python/requirements.txt | 6 + examples/vision-doc-rag/python/search.py | 243 ++++++++++++++++++ examples/vision-doc-rag/python/server.py | 96 +++++++ examples/vision-doc-rag/static/index.html | 190 ++++++++++++++ 11 files changed, 1230 insertions(+) create mode 100644 examples/vision-doc-rag/.gitignore create mode 100644 examples/vision-doc-rag/README.md create mode 100644 examples/vision-doc-rag/config.yaml create mode 100644 examples/vision-doc-rag/data/fetch_dataset.py create mode 100644 examples/vision-doc-rag/data/render_pages.py create mode 100644 examples/vision-doc-rag/python/ingest.py create mode 100644 examples/vision-doc-rag/python/requirements.txt create mode 100644 examples/vision-doc-rag/python/search.py create mode 100644 examples/vision-doc-rag/python/server.py create mode 100644 
examples/vision-doc-rag/static/index.html diff --git a/examples/README.md b/examples/README.md index cb92870e..2f80dc5f 100644 --- a/examples/README.md +++ b/examples/README.md @@ -18,6 +18,7 @@ service keys. | [Build a multimodal wine recommender with OCR](./wine-recommender) | Combining preference-based retrieval with OCR-driven label detection in one UI | `encode`, `score`, `extract` | Docker Compose app plus local SIE endpoint; API key optional for unauthenticated SIE | Runnable demo | | [Build a multi-modal product classifier with embeddings](./taxonomy-classification) | Evaluating text, image, NLI, and reranking approaches for hierarchical product taxonomy classification | `extract`, `encode`, `score` | SIE endpoint, Shopify dataset prep via `uv run` scripts, standalone `uv` project | Runnable evaluation example | | [Swap an OCR model with one identifier change](./document-ocr) | Driving recognition (VLM-OCR), structured extraction (Donut), and zero-shot NER (GLiNER) through the same `extract` call by swapping the model ID | `extract` | Docker Compose plus Node UI, no API key required, hosted version on [Hugging Face Spaces](https://huggingface.co/spaces/superlinked/document-ocr) | Runnable demo | +| [Vision-first document RAG](./vision-doc-rag) | Retrieving and answering questions over a multi-tenant page corpus by looking at page images, with OCR kept out of the score path | `encode`, `extract`, `score` (optional) | SIE endpoint with a GPU recommended for ColQwen2.5 + Florence-2-DocVQA | Runnable demo | For docs publishing, lead with the quickest runnable demos, then use the benchmark and evaluation examples for deeper technical users. 
diff --git a/examples/vision-doc-rag/.gitignore b/examples/vision-doc-rag/.gitignore new file mode 100644 index 00000000..a787e920 --- /dev/null +++ b/examples/vision-doc-rag/.gitignore @@ -0,0 +1,6 @@ +.venv/ +__pycache__/ +data/pages.json +data/pages/ +data/multivectors.npz +data/metadata.json diff --git a/examples/vision-doc-rag/README.md b/examples/vision-doc-rag/README.md new file mode 100644 index 00000000..f179051c --- /dev/null +++ b/examples/vision-doc-rag/README.md @@ -0,0 +1,209 @@ +# Vision-first document RAG + +Retrieve by image, answer by image. ColQwen2.5 reads each page as a picture +and ranks them via late interaction; Florence-2-DocVQA reads the winning +page and produces the textual answer. OCR never enters the score path, so +charts, screenshots, tables, and any other layout cue that would die in a +text round-trip still drives ranking. Everything runs on one SIE endpoint. + +Each page also carries a `client` tag, so the same corpus serves multiple +tenants from one index — queries scoped to `acme-corp` cannot retrieve a +`globex` page, no separate index per tenant required. + +## SIE features used + +- `encode` — `vidore/colqwen2.5-v0.2` on page images at ingest and on the + query text at search time. Output is a `[tokens, 128]` multivector. Late + interaction (`sie_sdk.scoring.maxsim`) is the only ranking signal. +- `extract` — `mynkchaudhry/Florence-2-FT-DocVQA`. Called twice, with two + jobs: with `instruction` set to the query to get a textual answer for the + top page, and without `instruction` to OCR the same page for a display + snippet. The OCR snippet is UX-only — it never enters the score path. +- `score` *(optional)* — `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank + over `(query text, page image)`. Off by default while we wait for an + upstream adapter fix; flip `search.visual_rerank: true` in `config.yaml` + to enable it on a cluster that's ready. 
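The late-interaction ranking and tenant scoping described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `sie_sdk.scoring.maxsim` implementation (this patch does not show its internals): `dot`, `maxsim_score`, `rank_pages`, the 2-dimensional toy vectors, and the page IDs are all hypothetical stand-ins for the real `[tokens, 128]` multivectors.

```python
# Hypothetical sketch of MaxSim late interaction with a tenant filter.
# Real multivectors are [tokens, 128]; 2-dim toy vectors keep this readable.

def dot(a, b):
    """Dot product of two equal-length token vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, page_tokens):
    """For each query token take its best-matching page token, then sum."""
    return sum(max(dot(q, p) for p in page_tokens) for q in query_tokens)


def rank_pages(query_tokens, pages, client=None):
    """Filter by tenant first, then rank the surviving pages by MaxSim."""
    scoped = [p for p in pages if client is None or p["client"] == client]
    scored = [(maxsim_score(query_tokens, p["tokens"]), p["page_id"]) for p in scoped]
    return sorted(scored, reverse=True)


# Toy data (assumed for illustration only).
QUERY = [[1.0, 0.0], [0.0, 1.0]]
PAGES = [
    {"client": "acme-corp", "page_id": "page-a", "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"client": "acme-corp", "page_id": "page-b", "tokens": [[0.5, 0.5]]},
    {"client": "globex", "page_id": "page-c", "tokens": [[1.0, 1.0]]},
]

if __name__ == "__main__":
    for score, page_id in rank_pages(QUERY, PAGES, client="acme-corp"):
        print(page_id, score)
```

Filtering by `client` before scoring, rather than after, means out-of-tenant pages never receive a score at all, which is the isolation property the example relies on.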
+ +## Why vision end-to-end + +OCR-then-text-rerank throws away the exact signal we pick ColQwen for — +charts, screenshots, tables, callouts, and the spatial layout that tells +a wiki page apart from a checklist. The rerank stays visual or doesn't +happen. The OCR step shows on-screen text next to the page image so the +user can copy/paste from the result, nothing more. + +## Multi-tenant by construction + +Every page carries a `client` field in `data/pages.json`. The metadata list +loaded by `python/search.py` is filtered by `client_name` before MaxSim +runs, so a query scoped to `acme-corp` cannot retrieve a `globex` page. +Real deployments would push `client` down into the multivector store's +filter expression; the demo keeps everything in memory because the corpus +is tiny. + +## Run it + +You need Python 3.12 and a reachable SIE cluster (or local `docker run`). + +```bash +# 1. SIE locally (or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster). +docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default + +# 2. Generate the synthetic corpus and render each page to a PNG. +cd examples/vision-doc-rag +pip install -r python/requirements.txt +python data/fetch_dataset.py +python data/render_pages.py + +# 3. Encode every page with ColQwen2.5 and save the multivectors. +python python/ingest.py + +# 4a. CLI demo — runs four scoped queries and prints results. +python python/search.py + +# 4b. Or start the UI. +uvicorn --app-dir python server:app --port 8888 +open http://localhost:8888 +``` + +First run on a cold cluster pays a one-time model load: ColQwen2.5 and +Florence-2 are both several GB, expect roughly a minute on CPU and a few +seconds on GPU before the warm path kicks in. + +### Pointing at a managed cluster + +```bash +export SIE_CLUSTER_URL="https://your-cluster-host:8080" +export SIE_API_KEY="SL-..." 
+``` + +The defaults in `config.yaml` point at `http://localhost:8080` so the env +vars only matter when you're hitting something remote. Set `cluster.gpu` +to a profile name like `l4-spot` if the cluster needs an explicit GPU +class. + +## Try these queries + +| Tenant | Query | Why it's interesting | +|---|---|---| +| `acme-corp` | how do I sign in to the VPN? | Visual layout match — the page is titled "VPN setup for new engineers" with a bulleted body, and ColQwen2.5 picks it without keyword overlap with "sign in". DocVQA reads the page and answers with the client name and the auth method. | +| `globex` | what is the parental leave policy? | Disambiguates from "time off" — the right page mentions parental leave only halfway down the body. The textual answer cites the week count. | +| `initech` | audit prep evidence and walkthroughs | All three Initech pages are compliance-flavored; the visual model breaks the tie by reading the checklist layout. | +| `globex` | how do I sign in to the VPN? | Tenant filter — even though the same query hit acme-corp earlier, scoping to globex returns the closest globex page (Wi-Fi guide) and never leaks acme content. | + +## API + +### `GET /api/search` + +| Parameter | Required | Description | +|---|---|---| +| `q` | yes | Search query | +| `client` | no | Tenant filter (e.g. `acme-corp`). Omitted ⇒ search runs across all tenants. 
| + +```bash +curl "http://localhost:8888/api/search?q=how+do+I+sign+in+to+the+VPN&client=acme-corp" +``` + +```json +{ + "query": "how do I sign in to the VPN", + "client": "acme-corp", + "answer": "Okta credentials with Duo Push for 2FA", + "timings": { + "encode_query_s": 0.12, + "maxsim_s": 0.003, + "docvqa_s": 0.91, + "ocr_snippet_s": 0.84 + }, + "results": [ + { + "page_id": "ACME-101", + "client": "acme-corp", + "title": "VPN setup for new engineers", + "space": "Engineering", + "author": "alice@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", + "page_image": "/pages/ACME-101.png", + "ocr_snippet": "VPN Setup for New Engineers · ...", + "scores": { "maxsim": 14.44, "rerank": null } + } + ] +} +``` + +### `GET /api/clients`, `GET /api/stats` + +Tenant list and runtime config (active models, rerank on/off, page count). + +## How it works + +``` + ┌──────────────────────────────────────────────────────────────┐ + │ ingest.py (once per corpus) │ + │ pages.json ─▶ render_pages.py ─▶ data/pages/*.png │ + │ ─▶ SIE.encode(ColQwen2.5, images, multivector) │ + │ ─▶ data/multivectors.npz + data/metadata.json │ + └──────────────────────────────────────────────────────────────┘ + │ + ▼ + ┌──────────────────────────────────────────────────────────────┐ + │ search.py / server.py (per query) │ + │ q ─▶ SIE.encode(ColQwen2.5, text, is_query=True) │ + │ ─▶ filter metadata by tenant │ + │ ─▶ sie_sdk.scoring.maxsim → top_k_candidates │ + │ ─▶ [optional] SIE.score(Qwen3-VL-Reranker, q, images) │ + │ ─▶ SIE.extract(Florence-2-DocVQA, instruction=q, │ + │ images=[top_page]) ⇒ textual answer │ + │ ─▶ SIE.extract(Florence-2-DocVQA, images=[top_page]) │ + │ ⇒ OCR snippet (UI) │ + └──────────────────────────────────────────────────────────────┘ +``` + +OCR is never on the score path. The visual reranker (when enabled) ranks +over the same modality as retrieval, so layout cues survive both stages. + +The corpus is small enough that MaxSim runs in Python. 
For thousands of +pages, hand the multivectors to LanceDB or Vespa; only the SIE calls stay +the same. + +## Customize + +`config.yaml` is the single tuning surface: + +```yaml +models: + retriever: "vidore/colqwen2.5-v0.2" # smaller: vidore/colpali-v1.3-hf + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + reranker: "Qwen/Qwen3-VL-Reranker-2B" # used only when search.visual_rerank: true +search: + top_k_candidates: 5 + top_k_results: 3 + visual_rerank: false + answer: true + ocr_snippet: true +``` + +Swap any model for another from the +[SIE model catalog](https://superlinked.com/models) and the pipeline keeps +working. + +## Project layout + +```text +examples/vision-doc-rag/ +├── config.yaml +├── data/ +│ ├── fetch_dataset.py # synthetic 3-tenant page corpus +│ ├── render_pages.py # pages.json → PNG screenshots +│ ├── pages.json # generated +│ ├── pages/ # generated PNGs +│ ├── metadata.json # generated by ingest +│ └── multivectors.npz # generated by ingest +├── python/ +│ ├── ingest.py +│ ├── search.py +│ ├── server.py +│ └── requirements.txt +└── static/ + └── index.html +``` diff --git a/examples/vision-doc-rag/config.yaml b/examples/vision-doc-rag/config.yaml new file mode 100644 index 00000000..8b35ffda --- /dev/null +++ b/examples/vision-doc-rag/config.yaml @@ -0,0 +1,43 @@ +# SIE server (defaults to local Docker: docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default). +# Override with SIE_CLUSTER_URL / SIE_API_KEY env vars when targeting a managed cluster. +cluster: + url: "http://localhost:8080" + api_key: "" + gpu: "" # only set for managed multi-GPU clusters (e.g. "l4-spot"); ignored locally + provision_timeout_s: 600 + +# Models. The retrieval signal is vision end-to-end: ColQwen2.5 reads each page +# as an image and we late-interact (MaxSim) against the same model's text-side +# embedding of the query. 
No OCR is involved in ranking, so charts, screenshots, +# tables, and any other layout cue that wouldn't survive an OCR round-trip +# still contributes to the score. +# +# DocVQA produces a textual answer for the top page. The model takes the page +# image + the user's question (passed via `instruction`) and returns the answer +# as an entity in the response — no separate LLM call needed. +models: + retriever: "vidore/colqwen2.5-v0.2" + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + # Optional second-stage cross-encoder rerank. Visual model so we don't have to + # collapse the page through OCR before reranking. Disabled by default while + # we wait for the cluster-side adapter bug to land: + # https://github.com/superlinked/sie-internal/issues/1026 + # Re-enable with search.visual_rerank: true once that ships. + reranker: "Qwen/Qwen3-VL-Reranker-2B" + +# Page rendering (used by data/render_pages.py to turn the synthetic page +# corpus into PNGs; replace with pdf2image, screenshots, or your own files +# for a real deployment). +render: + width: 1024 + height: 1280 + body_font_size: 20 + title_font_size: 30 + +# Retrieval +search: + top_k_candidates: 5 # how many pages survive MaxSim + top_k_results: 3 # how many pages return after optional rerank + visual_rerank: false # see models.reranker note above + answer: true # run DocVQA on the top page for a textual answer + ocr_snippet: true # OCR the top page for a display-only snippet in the UI diff --git a/examples/vision-doc-rag/data/fetch_dataset.py b/examples/vision-doc-rag/data/fetch_dataset.py new file mode 100644 index 00000000..eb901a6c --- /dev/null +++ b/examples/vision-doc-rag/data/fetch_dataset.py @@ -0,0 +1,211 @@ +"""Synthetic multi-tenant page corpus. + +Three fictional clients, each with a handful of pages — engineering runbooks, +HR policies, finance procedures. Small enough to encode in a minute on a warm +GPU cluster, varied enough to make multi-tenant filtering and visual retrieval +meaningful. 
Replace `PAGES` with your own pages (wiki export, Notion dump, +PDF batch, etc.) to point the demo at real content. +""" + +import json +from pathlib import Path + +PAGES = [ + # ── acme-corp: engineering ──────────────────────────────────────────── + { + "client": "acme-corp", + "page_id": "ACME-101", + "title": "VPN setup for new engineers", + "space": "Engineering", + "author": "alice@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", + "body": [ + "All engineers need to connect through the corporate VPN to reach internal services.", + "We use Cisco AnyConnect on macOS and Windows, and the OpenConnect CLI on Linux.", + "Download the client from it.acme.com/vpn, then sign in with your Okta credentials.", + "Two-factor confirmation goes through Duo Push.", + "If you hit a TLS error on first connection, check that the device certificate from Jamf is installed.", + "For on-call rotations, request the always-on VPN profile from IT — it auto-reconnects after suspend.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-102", + "title": "On-call rotation and paging", + "space": "Engineering", + "author": "bob@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/102", + "body": [ + "Engineering on-call runs Monday to Monday handovers at 10:00 PT.", + "Primary takes the pager, secondary takes the laptop, both are paid the on-call stipend.", + "Pages route through PagerDuty; the escalation policy is primary -> secondary (15 min) -> manager.", + "During an incident open a Zoom bridge and a Slack channel named #inc-YYYYMMDD-summary.", + "Postmortems are due within five working days and live in the Incidents space.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-103", + "title": "Deploying to production with our CI/CD pipeline", + "space": "Engineering", + "author": "carol@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/103", + "body": [ + "We use GitHub Actions for CI and ArgoCD for delivery 
to Kubernetes.", + "Merging to main triggers a build, runs the test suite, pushes an image to ECR, and updates the staging manifest.", + "Production rollouts are gated by a manual approval in ArgoCD and require two reviewers from the service team.", + "Use the rolling strategy with maxSurge=25% by default.", + "Hotfix tags follow the pattern v1.2.3-hotfix.N and skip staging only with on-call approval recorded in the PR.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-104", + "title": "Local development setup", + "space": "Engineering", + "author": "dan@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/104", + "body": [ + "Install mise to manage runtimes — it pins Node, Python, and Go versions per repo.", + "Run `mise install` in the repo root, then `make dev` to spin up Postgres, Redis, and the API gateway in Docker.", + "The seed data covers the last 30 days of staging traffic, sanitized of PII.", + "If port 5432 is already taken, override DEV_PG_PORT in your shell profile.", + ], + }, + # ── globex: HR and admin ────────────────────────────────────────────── + { + "client": "globex", + "page_id": "GLOBEX-201", + "title": "Time off and vacation policy", + "space": "HR", + "author": "hr@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/201", + "body": [ + "Globex offers 25 working days of paid vacation per year, accruing monthly from the start date.", + "Requests go through Workday at least two weeks in advance for absences longer than three days.", + "Sick leave is separate and uncapped, but anything over three consecutive days requires a doctor's note.", + "Parental leave is 18 weeks at full pay for the primary caregiver and 6 weeks for the secondary, regardless of gender.", + "Unused vacation rolls over up to 10 days into the next calendar year; the rest is paid out.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-202", + "title": "Expense reports and reimbursement", + "space": "HR", + "author": 
"finance@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/202", + "body": [ + "Submit expenses in Expensify within 30 days of the transaction.", + "Receipts are mandatory for any item over $25; below that, a description and category are enough.", + "Travel bookings should go through Navan when possible — direct bookings need pre-approval from your manager.", + "Reimbursements process every Friday and land in your payroll account the following Tuesday.", + "Per diem for international travel is $80 USD equivalent for meals.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-203", + "title": "Office perks and meals", + "space": "HR", + "author": "office@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/203", + "body": [ + "Lunch is catered Monday through Thursday in the main cafe from 12:00 to 14:00.", + "There are always vegetarian, vegan, and gluten-free options labeled at the buffet.", + "Friday is a free-lunch credit you can spend at any partner restaurant in the office app.", + "Snacks and drinks in the micro-kitchens are unlimited; please refill empty trays.", + "The wellness stipend is $100 per month, claimable in Expensify under category Wellness.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-204", + "title": "Office Wi-Fi and guest network", + "space": "IT", + "author": "it@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/IT/pages/204", + "body": [ + "Connect to Globex-Corp for the employee network; sign in with your @globex.com SSO.", + "Globex-Guest is for visitors — the rotating daily password is on the lobby screen.", + "Printing requires the Globex-Print network and a one-time pairing with your laptop using the Mobility Print app.", + "If your laptop will not join, forget the network and rejoin; the cert is renewed weekly and old caches get stuck.", + ], + }, + # ── initech: finance and compliance ─────────────────────────────────── + { + "client": "initech", + "page_id": 
"INIT-301", + "title": "SOX controls and quarterly attestation", + "space": "Compliance", + "author": "compliance@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/301", + "body": [ + "Initech is subject to SOX 404 reporting for financial controls over revenue, expense, and access management.", + "Every quarter, control owners attest in AuditBoard that their controls operated as designed.", + "Evidence is automatically collected from Workday, NetSuite, and Okta where possible; manual evidence goes in the AuditBoard Drive folder.", + "External auditors test a sample of controls in Q3; expect requests for screenshots and approver lists.", + "Exceptions must be logged within five business days of detection.", + ], + }, + { + "client": "initech", + "page_id": "INIT-302", + "title": "Vendor onboarding and due diligence", + "space": "Procurement", + "author": "procurement@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/302", + "body": [ + "New vendors above $50,000 annual spend require a security review and a SOC 2 Type II report on file.", + "Submit the vendor questionnaire through Vanta; legal will review the MSA within five business days.", + "Payment terms default to Net 60; faster terms require CFO approval and reduce the risk score in NetSuite.", + "Sanctioned-country checks run automatically via the OFAC integration; any hit halts the workflow until cleared.", + "Annual recertification of high-risk vendors happens every January.", + ], + }, + { + "client": "initech", + "page_id": "INIT-303", + "title": "Audit prep checklist", + "space": "Compliance", + "author": "audit@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/303", + "body": [ + "Two weeks before the auditors arrive, freeze the control population in AuditBoard and export the evidence index.", + "Confirm with control owners that they will be available for walkthrough interviews — block 60 minutes in their calendars.", + 
"Pull the user access review reports for the prior two quarters from Okta and confirm sign-off in writing.", + "Have the change management JIRA queries ready: filter by label sox-relevant and status Done.", + "If a control failed mid-period, document the compensating control and the date the gap was closed.", + ], + }, + { + "client": "initech", + "page_id": "INIT-304", + "title": "Procurement card limits and exceptions", + "space": "Procurement", + "author": "procurement@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/304", + "body": [ + "Procurement cards (P-cards) have a default monthly limit of $5,000 and a single-transaction limit of $1,500.", + "Use them for low-dollar, low-risk purchases — software subscriptions and conference tickets are the common cases.", + "Limit-increase requests need manager and CFO approval and a documented business need.", + "Personal use, cash advances, and split transactions to bypass the single-transaction limit are policy violations.", + "All P-card transactions reconcile in Coupa within 14 days of statement close.", + ], + }, +] + + +def main(): + out = Path(__file__).resolve().parent / "pages.json" + out.write_text(json.dumps(PAGES, indent=2)) + by_client = {} + for p in PAGES: + by_client[p["client"]] = by_client.get(p["client"], 0) + 1 + print(f"Wrote {len(PAGES)} pages to {out}") + for client, n in sorted(by_client.items()): + print(f" {client}: {n} pages") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/data/render_pages.py b/examples/vision-doc-rag/data/render_pages.py new file mode 100644 index 00000000..4043d71b --- /dev/null +++ b/examples/vision-doc-rag/data/render_pages.py @@ -0,0 +1,106 @@ +"""Render the synthetic pages to PNG screenshots. + +Each entry in pages.json becomes one image in data/pages/.png. 
The +layout is intentionally plain — a title, a metadata line, and a body block — +so ColQwen2.5 sees the same kind of visual structure it would in real wikis, +docs, or PDFs. Replace this script with `pdf2image` (or screenshots) when +pointing at real content. +""" + +import json +import sys +from pathlib import Path + +import yaml +from PIL import Image, ImageDraw, ImageFont + + +def _font(size: int): + """Try the platform Helvetica, fall back to PIL's default bitmap font.""" + for path in [ + "/System/Library/Fonts/Helvetica.ttc", + "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", + "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf", + ]: + if Path(path).exists(): + return ImageFont.truetype(path, size) + return ImageFont.load_default() + + +def _wrap(text: str, font: ImageFont.ImageFont, max_width: int) -> list[str]: + """Greedy word wrap so body paragraphs fit the page width.""" + lines: list[str] = [] + for paragraph in text.split("\n"): + words = paragraph.split() + current = "" + for word in words: + candidate = f"{current} {word}".strip() + if font.getlength(candidate) <= max_width: + current = candidate + else: + if current: + lines.append(current) + current = word + if current: + lines.append(current) + return lines + + +def render_page(page: dict, width: int, height: int, body_size: int, title_size: int) -> Image.Image: + img = Image.new("RGB", (width, height), "white") + draw = ImageDraw.Draw(img) + title_font = _font(title_size) + meta_font = _font(int(body_size * 0.9)) + body_font = _font(body_size) + + margin = 48 + cursor_y = margin + draw.text((margin, cursor_y), page["title"], fill="black", font=title_font) + cursor_y += int(title_size * 1.6) + meta = f"{page['space']} · {page['author']} · {page['page_id']}" + draw.text((margin, cursor_y), meta, fill=(96, 96, 96), font=meta_font) + cursor_y += int(title_size * 1.2) + draw.line([(margin, cursor_y), (width - margin, cursor_y)], fill=(200, 200, 200), width=2) + cursor_y += 
int(body_size * 1.2) + + max_text_width = width - 2 * margin + line_gap = int(body_size * 1.5) + for bullet in page["body"]: + # Render each body line as a wrapped paragraph block. + lines = _wrap(bullet, body_font, max_text_width) + for line in lines: + draw.text((margin, cursor_y), line, fill="black", font=body_font) + cursor_y += line_gap + cursor_y += int(line_gap * 0.4) # paragraph spacing + + return img + + +def main(): + here = Path(__file__).resolve().parent + pages_path = here / "pages.json" + if not pages_path.exists(): + print("pages.json not found; run fetch_dataset.py first", file=sys.stderr) + sys.exit(1) + config = yaml.safe_load((here.parent / "config.yaml").read_text()) + render = config["render"] + out_dir = here / "pages" + out_dir.mkdir(exist_ok=True) + + pages = json.loads(pages_path.read_text()) + for p in pages: + img = render_page( + p, + width=render["width"], + height=render["height"], + body_size=render["body_font_size"], + title_size=render["title_font_size"], + ) + out = out_dir / f"{p['page_id']}.png" + img.save(out) + print(f" {p['client']:10s} {p['page_id']:10s} -> {out.relative_to(here.parent)}") + print(f"Rendered {len(pages)} pages to {out_dir}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/ingest.py b/examples/vision-doc-rag/python/ingest.py new file mode 100644 index 00000000..15607f30 --- /dev/null +++ b/examples/vision-doc-rag/python/ingest.py @@ -0,0 +1,119 @@ +"""Build the per-tenant visual index. + +For every page PNG we ask SIE to encode the image with vidore/colqwen2.5-v0.2, +which returns a [tokens, 128] multivector. Each page's multivector goes into a +single .npz on disk, alongside a metadata.json that keeps the client name, +page id, title, and source url for routing and filtering at query time. + +There is no vector database here. MaxSim at the scale of one team's wiki +(hundreds to thousands of pages) is cheap and avoids the indexing step. 
+For larger corpora swap the .npz for a multivector store (LanceDB, Vespa, +Turbopuffer); the encode call is the same. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_pages(): + pages_path = Path(__file__).resolve().parent.parent / "data" / "pages.json" + if not pages_path.exists(): + raise FileNotFoundError( + "data/pages.json not found. Run `python data/fetch_dataset.py` " + "and `python data/render_pages.py` first." + ) + return json.loads(pages_path.read_text()) + + +def encode_pages(client: SIEClient, model: str, pages: list[dict], gpu: str, timeout: float): + pages_dir = Path(__file__).resolve().parent.parent / "data" / "pages" + multivectors: list[np.ndarray] = [] + metadata: list[dict] = [] + + for i, page in enumerate(pages, 1): + image_path = pages_dir / f"{page['page_id']}.png" + if not image_path.exists(): + raise FileNotFoundError(f"Missing page image: {image_path}. 
Run data/render_pages.py.") + + start = time.time() + result = client.encode( + model, + Item(id=page["page_id"], images=[str(image_path)]), + output_types=["multivector"], + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + elapsed = time.time() - start + mv = result["multivector"].astype(np.float32) + multivectors.append(mv) + metadata.append( + { + "page_id": page["page_id"], + "client": page["client"], + "title": page["title"], + "space": page["space"], + "author": page["author"], + "web_url": page["web_url"], + "image_path": str(image_path.relative_to(image_path.parent.parent.parent)), + "num_tokens": int(mv.shape[0]), + } + ) + print(f" [{i}/{len(pages)}] {page['page_id']:10s} {page['client']:10s} {mv.shape} in {elapsed:.1f}s") + + return multivectors, metadata + + +def main(): + config = load_config() + pages = load_pages() + print(f"Loaded {len(pages)} pages") + + cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"]) + api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) + gpu = config["cluster"]["gpu"] + timeout = config["cluster"]["provision_timeout_s"] + model = config["models"]["retriever"] + + print(f"\n--- Encoding pages with {model} ---") + with SIEClient(cluster_url, api_key=api_key) as client: + multivectors, metadata = encode_pages(client, model, pages, gpu, timeout) + + data_dir = Path(__file__).resolve().parent.parent / "data" + # np.savez stores variable-length multivectors as one entry per array; we + # key them by page_id so the search side can reload without an extra index. 
+ np.savez( + data_dir / "multivectors.npz", + **{m["page_id"]: mv for m, mv in zip(metadata, multivectors)}, + ) + (data_dir / "metadata.json").write_text(json.dumps(metadata, indent=2)) + + total_tokens = sum(m["num_tokens"] for m in metadata) + by_client: dict[str, int] = {} + for m in metadata: + by_client[m["client"]] = by_client.get(m["client"], 0) + 1 + + print(f"\n Saved {len(metadata)} multivectors to data/multivectors.npz") + print(f" Saved metadata to data/metadata.json") + print(f" Total visual tokens: {total_tokens}") + print(" Pages per tenant:") + for client_name in sorted(by_client): + print(f" {client_name}: {by_client[client_name]}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/requirements.txt b/examples/vision-doc-rag/python/requirements.txt new file mode 100644 index 00000000..bd32dcbc --- /dev/null +++ b/examples/vision-doc-rag/python/requirements.txt @@ -0,0 +1,6 @@ +sie-sdk==0.1.10 +fastapi>=0.115.0 +uvicorn>=0.30.0 +numpy>=1.26.0 +pyyaml>=6.0 +Pillow>=10.3.0 diff --git a/examples/vision-doc-rag/python/search.py b/examples/vision-doc-rag/python/search.py new file mode 100644 index 00000000..52dd2211 --- /dev/null +++ b/examples/vision-doc-rag/python/search.py @@ -0,0 +1,243 @@ +"""Visual document search + question answering, vision end-to-end. + +Pipeline per query: + 1. encode(ColQwen2.5, text) — query multivector + 2. sie_sdk.scoring.maxsim — late interaction against page images + 3. score(Qwen3-VL-Reranker, query, images) — optional, off by default + 4. extract(Florence-2-FT-DocVQA, instruction=query, images=[top page]) + — textual answer + citation + 5. extract(Florence-2-FT-DocVQA, images=[top page]) + — OCR snippet for the UI (display only, + NOT in the ranking path) + +The ranking is decided by a vision model looking at the page image, so charts, +screenshots, tables, and any other visual signal that OCR would erase still +contributes. 
OCR runs only on the chosen page, only to provide on-screen text +the user can read or copy. + +Multi-tenant isolation is a Python filter on metadata before MaxSim, so a +query scoped to one client never sees another client's pages. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.scoring import maxsim +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_index(): + data_dir = Path(__file__).resolve().parent.parent / "data" + if not (data_dir / "multivectors.npz").exists(): + raise FileNotFoundError("data/multivectors.npz missing. Run `python python/ingest.py` first.") + npz = np.load(data_dir / "multivectors.npz") + metadata = json.loads((data_dir / "metadata.json").read_text()) + multivectors = {m["page_id"]: npz[m["page_id"]] for m in metadata} + return multivectors, metadata + + +def _ocr_snippet(entities: list[dict], max_chars: int = 400) -> str: + """Concatenate OCR text regions into a single readable snippet.""" + pieces = [] + for e in entities or []: + text = (e.get("text") or "").replace("", "").strip() + if text: + pieces.append(text) + joined = " · ".join(pieces) + if len(joined) > max_chars: + return joined[: max_chars - 1] + "…" + return joined + + +def _docvqa_answer(entities: list[dict]) -> str: + """Pick the answer string out of a Florence-2 DocVQA response. + + Florence-2 returns the answer as an entity (often the single one when the + `` task token is dispatched). We take the first non-empty text. 
+ """ + for e in entities or []: + text = (e.get("text") or "").replace("", "").strip() + if text: + return text + return "" + + +def search( + client: SIEClient, + config: dict, + multivectors: dict[str, np.ndarray], + metadata: list[dict], + query: str, + client_filter: str | None = None, +) -> dict: + gpu = config["cluster"]["gpu"] + timeout = config["cluster"]["provision_timeout_s"] + top_k_candidates = config["search"]["top_k_candidates"] + top_k_results = config["search"]["top_k_results"] + do_visual_rerank = config["search"].get("visual_rerank", False) + do_answer = config["search"].get("answer", True) + do_ocr_snippet = config["search"].get("ocr_snippet", True) + + corpus = [m for m in metadata if not client_filter or m["client"] == client_filter] + if not corpus: + return {"results": [], "answer": None, "timings": {}} + + timings: dict[str, float] = {} + pages_root = Path(__file__).resolve().parent.parent / "data" + + # 1. Encode query (text side of ColQwen2.5). + t0 = time.time() + q_result = client.encode( + config["models"]["retriever"], + Item(text=query), + output_types=["multivector"], + is_query=True, + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + timings["encode_query_s"] = round(time.time() - t0, 3) + query_mv = q_result["multivector"].astype(np.float32) + + # 2. MaxSim against in-memory multivectors. + doc_mvs = [multivectors[m["page_id"]] for m in corpus] + t0 = time.time() + maxsim_scores = maxsim(query_mv, doc_mvs) + timings["maxsim_s"] = round(time.time() - t0, 3) + + order = np.argsort(maxsim_scores)[::-1][:top_k_candidates] + candidates: list[dict] = [] + for idx in order: + c = dict(corpus[idx]) + c["_maxsim_score"] = float(maxsim_scores[idx]) + c["_rerank_score"] = None + candidates.append(c) + + # 3. Optional visual rerank. Image-in cross-encoder so OCR never enters the + # ranking path. Disabled by default — see config.yaml for the cluster + # bug we're waiting on. 
+ if do_visual_rerank and candidates: + try: + t0 = time.time() + rerank_items = [ + Item(id=c["page_id"], images=[str(pages_root / c["image_path"])]) + for c in candidates + ] + rerank = client.score( + config["models"]["reranker"], + Item(text=query), + rerank_items, + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + timings["visual_rerank_s"] = round(time.time() - t0, 3) + rerank_by_id = {s["item_id"]: s for s in rerank["scores"]} + for c in candidates: + s = rerank_by_id.get(c["page_id"]) + c["_rerank_score"] = float(s["score"]) if s else 0.0 + candidates.sort(key=lambda c: c["_rerank_score"] or 0.0, reverse=True) + except Exception as exc: + # Cluster adapter bug fallback: keep MaxSim ordering, surface the + # failure to the caller. See sie-internal#1026. + timings["visual_rerank_error"] = type(exc).__name__ + + results = candidates[:top_k_results] + + # 4. DocVQA answer from the top page image. instruction= goes in as the + # plain question; the adapter prepends Florence-2's `` task + # token. See superlinked.com/docs/extract/vision. + answer = None + if do_answer and results: + top = results[0] + try: + t0 = time.time() + qa = client.extract( + config["models"]["docvqa"], + Item(images=[str(pages_root / top["image_path"])]), + instruction=query, + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + timings["docvqa_s"] = round(time.time() - t0, 3) + answer = _docvqa_answer(qa[0].get("entities", []) if qa else []) + except Exception as exc: + timings["docvqa_error"] = type(exc).__name__ + + # 5. OCR snippet for display — only on the top result so users see the + # text on the page they're being shown. Never used as a ranking signal. 
+ if do_ocr_snippet and results: + top = results[0] + try: + t0 = time.time() + ocr = client.extract( + config["models"]["docvqa"], # same model, no `instruction` ⇒ OCR mode + Item(images=[str(pages_root / top["image_path"])]), + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + timings["ocr_snippet_s"] = round(time.time() - t0, 3) + top["ocr_snippet"] = _ocr_snippet(ocr[0].get("entities", []) if ocr else []) + except Exception as exc: + timings["ocr_snippet_error"] = type(exc).__name__ + + return {"results": results, "answer": answer, "timings": timings} + + +def print_run(out: dict, query: str, client_filter: str | None): + scope = client_filter or "all clients" + print(f'\n Query: "{query}" ({scope})') + print(f" Timings: {out['timings']}") + if out["answer"]: + print(f"\n Answer: {out['answer']}") + if not out["results"]: + print(" No results.") + return + for i, r in enumerate(out["results"], 1): + rerank = r.get("_rerank_score") + rerank_str = f"rerank={rerank:.4f}" if rerank is not None else "rerank=—" + print(f"\n {i}. [{r['client']}] {r['title']}") + print(f" {r['page_id']} · {r['space']} · {r['author']}") + print(f" maxsim={r['_maxsim_score']:.3f} {rerank_str}") + if r.get("ocr_snippet"): + print(f" OCR snippet: {r['ocr_snippet'][:200]}") + print(f" url: {r['web_url']}") + + +def main(): + config = load_config() + multivectors, metadata = load_index() + print(f"Loaded index: {len(metadata)} pages") + + cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"]) + api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) + + demo = [ + ("how do I sign in to the VPN?", "acme-corp"), + ("what is the parental leave policy?", "globex"), + ("audit prep evidence and walkthroughs", "initech"), + # No tenant filter: shows the query routes across tenants. 
+ ("expense reports and per diem", None), + ] + with SIEClient(cluster_url, api_key=api_key) as client: + for query, tenant in demo: + out = search(client, config, multivectors, metadata, query, tenant) + print_run(out, query, tenant) + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/server.py b/examples/vision-doc-rag/python/server.py new file mode 100644 index 00000000..d61e5962 --- /dev/null +++ b/examples/vision-doc-rag/python/server.py @@ -0,0 +1,96 @@ +"""FastAPI backend for the multi-tenant visual-document search + QA demo.""" + +from __future__ import annotations + +import os +from contextlib import asynccontextmanager +from pathlib import Path + +import yaml +from fastapi import FastAPI, Query +from fastapi.responses import FileResponse +from fastapi.staticfiles import StaticFiles + +from sie_sdk import SIEClient + +from search import load_index, search + +config = None +multivectors = None +metadata = None +client = None +clients_index: list[str] = [] + + +@asynccontextmanager +async def lifespan(app: FastAPI): + global config, multivectors, metadata, client, clients_index + root = Path(__file__).resolve().parent.parent + config = yaml.safe_load((root / "config.yaml").read_text()) + multivectors, metadata = load_index() + cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"]) + api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) + client = SIEClient(cluster_url, api_key=api_key) + clients_index = sorted({m["client"] for m in metadata}) + yield + client.close() + + +app = FastAPI(title="SIE Vision-First Document RAG", lifespan=lifespan) + +root = Path(__file__).resolve().parent.parent +static_dir = root / "static" +app.mount("/static", StaticFiles(directory=str(static_dir)), name="static") +app.mount("/pages", StaticFiles(directory=str(root / "data" / "pages")), name="pages") + + +@app.get("/") +def index(): + return FileResponse(str(static_dir / "index.html")) + + 
+@app.get("/api/clients") +def api_clients(): + return clients_index + + +@app.get("/api/stats") +def api_stats(): + return { + "total_pages": len(metadata), + "clients": clients_index, + "models": config["models"], + "visual_rerank": config["search"].get("visual_rerank", False), + "answer": config["search"].get("answer", True), + } + + +@app.get("/api/search") +def api_search( + q: str = Query(..., min_length=1), + client_name: str | None = Query(None, alias="client"), +): + out = search(client, config, multivectors, metadata, q, client_name) + return { + "query": q, + "client": client_name, + "answer": out["answer"], + "timings": out["timings"], + "results": [ + { + "page_id": r["page_id"], + "client": r["client"], + "title": r["title"], + "space": r["space"], + "author": r["author"], + "web_url": r["web_url"], + "page_image": f"/pages/{r['page_id']}.png", + "ocr_snippet": r.get("ocr_snippet", ""), + "scores": { + "maxsim": round(r["_maxsim_score"], 4), + "rerank": round(r["_rerank_score"], 4) if r.get("_rerank_score") is not None else None, + }, + } + for r in out["results"] + ], + } diff --git a/examples/vision-doc-rag/static/index.html b/examples/vision-doc-rag/static/index.html new file mode 100644 index 00000000..392c8791 --- /dev/null +++ b/examples/vision-doc-rag/static/index.html @@ -0,0 +1,190 @@ + + + + + + Vision-First Document RAG · SIE + + + +
+ Multi-Tenant Visual Doc Search + QA
+ ColQwen2.5 ranks pages by looking at the images. Florence-2-DocVQA reads the top page and answers the question. All on one SIE endpoint.
+ + + From 2ba8efc3c86ccb402a650e6127453825ba19afe8 Mon Sep 17 00:00:00 2001 From: svonava Date: Sun, 24 May 2026 09:55:21 -0700 Subject: [PATCH 2/3] feat(examples): use real PDFs for vision doc RAG --- examples/vision-doc-rag/.gitignore | 3 + examples/vision-doc-rag/README.md | 240 ++++++++++-------- examples/vision-doc-rag/config.yaml | 11 +- examples/vision-doc-rag/data/fetch_dataset.py | 211 --------------- examples/vision-doc-rag/data/fetch_pdfs.py | 158 ++++++++++++ examples/vision-doc-rag/data/render_pages.py | 223 +++++++++------- examples/vision-doc-rag/python/ingest.py | 31 ++- .../vision-doc-rag/python/requirements.txt | 4 +- examples/vision-doc-rag/python/search.py | 25 +- examples/vision-doc-rag/python/server.py | 11 +- examples/vision-doc-rag/static/index.html | 21 +- 11 files changed, 487 insertions(+), 451 deletions(-) delete mode 100644 examples/vision-doc-rag/data/fetch_dataset.py create mode 100644 examples/vision-doc-rag/data/fetch_pdfs.py diff --git a/examples/vision-doc-rag/.gitignore b/examples/vision-doc-rag/.gitignore index a787e920..9a052846 100644 --- a/examples/vision-doc-rag/.gitignore +++ b/examples/vision-doc-rag/.gitignore @@ -1,6 +1,9 @@ .venv/ __pycache__/ data/pages.json +data/pdfs_manifest.json +data/pages_manifest.json +data/pdfs/ data/pages/ data/multivectors.npz data/metadata.json diff --git a/examples/vision-doc-rag/README.md b/examples/vision-doc-rag/README.md index f179051c..f2ca3775 100644 --- a/examples/vision-doc-rag/README.md +++ b/examples/vision-doc-rag/README.md @@ -1,64 +1,70 @@ # Vision-first document RAG -Retrieve by image, answer by image. ColQwen2.5 reads each page as a picture -and ranks them via late interaction; Florence-2-DocVQA reads the winning -page and produces the textual answer. OCR never enters the score path, so -charts, screenshots, tables, and any other layout cue that would die in a -text round-trip still drives ranking. Everything runs on one SIE endpoint. +Retrieve by image, answer by image. 
ColQwen2.5 reads each PDF page as a +picture and ranks pages via late interaction; Florence-2-DocVQA reads the +winning page and produces the textual answer. OCR never enters the score path, +so schematics, pinout diagrams, architecture slides, charts, and other layout +cues still drive ranking. Everything runs on one SIE endpoint. Each page also carries a `client` tag, so the same corpus serves multiple -tenants from one index — queries scoped to `acme-corp` cannot retrieve a -`globex` page, no separate index per tenant required. +tenants from one index. Queries scoped to `embedded-lab` cannot retrieve +`ops-eng` or `aerospace` pages. + +## Corpus + +The demo fetches a small public PDF batch on demand and renders selected pages +to PNGs. The page selections are deliberately capped so local ingest stays +fast while still indexing visually rich pages. + +| Tenant | Sources | Visual signal | +|---|---|---| +| `embedded-lab` | Raspberry Pi Pico datasheet, Arduino UNO R3 datasheet, Arduino UNO R3 schematic | Pinout diagrams, board diagrams, circuit schematics | +| `ops-eng` | PostgreSQL manual, CNCF Kubernetes / cloud-native architecture material | Architecture diagrams, operational flows, dense technical tables | +| `aerospace` | NASA NTRS nozzle and booster reports | Engineering drawings, cross-sections, charts, mission technical figures | + +Generated files are ignored: + +```text +data/pdfs/ # downloaded PDFs +data/pdfs_manifest.json # source manifest from fetch_pdfs.py +data/pages/ # rendered PNG pages +data/pages_manifest.json # page-level metadata from render_pages.py +data/metadata.json # index metadata from ingest.py +data/multivectors.npz # page multivectors from ingest.py +``` ## SIE features used -- `encode` — `vidore/colqwen2.5-v0.2` on page images at ingest and on the - query text at search time. Output is a `[tokens, 128]` multivector. Late - interaction (`sie_sdk.scoring.maxsim`) is the only ranking signal. 
-- `extract` — `mynkchaudhry/Florence-2-FT-DocVQA`. Called twice, with two - jobs: with `instruction=` to get a textual answer for the - top page, and without `instruction` to OCR the same page for a display - snippet. The OCR snippet is UX-only — it never enters the score path. -- `score` *(optional)* — `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank - over `(query text, page image)`. Off by default while we wait for an - upstream adapter fix; flip `search.visual_rerank: true` in `config.yaml` - to enable it on a cluster that's ready. - -## Why vision end-to-end - -OCR-then-text-rerank throws away the exact signal we pick ColQwen for — -charts, screenshots, tables, callouts, and the spatial layout that tells -a wiki page apart from a checklist. The rerank stays visual or doesn't -happen. The OCR step shows on-screen text next to the page image so the -user can copy/paste from the result, nothing more. - -## Multi-tenant by construction - -Every page carries a `client` field in `data/pages.json`. The metadata list -loaded by `python/search.py` is filtered by `client_name` before MaxSim -runs, so a query scoped to `acme-corp` cannot retrieve a `globex` page. -Real deployments would push `client` down into the multivector store's -filter expression; the demo keeps everything in memory because the corpus -is tiny. +- `encode` - `vidore/colqwen2.5-v0.2` on page images at ingest and on query + text at search time. Output is a `[tokens, 128]` multivector. Late + interaction (`sie_sdk.scoring.maxsim`) is the first-stage ranking signal. +- `extract` - `mynkchaudhry/Florence-2-FT-DocVQA`. Called with + `instruction=` to get a textual answer for the top page, and + without `instruction` to OCR the same page for a display snippet. The OCR + snippet is UX-only; it never enters ranking. +- `score` optional - `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank over + `(query text, page image)`. 
Off by default while we wait for an upstream + adapter fix; flip `search.visual_rerank: true` in `config.yaml` to enable it + on a cluster that's ready. ## Run it -You need Python 3.12 and a reachable SIE cluster (or local `docker run`). +You need Python 3.12 and a reachable SIE cluster. ```bash -# 1. SIE locally (or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster). +# 1. SIE locally, or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster. docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default -# 2. Generate the synthetic corpus and render each page to a PNG. +# 2. Fetch public PDFs and render selected pages to PNG. cd examples/vision-doc-rag pip install -r python/requirements.txt -python data/fetch_dataset.py +python data/fetch_pdfs.py python data/render_pages.py -# 3. Encode every page with ColQwen2.5 and save the multivectors. +# 3. Encode every rendered page with ColQwen2.5 and save the multivectors. python python/ingest.py -# 4a. CLI demo — runs four scoped queries and prints results. +# 4a. CLI demo. python python/search.py # 4b. Or start the UI. @@ -66,30 +72,34 @@ uvicorn --app-dir python server:app --port 8888 open http://localhost:8888 ``` -First run on a cold cluster pays a one-time model load: ColQwen2.5 and -Florence-2 are both several GB, expect roughly a minute on CPU and a few +`render_pages.py` uses `pdf2image` when Poppler is available. If Poppler is +not installed, it falls back to PyMuPDF, which is installed from +`python/requirements.txt`. + +First run on a cold cluster pays a one-time model load. ColQwen2.5 and +Florence-2 are both several GB, so expect roughly a minute on CPU and a few seconds on GPU before the warm path kicks in. -### Pointing at a managed cluster +### Managed cluster ```bash export SIE_CLUSTER_URL="https://your-cluster-host:8080" export SIE_API_KEY="SL-..." ``` -The defaults in `config.yaml` point at `http://localhost:8080` so the env -vars only matter when you're hitting something remote. 
Set `cluster.gpu` -to a profile name like `l4-spot` if the cluster needs an explicit GPU -class. +The defaults in `config.yaml` point at `http://localhost:8080`. Set +`cluster.gpu` to a profile name like `l4-spot` if the cluster needs an +explicit GPU class. ## Try these queries | Tenant | Query | Why it's interesting | |---|---|---| -| `acme-corp` | how do I sign in to the VPN? | Visual layout match — the page is titled "VPN setup for new engineers" with a bulleted body, and ColQwen2.5 picks it without keyword overlap with "sign in". DocVQA reads the page and answers with the client name and the auth method. | -| `globex` | what is the parental leave policy? | Disambiguates from "time off" — the right page mentions parental leave only halfway down the body. The textual answer cites the week count. | -| `initech` | audit prep evidence and walkthroughs | All three Initech pages are compliance-flavored; the visual model breaks the tie by reading the checklist layout. | -| `globex` | how do I sign in to the VPN? | Tenant filter — even though the same query hit acme-corp earlier, scoping to globex returns the closest globex page (Wi-Fi guide) and never leaks acme content. | +| `embedded-lab` | Raspberry Pi Pico pinout GP21 | Should land on a pinout/table page even when the visual label is abbreviated. | +| `embedded-lab` | where is the ATmega16U2 on the schematic? | Circuit schematic retrieval, not prose retrieval. | +| `ops-eng` | cloud native architecture diagram | Finds a visual architecture page or slide instead of relying on OCR text only. | +| `aerospace` | solid rocket motor nozzle design figure | Targets an engineering drawing or figure-heavy report page. | +| `ops-eng` | Raspberry Pi Pico pinout GP21 | Tenant filter: the query cannot leak embedded-lab pages when scoped to ops-eng. | ## API @@ -98,33 +108,27 @@ class. | Parameter | Required | Description | |---|---|---| | `q` | yes | Search query | -| `client` | no | Tenant filter (e.g. `acme-corp`). 
Omitted ⇒ search runs across all tenants. | +| `client` | no | Tenant filter, for example `embedded-lab`. Omitted means search all tenants. | ```bash -curl "http://localhost:8888/api/search?q=how+do+I+sign+in+to+the+VPN&client=acme-corp" +curl "http://localhost:8888/api/search?q=Raspberry+Pi+Pico+pinout+GP21&client=embedded-lab" ``` ```json { - "query": "how do I sign in to the VPN", - "client": "acme-corp", - "answer": "Okta credentials with Duo Push for 2FA", - "timings": { - "encode_query_s": 0.12, - "maxsim_s": 0.003, - "docvqa_s": 0.91, - "ocr_snippet_s": 0.84 - }, + "query": "Raspberry Pi Pico pinout GP21", + "client": "embedded-lab", + "answer": "GP21 can be used for ...", "results": [ { - "page_id": "ACME-101", - "client": "acme-corp", - "title": "VPN setup for new engineers", - "space": "Engineering", - "author": "alice@acme", - "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", - "page_image": "/pages/ACME-101.png", - "ocr_snippet": "VPN Setup for New Engineers · ...", + "page_id": "embedded-lab__raspberry-pi-pico-datasheet__p005", + "client": "embedded-lab", + "title": "Raspberry Pi Pico Datasheet", + "publisher": "Raspberry Pi Ltd", + "source_pdf": "raspberry-pi-pico-datasheet.pdf", + "page_number": 5, + "citation": "raspberry-pi-pico-datasheet.pdf · p.5", + "page_image": "/pages/embedded-lab/raspberry-pi-pico-datasheet_p005.png", "scores": { "maxsim": 14.44, "rerank": null } } ] @@ -133,48 +137,68 @@ curl "http://localhost:8888/api/search?q=how+do+I+sign+in+to+the+VPN&client=acme ### `GET /api/clients`, `GET /api/stats` -Tenant list and runtime config (active models, rerank on/off, page count). +Tenant list and runtime config: active models, rerank on/off, and page count. 
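The `/api/search` request shown with curl above can also be built from Python. The following is a minimal sketch using only the standard library; the helper name `build_search_url` and its default base URL are illustrative assumptions and are not part of the example's code.

```python
from urllib.parse import urlencode


def build_search_url(query, client=None, base="http://localhost:8888"):
    """Build a GET /api/search URL (hypothetical helper, not in the repo).

    `client` is the optional tenant filter; leaving it out searches all tenants.
    """
    params = {"q": query}
    if client is not None:
        params["client"] = client
    # urlencode escapes spaces as '+', matching the curl example.
    return f"{base}/api/search?{urlencode(params)}"


url = build_search_url("Raspberry Pi Pico pinout GP21", client="embedded-lab")
print(url)
```

With the server running, the URL can be fetched with `urllib.request.urlopen(url)` and decoded with `json.load`; the response carries `answer` and a `results` list whose entries include `scores.maxsim`, as in the JSON sample above.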
## How it works -``` - ┌──────────────────────────────────────────────────────────────┐ - │ ingest.py (once per corpus) │ - │ pages.json ─▶ render_pages.py ─▶ data/pages/*.png │ - │ ─▶ SIE.encode(ColQwen2.5, images, multivector) │ - │ ─▶ data/multivectors.npz + data/metadata.json │ - └──────────────────────────────────────────────────────────────┘ - │ - ▼ - ┌──────────────────────────────────────────────────────────────┐ - │ search.py / server.py (per query) │ - │ q ─▶ SIE.encode(ColQwen2.5, text, is_query=True) │ - │ ─▶ filter metadata by tenant │ - │ ─▶ sie_sdk.scoring.maxsim → top_k_candidates │ - │ ─▶ [optional] SIE.score(Qwen3-VL-Reranker, q, images) │ - │ ─▶ SIE.extract(Florence-2-DocVQA, instruction=q, │ - │ images=[top_page]) ⇒ textual answer │ - │ ─▶ SIE.extract(Florence-2-DocVQA, images=[top_page]) │ - │ ⇒ OCR snippet (UI) │ - └──────────────────────────────────────────────────────────────┘ +```text + ingest.py (once per corpus) + fetch_pdfs.py -> data/pdfs/{tenant}/*.pdf + -> render_pages.py -> data/pages/{tenant}/*.png + -> data/pages_manifest.json + -> SIE.encode(ColQwen2.5, images, multivector) + -> data/multivectors.npz + data/metadata.json + + search.py / server.py (per query) + q -> SIE.encode(ColQwen2.5, text, is_query=True) + -> filter metadata by tenant + -> sie_sdk.scoring.maxsim -> top_k_candidates + -> optional SIE.score(Qwen3-VL-Reranker, q, images) + -> SIE.extract(Florence-2-DocVQA, instruction=q, images=[top_page]) + -> SIE.extract(Florence-2-DocVQA, images=[top_page]) for display OCR ``` -OCR is never on the score path. The visual reranker (when enabled) ranks -over the same modality as retrieval, so layout cues survive both stages. +OCR is never on the score path. The visual reranker, when enabled, ranks over +the same modality as retrieval, so layout cues survive both stages. -The corpus is small enough that MaxSim runs in Python. For thousands of -pages, hand the multivectors to LanceDB or Vespa; only the SIE calls stay -the same. 
+The corpus is small enough that MaxSim runs in Python. For thousands of pages, +hand the multivectors to LanceDB, Vespa, or another multivector store; the SIE +calls stay the same. ## Customize -`config.yaml` is the single tuning surface: +`data/fetch_pdfs.py` owns the curated source list. Add a source with: + +```python +{ + "client": "my-tenant", + "slug": "my-manual", + "title": "My Manual", + "publisher": "Example Publisher", + "license": "CC BY 4.0", + "url": "https://example.com/my-manual.pdf", + "pages": [1, 2, 7, 8], +} +``` + +Then rerun: + +```bash +python data/fetch_pdfs.py +python data/render_pages.py +python python/ingest.py +``` + +`config.yaml` is the model and rendering tuning surface: ```yaml models: - retriever: "vidore/colqwen2.5-v0.2" # smaller: vidore/colpali-v1.3-hf + retriever: "vidore/colqwen2.5-v0.2" docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" - reranker: "Qwen/Qwen3-VL-Reranker-2B" # used only when search.visual_rerank: true + reranker: "Qwen/Qwen3-VL-Reranker-2B" +render: + backend: "auto" + dpi: 160 search: top_k_candidates: 5 top_k_results: 3 @@ -183,22 +207,18 @@ search: ocr_snippet: true ``` -Swap any model for another from the -[SIE model catalog](https://superlinked.com/models) and the pipeline keeps -working. 
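The tenant filter and MaxSim steps listed under "How it works" can be sketched in plain Python. This is an illustration under stated assumptions: it uses nested lists and toy 2-dimensional vectors instead of the `[tokens, 128]` NumPy arrays the demo stores, and `sie_sdk.scoring.maxsim` may differ in details such as normalization or batching. The names `maxsim_score` and `rank_pages` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_mv, page_mv):
    """Late interaction: each query token takes its best-matching page token; sum the maxima."""
    return sum(max(dot(q, d) for d in page_mv) for q in query_mv)


def rank_pages(query_mv, pages, client_filter=None, top_k=3):
    """Filter by tenant first, then rank the remaining pages by MaxSim (hypothetical helper)."""
    corpus = [p for p in pages if not client_filter or p["client"] == client_filter]
    scored = [(maxsim_score(query_mv, p["mv"]), p["page_id"]) for p in corpus]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy corpus: two tenants, tiny multivectors (assumed data, for illustration only).
pages = [
    {"page_id": "a1", "client": "embedded-lab", "mv": [[1.0, 0.0], [0.0, 1.0]]},
    {"page_id": "a2", "client": "embedded-lab", "mv": [[0.5, 0.5]]},
    {"page_id": "b1", "client": "ops-eng", "mv": [[1.0, 1.0]]},
]
query = [[1.0, 0.0], [0.0, 2.0]]

ranked = rank_pages(query, pages, client_filter="embedded-lab")
print(ranked)
```

Because the filter runs before scoring, a query scoped to one tenant never touches another tenant's vectors, which is the isolation property the README describes.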
- ## Project layout ```text examples/vision-doc-rag/ ├── config.yaml ├── data/ -│ ├── fetch_dataset.py # synthetic 3-tenant page corpus -│ ├── render_pages.py # pages.json → PNG screenshots -│ ├── pages.json # generated -│ ├── pages/ # generated PNGs -│ ├── metadata.json # generated by ingest -│ └── multivectors.npz # generated by ingest +│ ├── fetch_pdfs.py # curated public PDF source list + downloader +│ ├── render_pages.py # PDFs -> PNG pages + pages_manifest.json +│ ├── pdfs/ # generated +│ ├── pages/ # generated PNGs +│ ├── metadata.json # generated by ingest +│ └── multivectors.npz # generated by ingest ├── python/ │ ├── ingest.py │ ├── search.py diff --git a/examples/vision-doc-rag/config.yaml b/examples/vision-doc-rag/config.yaml index 8b35ffda..587548f2 100644 --- a/examples/vision-doc-rag/config.yaml +++ b/examples/vision-doc-rag/config.yaml @@ -25,14 +25,11 @@ models: # Re-enable with search.visual_rerank: true once that ships. reranker: "Qwen/Qwen3-VL-Reranker-2B" -# Page rendering (used by data/render_pages.py to turn the synthetic page -# corpus into PNGs; replace with pdf2image, screenshots, or your own files -# for a real deployment). +# Page rendering. `auto` tries pdf2image/Poppler first and falls back to +# PyMuPDF when Poppler is not installed. render: - width: 1024 - height: 1280 - body_font_size: 20 - title_font_size: 30 + backend: "auto" # auto | pdf2image | pymupdf + dpi: 160 # Retrieval search: diff --git a/examples/vision-doc-rag/data/fetch_dataset.py b/examples/vision-doc-rag/data/fetch_dataset.py deleted file mode 100644 index eb901a6c..00000000 --- a/examples/vision-doc-rag/data/fetch_dataset.py +++ /dev/null @@ -1,211 +0,0 @@ -"""Synthetic multi-tenant page corpus. - -Three fictional clients, each with a handful of pages — engineering runbooks, -HR policies, finance procedures. Small enough to encode in a minute on a warm -GPU cluster, varied enough to make multi-tenant filtering and visual retrieval -meaningful. 
Replace `PAGES` with your own pages (wiki export, Notion dump, -PDF batch, etc.) to point the demo at real content. -""" - -import json -from pathlib import Path - -PAGES = [ - # ── acme-corp: engineering ──────────────────────────────────────────── - { - "client": "acme-corp", - "page_id": "ACME-101", - "title": "VPN setup for new engineers", - "space": "Engineering", - "author": "alice@acme", - "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", - "body": [ - "All engineers need to connect through the corporate VPN to reach internal services.", - "We use Cisco AnyConnect on macOS and Windows, and the OpenConnect CLI on Linux.", - "Download the client from it.acme.com/vpn, then sign in with your Okta credentials.", - "Two-factor confirmation goes through Duo Push.", - "If you hit a TLS error on first connection, check that the device certificate from Jamf is installed.", - "For on-call rotations, request the always-on VPN profile from IT — it auto-reconnects after suspend.", - ], - }, - { - "client": "acme-corp", - "page_id": "ACME-102", - "title": "On-call rotation and paging", - "space": "Engineering", - "author": "bob@acme", - "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/102", - "body": [ - "Engineering on-call runs Monday to Monday handovers at 10:00 PT.", - "Primary takes the pager, secondary takes the laptop, both are paid the on-call stipend.", - "Pages route through PagerDuty; the escalation policy is primary -> secondary (15 min) -> manager.", - "During an incident open a Zoom bridge and a Slack channel named #inc-YYYYMMDD-summary.", - "Postmortems are due within five working days and live in the Incidents space.", - ], - }, - { - "client": "acme-corp", - "page_id": "ACME-103", - "title": "Deploying to production with our CI/CD pipeline", - "space": "Engineering", - "author": "carol@acme", - "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/103", - "body": [ - "We use GitHub Actions for CI and ArgoCD for delivery 
to Kubernetes.", - "Merging to main triggers a build, runs the test suite, pushes an image to ECR, and updates the staging manifest.", - "Production rollouts are gated by a manual approval in ArgoCD and require two reviewers from the service team.", - "Use the rolling strategy with maxSurge=25% by default.", - "Hotfix tags follow the pattern v1.2.3-hotfix.N and skip staging only with on-call approval recorded in the PR.", - ], - }, - { - "client": "acme-corp", - "page_id": "ACME-104", - "title": "Local development setup", - "space": "Engineering", - "author": "dan@acme", - "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/104", - "body": [ - "Install mise to manage runtimes — it pins Node, Python, and Go versions per repo.", - "Run `mise install` in the repo root, then `make dev` to spin up Postgres, Redis, and the API gateway in Docker.", - "The seed data covers the last 30 days of staging traffic, sanitized of PII.", - "If port 5432 is already taken, override DEV_PG_PORT in your shell profile.", - ], - }, - # ── globex: HR and admin ────────────────────────────────────────────── - { - "client": "globex", - "page_id": "GLOBEX-201", - "title": "Time off and vacation policy", - "space": "HR", - "author": "hr@globex", - "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/201", - "body": [ - "Globex offers 25 working days of paid vacation per year, accruing monthly from the start date.", - "Requests go through Workday at least two weeks in advance for absences longer than three days.", - "Sick leave is separate and uncapped, but anything over three consecutive days requires a doctor's note.", - "Parental leave is 18 weeks at full pay for the primary caregiver and 6 weeks for the secondary, regardless of gender.", - "Unused vacation rolls over up to 10 days into the next calendar year; the rest is paid out.", - ], - }, - { - "client": "globex", - "page_id": "GLOBEX-202", - "title": "Expense reports and reimbursement", - "space": "HR", - "author": 
"finance@globex", - "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/202", - "body": [ - "Submit expenses in Expensify within 30 days of the transaction.", - "Receipts are mandatory for any item over $25; below that, a description and category are enough.", - "Travel bookings should go through Navan when possible — direct bookings need pre-approval from your manager.", - "Reimbursements process every Friday and land in your payroll account the following Tuesday.", - "Per diem for international travel is $80 USD equivalent for meals.", - ], - }, - { - "client": "globex", - "page_id": "GLOBEX-203", - "title": "Office perks and meals", - "space": "HR", - "author": "office@globex", - "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/203", - "body": [ - "Lunch is catered Monday through Thursday in the main cafe from 12:00 to 14:00.", - "There are always vegetarian, vegan, and gluten-free options labeled at the buffet.", - "Friday is a free-lunch credit you can spend at any partner restaurant in the office app.", - "Snacks and drinks in the micro-kitchens are unlimited; please refill empty trays.", - "The wellness stipend is $100 per month, claimable in Expensify under category Wellness.", - ], - }, - { - "client": "globex", - "page_id": "GLOBEX-204", - "title": "Office Wi-Fi and guest network", - "space": "IT", - "author": "it@globex", - "web_url": "https://globex.atlassian.net/wiki/spaces/IT/pages/204", - "body": [ - "Connect to Globex-Corp for the employee network; sign in with your @globex.com SSO.", - "Globex-Guest is for visitors — the rotating daily password is on the lobby screen.", - "Printing requires the Globex-Print network and a one-time pairing with your laptop using the Mobility Print app.", - "If your laptop will not join, forget the network and rejoin; the cert is renewed weekly and old caches get stuck.", - ], - }, - # ── initech: finance and compliance ─────────────────────────────────── - { - "client": "initech", - "page_id": 
"INIT-301", - "title": "SOX controls and quarterly attestation", - "space": "Compliance", - "author": "compliance@initech", - "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/301", - "body": [ - "Initech is subject to SOX 404 reporting for financial controls over revenue, expense, and access management.", - "Every quarter, control owners attest in AuditBoard that their controls operated as designed.", - "Evidence is automatically collected from Workday, NetSuite, and Okta where possible; manual evidence goes in the AuditBoard Drive folder.", - "External auditors test a sample of controls in Q3; expect requests for screenshots and approver lists.", - "Exceptions must be logged within five business days of detection.", - ], - }, - { - "client": "initech", - "page_id": "INIT-302", - "title": "Vendor onboarding and due diligence", - "space": "Procurement", - "author": "procurement@initech", - "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/302", - "body": [ - "New vendors above $50,000 annual spend require a security review and a SOC 2 Type II report on file.", - "Submit the vendor questionnaire through Vanta; legal will review the MSA within five business days.", - "Payment terms default to Net 60; faster terms require CFO approval and reduce the risk score in NetSuite.", - "Sanctioned-country checks run automatically via the OFAC integration; any hit halts the workflow until cleared.", - "Annual recertification of high-risk vendors happens every January.", - ], - }, - { - "client": "initech", - "page_id": "INIT-303", - "title": "Audit prep checklist", - "space": "Compliance", - "author": "audit@initech", - "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/303", - "body": [ - "Two weeks before the auditors arrive, freeze the control population in AuditBoard and export the evidence index.", - "Confirm with control owners that they will be available for walkthrough interviews — block 60 minutes in their calendars.", - 
"Pull the user access review reports for the prior two quarters from Okta and confirm sign-off in writing.", - "Have the change management JIRA queries ready: filter by label sox-relevant and status Done.", - "If a control failed mid-period, document the compensating control and the date the gap was closed.", - ], - }, - { - "client": "initech", - "page_id": "INIT-304", - "title": "Procurement card limits and exceptions", - "space": "Procurement", - "author": "procurement@initech", - "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/304", - "body": [ - "Procurement cards (P-cards) have a default monthly limit of $5,000 and a single-transaction limit of $1,500.", - "Use them for low-dollar, low-risk purchases — software subscriptions and conference tickets are the common cases.", - "Limit-increase requests need manager and CFO approval and a documented business need.", - "Personal use, cash advances, and split transactions to bypass the single-transaction limit are policy violations.", - "All P-card transactions reconcile in Coupa within 14 days of statement close.", - ], - }, -] - - -def main(): - out = Path(__file__).resolve().parent / "pages.json" - out.write_text(json.dumps(PAGES, indent=2)) - by_client = {} - for p in PAGES: - by_client[p["client"]] = by_client.get(p["client"], 0) + 1 - print(f"Wrote {len(PAGES)} pages to {out}") - for client, n in sorted(by_client.items()): - print(f" {client}: {n} pages") - - -if __name__ == "__main__": - main() diff --git a/examples/vision-doc-rag/data/fetch_pdfs.py b/examples/vision-doc-rag/data/fetch_pdfs.py new file mode 100644 index 00000000..21ade844 --- /dev/null +++ b/examples/vision-doc-rag/data/fetch_pdfs.py @@ -0,0 +1,158 @@ +"""Download the public PDF corpus for the visual document RAG demo. + +The corpus is intentionally small and curated. 
Each source has a tenant, a +stable slug, source metadata, and a limited page selection so the demo can be +indexed quickly while still containing diagrams, schematics, screenshots, and +technical figures that reward visual retrieval. +""" + +from __future__ import annotations + +import json +import shutil +import sys +import tempfile +from pathlib import Path +from urllib.error import HTTPError, URLError +from urllib.request import Request, urlopen + + +SOURCES = [ + { + "client": "embedded-lab", + "slug": "raspberry-pi-pico-datasheet", + "title": "Raspberry Pi Pico Datasheet", + "publisher": "Raspberry Pi Ltd", + "license": "CC BY-ND 4.0", + "url": "https://datasheets.raspberrypi.com/pico/pico-datasheet.pdf", + "pages": [4, 5, 6, 7, 8, 9], + }, + { + "client": "embedded-lab", + "slug": "arduino-uno-r3-datasheet", + "title": "Arduino UNO R3 Datasheet", + "publisher": "Arduino", + "license": "Arduino documentation / open hardware terms", + "url": "https://docs.arduino.cc/resources/datasheets/A000066-datasheet.pdf", + "pages": [5, 6, 7, 8, 9, 10, 11], + }, + { + "client": "embedded-lab", + "slug": "arduino-uno-r3-schematic", + "title": "Arduino UNO R3 Schematic", + "publisher": "Arduino", + "license": "CC BY-SA 4.0 hardware reference design", + "url": "https://docs.arduino.cc/resources/schematics/A000066-schematics.pdf", + "pages": [1, 2], + }, + { + "client": "ops-eng", + "slug": "postgresql-18-manual", + "title": "PostgreSQL 18 Documentation", + "publisher": "PostgreSQL Global Development Group", + "license": "PostgreSQL License", + "url": "https://www.postgresql.org/files/documentation/pdf/18/postgresql-18-A4.pdf", + "pages": [19, 20, 21, 22, 23, 24], + }, + { + "client": "ops-eng", + "slug": "kubernetes-infrastructure-abstraction", + "title": "Kubernetes as Infrastructure Abstraction", + "publisher": "Cloud Native Computing Foundation", + "license": "CNCF public presentation material", + "url": 
"https://www.cncf.io/wp-content/uploads/2020/08/2019-09-Kubernetes-as-Infrastructure-Abstraction.pdf", + "pages": [6, 7, 8, 9, 10, 11], + }, + { + "client": "ops-eng", + "slug": "cloud-native-ai-whitepaper", + "title": "Cloud Native Artificial Intelligence Whitepaper", + "publisher": "Cloud Native Computing Foundation", + "license": "CNCF documentation / report terms", + "url": "https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf", + "pages": [11, 12, 13, 14, 15, 16], + }, + { + "client": "aerospace", + "slug": "solid-rocket-motor-nozzles", + "title": "Solid Rocket Motor Nozzles", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19760013126.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, + { + "client": "aerospace", + "slug": "liquid-rocket-engine-nozzles", + "title": "Liquid Rocket Engine Nozzles", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19770009165.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, + { + "client": "aerospace", + "slug": "sls-booster-state-machine", + "title": "State Machine Modeling of the Space Launch System Solid Rocket Boosters", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20160000328.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, +] + + +def _download(url: str, out: Path) -> bool: + """Download url to out atomically. 
Return True when a new file was written.""" + if out.exists() and out.stat().st_size > 0: + return False + + out.parent.mkdir(parents=True, exist_ok=True) + request = Request( + url, + headers={ + "User-Agent": "sie-vision-doc-rag-demo/1.0", + "Accept": "application/pdf,*/*", + }, + ) + with tempfile.NamedTemporaryFile(delete=False, dir=out.parent, suffix=".tmp") as tmp: + tmp_path = Path(tmp.name) + try: + with urlopen(request, timeout=60) as response: + shutil.copyfileobj(response, tmp) + except (HTTPError, URLError, TimeoutError): + tmp_path.unlink(missing_ok=True) + raise + + tmp_path.replace(out) + return True + + +def main() -> None: + here = Path(__file__).resolve().parent + pdf_root = here / "pdfs" + manifest = [] + + for source in SOURCES: + pdf_path = pdf_root / source["client"] / f"{source['slug']}.pdf" + try: + downloaded = _download(source["url"], pdf_path) + except Exception as exc: + print(f"Failed to download {source['url']}: {type(exc).__name__}: {exc}", file=sys.stderr) + raise + + row = dict(source) + row["pdf_path"] = str(pdf_path.relative_to(here)) + row["source_pdf"] = pdf_path.name + manifest.append(row) + + status = "downloaded" if downloaded else "cached" + print(f" {status:10s} {source['client']:12s} {source['slug']} -> {row['pdf_path']}") + + out = here / "pdfs_manifest.json" + out.write_text(json.dumps({"sources": manifest}, indent=2) + "\n") + print(f"\nWrote {len(manifest)} PDF sources to {out}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/data/render_pages.py b/examples/vision-doc-rag/data/render_pages.py index 4043d71b..9c123305 100644 --- a/examples/vision-doc-rag/data/render_pages.py +++ b/examples/vision-doc-rag/data/render_pages.py @@ -1,105 +1,146 @@ -"""Render the synthetic pages to PNG screenshots. +"""Rasterize the curated PDF corpus to page PNGs. -Each entry in pages.json becomes one image in data/pages/.png. 
The -layout is intentionally plain — a title, a metadata line, and a body block — -so ColQwen2.5 sees the same kind of visual structure it would in real wikis, -docs, or PDFs. Replace this script with `pdf2image` (or screenshots) when -pointing at real content. +The script tries pdf2image first because it produces excellent page images +when Poppler is installed. If Poppler or pdf2image is unavailable, it falls +back to PyMuPDF so the demo still works with only Python package dependencies. """ +from __future__ import annotations + import json import sys from pathlib import Path import yaml -from PIL import Image, ImageDraw, ImageFont - - -def _font(size: int): - """Try the platform Helvetica, fall back to PIL's default bitmap font.""" - for path in [ - "/System/Library/Fonts/Helvetica.ttc", - "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", - "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf", - ]: - if Path(path).exists(): - return ImageFont.truetype(path, size) - return ImageFont.load_default() - - -def _wrap(text: str, font: ImageFont.ImageFont, max_width: int) -> list[str]: - """Greedy word wrap so body paragraphs fit the page width.""" - lines: list[str] = [] - for paragraph in text.split("\n"): - words = paragraph.split() - current = "" - for word in words: - candidate = f"{current} {word}".strip() - if font.getlength(candidate) <= max_width: - current = candidate - else: - if current: - lines.append(current) - current = word - if current: - lines.append(current) - return lines - - -def render_page(page: dict, width: int, height: int, body_size: int, title_size: int) -> Image.Image: - img = Image.new("RGB", (width, height), "white") - draw = ImageDraw.Draw(img) - title_font = _font(title_size) - meta_font = _font(int(body_size * 0.9)) - body_font = _font(body_size) - - margin = 48 - cursor_y = margin - draw.text((margin, cursor_y), page["title"], fill="black", font=title_font) - cursor_y += int(title_size * 1.6) - meta = f"{page['space']} · 
{page['author']} · {page['page_id']}" - draw.text((margin, cursor_y), meta, fill=(96, 96, 96), font=meta_font) - cursor_y += int(title_size * 1.2) - draw.line([(margin, cursor_y), (width - margin, cursor_y)], fill=(200, 200, 200), width=2) - cursor_y += int(body_size * 1.2) - - max_text_width = width - 2 * margin - line_gap = int(body_size * 1.5) - for bullet in page["body"]: - # Render each body line as a wrapped paragraph block. - lines = _wrap(bullet, body_font, max_text_width) - for line in lines: - draw.text((margin, cursor_y), line, fill="black", font=body_font) - cursor_y += line_gap - cursor_y += int(line_gap * 0.4) # paragraph spacing - - return img - - -def main(): + + +def _selected_pages(source: dict, total_pages: int) -> list[int]: + pages = source.get("pages") + if pages: + selected = [int(p) for p in pages if 1 <= int(p) <= total_pages] + else: + start = int(source.get("start_page", 1)) + max_pages = int(source.get("max_pages", 6)) + selected = list(range(start, min(total_pages, start + max_pages - 1) + 1)) + + if not selected: + raise ValueError(f"No valid pages selected for {source['slug']} ({total_pages} pages)") + return selected + + +def _pdf_page_count_with_pymupdf(pdf_path: Path) -> int: + import fitz + + with fitz.open(pdf_path) as doc: + return doc.page_count + + +def _render_with_pdf2image(pdf_path: Path, page_number: int, out_path: Path, dpi: int) -> None: + from pdf2image import convert_from_path + + images = convert_from_path( + str(pdf_path), + dpi=dpi, + first_page=page_number, + last_page=page_number, + fmt="png", + single_file=True, + ) + if not images: + raise RuntimeError(f"pdf2image returned no image for {pdf_path} page {page_number}") + images[0].save(out_path) + + +def _render_with_pymupdf(pdf_path: Path, page_number: int, out_path: Path, dpi: int) -> None: + import fitz + + zoom = dpi / 72 + matrix = fitz.Matrix(zoom, zoom) + with fitz.open(pdf_path) as doc: + page = doc.load_page(page_number - 1) + pixmap = 
page.get_pixmap(matrix=matrix, alpha=False) + pixmap.save(out_path) + + +def _render_page(pdf_path: Path, page_number: int, out_path: Path, dpi: int, backend: str) -> str: + out_path.parent.mkdir(parents=True, exist_ok=True) + if backend in {"auto", "pdf2image"}: + try: + _render_with_pdf2image(pdf_path, page_number, out_path, dpi) + return "pdf2image" + except Exception as exc: + if backend == "pdf2image": + raise + print( + f" pdf2image unavailable for {pdf_path.name} p.{page_number} " + f"({type(exc).__name__}); falling back to PyMuPDF", + file=sys.stderr, + ) + + _render_with_pymupdf(pdf_path, page_number, out_path, dpi) + return "pymupdf" + + +def main() -> None: here = Path(__file__).resolve().parent - pages_path = here / "pages.json" - if not pages_path.exists(): - print("pages.json not found; run fetch_dataset.py first", file=sys.stderr) + root = here.parent + manifest_path = here / "pdfs_manifest.json" + if not manifest_path.exists(): + print("pdfs_manifest.json not found; run `python data/fetch_pdfs.py` first", file=sys.stderr) sys.exit(1) - config = yaml.safe_load((here.parent / "config.yaml").read_text()) - render = config["render"] + + config = yaml.safe_load((root / "config.yaml").read_text()) + render_config = config.get("render", {}) + dpi = int(render_config.get("dpi", 160)) + backend = render_config.get("backend", "auto") + active_backend = backend out_dir = here / "pages" - out_dir.mkdir(exist_ok=True) - - pages = json.loads(pages_path.read_text()) - for p in pages: - img = render_page( - p, - width=render["width"], - height=render["height"], - body_size=render["body_font_size"], - title_size=render["title_font_size"], - ) - out = out_dir / f"{p['page_id']}.png" - img.save(out) - print(f" {p['client']:10s} {p['page_id']:10s} -> {out.relative_to(here.parent)}") - print(f"Rendered {len(pages)} pages to {out_dir}") + + pdf_manifest = json.loads(manifest_path.read_text()) + page_manifest: list[dict] = [] + backend_counts: dict[str, int] = {} + + for 
source in pdf_manifest["sources"]: + pdf_path = here / source["pdf_path"] + if not pdf_path.exists(): + raise FileNotFoundError(f"Missing PDF: {pdf_path}. Run data/fetch_pdfs.py.") + + total_pages = _pdf_page_count_with_pymupdf(pdf_path) + for page_number in _selected_pages(source, total_pages): + page_id = f"{source['client']}__{source['slug']}__p{page_number:03d}" + image_path = out_dir / source["client"] / f"{source['slug']}_p{page_number:03d}.png" + used_backend = _render_page(pdf_path, page_number, image_path, dpi, active_backend) + if backend == "auto" and used_backend == "pymupdf": + active_backend = "pymupdf" + backend_counts[used_backend] = backend_counts.get(used_backend, 0) + 1 + + rel_image_path = image_path.relative_to(here) + page_manifest.append( + { + "page_id": page_id, + "client": source["client"], + "title": source["title"], + "publisher": source["publisher"], + "license": source["license"], + "source_url": source["url"], + "source_pdf": source["source_pdf"], + "source_pdf_path": source["pdf_path"], + "page_number": page_number, + "image_path": str(rel_image_path), + } + ) + print( + f" {source['client']:12s} {source['slug']:38s} " + f"p.{page_number:<4d} -> data/{rel_image_path}" + ) + + out = here / "pages_manifest.json" + out.write_text(json.dumps(page_manifest, indent=2) + "\n") + + print(f"\nRendered {len(page_manifest)} pages to {out_dir}") + print(f"Wrote page manifest to {out}") + for name, count in sorted(backend_counts.items()): + print(f" {name}: {count} pages") if __name__ == "__main__": diff --git a/examples/vision-doc-rag/python/ingest.py b/examples/vision-doc-rag/python/ingest.py index 15607f30..8b0f8e11 100644 --- a/examples/vision-doc-rag/python/ingest.py +++ b/examples/vision-doc-rag/python/ingest.py @@ -1,9 +1,10 @@ """Build the per-tenant visual index. -For every page PNG we ask SIE to encode the image with vidore/colqwen2.5-v0.2, -which returns a [tokens, 128] multivector. 
Each page's multivector goes into a -single .npz on disk, alongside a metadata.json that keeps the client name, -page id, title, and source url for routing and filtering at query time. +For every rendered PDF page PNG we ask SIE to encode the image with +vidore/colqwen2.5-v0.2, which returns a [tokens, 128] multivector. Each page's +multivector goes into a single .npz on disk, alongside a metadata.json that +keeps the client name, source PDF, page number, and source URL for routing, +filtering, and citation at query time. There is no vector database here. MaxSim at the scale of one team's wiki (hundreds to thousands of pages) is cheap and avoids the indexing step. @@ -30,22 +31,22 @@ def load_config(): def load_pages(): - pages_path = Path(__file__).resolve().parent.parent / "data" / "pages.json" + pages_path = Path(__file__).resolve().parent.parent / "data" / "pages_manifest.json" if not pages_path.exists(): raise FileNotFoundError( - "data/pages.json not found. Run `python data/fetch_dataset.py` " + "data/pages_manifest.json not found. Run `python data/fetch_pdfs.py` " "and `python data/render_pages.py` first." ) return json.loads(pages_path.read_text()) def encode_pages(client: SIEClient, model: str, pages: list[dict], gpu: str, timeout: float): - pages_dir = Path(__file__).resolve().parent.parent / "data" / "pages" + data_dir = Path(__file__).resolve().parent.parent / "data" multivectors: list[np.ndarray] = [] metadata: list[dict] = [] for i, page in enumerate(pages, 1): - image_path = pages_dir / f"{page['page_id']}.png" + image_path = data_dir / page["image_path"] if not image_path.exists(): raise FileNotFoundError(f"Missing page image: {image_path}. 
Run data/render_pages.py.") @@ -66,14 +67,18 @@ def encode_pages(client: SIEClient, model: str, pages: list[dict], gpu: str, tim "page_id": page["page_id"], "client": page["client"], "title": page["title"], - "space": page["space"], - "author": page["author"], - "web_url": page["web_url"], - "image_path": str(image_path.relative_to(image_path.parent.parent.parent)), + "publisher": page["publisher"], + "license": page["license"], + "source_url": page["source_url"], + "source_pdf": page["source_pdf"], + "source_pdf_path": page["source_pdf_path"], + "page_number": page["page_number"], + "image_path": page["image_path"], "num_tokens": int(mv.shape[0]), } ) - print(f" [{i}/{len(pages)}] {page['page_id']:10s} {page['client']:10s} {mv.shape} in {elapsed:.1f}s") + citation = f"{page['source_pdf']} · p.{page['page_number']}" + print(f" [{i}/{len(pages)}] {page['client']:12s} {citation:44s} {mv.shape} in {elapsed:.1f}s") return multivectors, metadata diff --git a/examples/vision-doc-rag/python/requirements.txt b/examples/vision-doc-rag/python/requirements.txt index bd32dcbc..1ea77ae9 100644 --- a/examples/vision-doc-rag/python/requirements.txt +++ b/examples/vision-doc-rag/python/requirements.txt @@ -1,6 +1,8 @@ -sie-sdk==0.1.10 +sie-sdk==0.3.4 fastapi>=0.115.0 uvicorn>=0.30.0 numpy>=1.26.0 pyyaml>=6.0 Pillow>=10.3.0 +pdf2image>=1.17.0 +PyMuPDF>=1.24.0 diff --git a/examples/vision-doc-rag/python/search.py b/examples/vision-doc-rag/python/search.py index 52dd2211..a4d6b3ca 100644 --- a/examples/vision-doc-rag/python/search.py +++ b/examples/vision-doc-rag/python/search.py @@ -44,6 +44,15 @@ def load_index(): raise FileNotFoundError("data/multivectors.npz missing. 
Run `python python/ingest.py` first.") npz = np.load(data_dir / "multivectors.npz") metadata = json.loads((data_dir / "metadata.json").read_text()) + required = {"page_id", "client", "source_pdf", "page_number", "image_path", "publisher", "source_url"} + if metadata: + missing = required - set(metadata[0]) + if missing: + raise ValueError( + "data/metadata.json was generated by an older corpus shape. " + "Run `python data/fetch_pdfs.py`, `python data/render_pages.py`, " + "then `python python/ingest.py`." + ) multivectors = {m["page_id"]: npz[m["page_id"]] for m in metadata} return multivectors, metadata @@ -173,7 +182,7 @@ def search( provision_timeout_s=timeout, ) timings["docvqa_s"] = round(time.time() - t0, 3) - answer = _docvqa_answer(qa[0].get("entities", []) if qa else []) + answer = _docvqa_answer(qa["entities"]) except Exception as exc: timings["docvqa_error"] = type(exc).__name__ @@ -191,7 +200,7 @@ def search( provision_timeout_s=timeout, ) timings["ocr_snippet_s"] = round(time.time() - t0, 3) - top["ocr_snippet"] = _ocr_snippet(ocr[0].get("entities", []) if ocr else []) + top["ocr_snippet"] = _ocr_snippet(ocr["entities"]) except Exception as exc: timings["ocr_snippet_error"] = type(exc).__name__ @@ -211,11 +220,11 @@ def print_run(out: dict, query: str, client_filter: str | None): rerank = r.get("_rerank_score") rerank_str = f"rerank={rerank:.4f}" if rerank is not None else "rerank=—" print(f"\n {i}. 
[{r['client']}] {r['title']}") - print(f" {r['page_id']} · {r['space']} · {r['author']}") + print(f" {r['source_pdf']} · p.{r['page_number']} · {r['publisher']}") print(f" maxsim={r['_maxsim_score']:.3f} {rerank_str}") if r.get("ocr_snippet"): print(f" OCR snippet: {r['ocr_snippet'][:200]}") - print(f" url: {r['web_url']}") + print(f" url: {r['source_url']}") def main(): @@ -227,11 +236,11 @@ def main(): api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) demo = [ - ("how do I sign in to the VPN?", "acme-corp"), - ("what is the parental leave policy?", "globex"), - ("audit prep evidence and walkthroughs", "initech"), + ("Raspberry Pi Pico pinout GP21", "embedded-lab"), + ("cloud native architecture diagram", "ops-eng"), + ("solid rocket motor nozzle design figure", "aerospace"), # No tenant filter: shows the query routes across tenants. - ("expense reports and per diem", None), + ("ATmega16U2 power tree diagram", None), ] with SIEClient(cluster_url, api_key=api_key) as client: for query, tenant in demo: diff --git a/examples/vision-doc-rag/python/server.py b/examples/vision-doc-rag/python/server.py index d61e5962..990857fa 100644 --- a/examples/vision-doc-rag/python/server.py +++ b/examples/vision-doc-rag/python/server.py @@ -81,10 +81,13 @@ def api_search( "page_id": r["page_id"], "client": r["client"], "title": r["title"], - "space": r["space"], - "author": r["author"], - "web_url": r["web_url"], - "page_image": f"/pages/{r['page_id']}.png", + "publisher": r["publisher"], + "license": r["license"], + "source_url": r["source_url"], + "source_pdf": r["source_pdf"], + "page_number": r["page_number"], + "citation": f"{r['source_pdf']} · p.{r['page_number']}", + "page_image": f"/{r['image_path']}", "ocr_snippet": r.get("ocr_snippet", ""), "scores": { "maxsim": round(r["_maxsim_score"], 4), diff --git a/examples/vision-doc-rag/static/index.html b/examples/vision-doc-rag/static/index.html index 392c8791..2b3eb7c3 100644 --- 
a/examples/vision-doc-rag/static/index.html +++ b/examples/vision-doc-rag/static/index.html @@ -120,7 +120,7 @@

Multi-Tenant Visual Doc Search + QA

- +
@@ -135,6 +135,14 @@

Multi-Tenant Visual Doc Search + QA

 const answerEl = document.getElementById("answer"); const statsEl = document.getElementById("stats"); + function escapeHtml(value) { + return String(value ?? "") + .replace(/&/g, "&amp;") + .replace(/</g, "&lt;") + .replace(/>/g, "&gt;") + .replace(/"/g, "&quot;"); + } + async function loadStats() { const r = await fetch("/api/stats").then(r => r.json()); for (const c of r.clients) { @@ -162,7 +170,7 @@

Multi-Tenant Visual Doc Search + QA

answerEl.innerHTML = `
Answer (Florence-2-DocVQA)
-
${res.answer.replace(/ +
${escapeHtml(res.answer)}
`; } if (!res.results.length) { @@ -173,12 +181,13 @@

Multi-Tenant Visual Doc Search + QA

const rerank = r.scores.rerank == null ? "—" : r.scores.rerank; return `
- ${r.title} + ${escapeHtml(r.title)}
-
${r.title}
-
${r.client} ${r.space} · ${r.author} · ${r.page_id}
+
${escapeHtml(r.title)}
+
${escapeHtml(r.client)} ${escapeHtml(r.citation)} · ${escapeHtml(r.publisher)}
+
${escapeHtml(r.license)} · source PDF
maxsim=${r.scores.maxsim} rerank=${rerank}
- ${r.ocr_snippet ? `
${r.ocr_snippet.replace(/` : ""} + ${r.ocr_snippet ? `
${escapeHtml(r.ocr_snippet)}
` : ""}
`; }).join(""); From d60063dabf1d9f93bc7cca73c38bb4bda62145be Mon Sep 17 00:00:00 2001 From: svonava Date: Sun, 24 May 2026 23:05:05 -0700 Subject: [PATCH 3/3] examples(vision-doc-rag): expand test query set Grow the CLI demo from 4 to 8 queries (one each from new categories), and replace the README's 5-row "Try these queries" table with 4 sectioned tables (17 rows) covering visual signal, table/value lookups, disambiguation pairs, and tenant-leak negatives. Each row now names the expected target page so ranking regressions show up at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) --- examples/vision-doc-rag/README.md | 44 ++++++++++++++++++++---- examples/vision-doc-rag/python/search.py | 9 +++++ 2 files changed, 47 insertions(+), 6 deletions(-) diff --git a/examples/vision-doc-rag/README.md b/examples/vision-doc-rag/README.md index f2ca3775..3f3c586f 100644 --- a/examples/vision-doc-rag/README.md +++ b/examples/vision-doc-rag/README.md @@ -93,13 +93,45 @@ explicit GPU class. ## Try these queries -| Tenant | Query | Why it's interesting | +Queries are grouped by what they exercise. Each row names the expected target +page so you can spot regressions at a glance. + +### Visual signal — the ranking comes from the page image, not OCR + +| Tenant | Query | Expected target | Why it's interesting | +|---|---|---|---| +| `embedded-lab` | Raspberry Pi Pico pinout GP21 | Pi Pico datasheet pinout (pp 4-5) | Abbreviated visual label still drives retrieval. | +| `embedded-lab` | where is the ATmega16U2 on the schematic? | Arduino UNO R3 schematic (pp 1-2) | Circuit schematic retrieval, not prose. | +| `ops-eng` | cloud native architecture diagram | CNCF AI whitepaper or Kubernetes slides | Visual architecture page instead of OCR text. | +| `aerospace` | solid rocket motor nozzle design figure | Solid rocket motor nozzles report | Engineering drawing in a figure-heavy report. 
| + +### Table / value lookups — the DocVQA answer is the point + +| Tenant | Query | Expected target | Expected answer | +|---|---|---|---| +| `embedded-lab` | What is the operating voltage range of the Raspberry Pi Pico? | Pi Pico datasheet electrical characteristics (pp 6-8) | A voltage range, e.g. 1.8-5.5 V | +| `embedded-lab` | Which Arduino UNO pin is the built-in LED on? | UNO R3 datasheet pinout (pp 5-11) | D13 / PB5 | +| `ops-eng` | PostgreSQL default listening port | PG 18 manual config section (pp 19-24) | 5432 | +| `ops-eng` | What is the default value of max_connections in PostgreSQL? | PG 18 manual parameter table (pp 19-24) | 100 | +| `aerospace` | What is the throat diameter shown in the nozzle drawing? | Nozzle design figure | A labeled dimension off the drawing | + +### Disambiguation — two PDFs in one tenant, the right one must win + +| Tenant | Query | Should pick | Should beat | +|---|---|---|---| +| `aerospace` | solid propellant rocket nozzle cross-section | `solid-rocket-motor-nozzles.pdf` | `liquid-rocket-engine-nozzles.pdf` | +| `aerospace` | regeneratively cooled nozzle | `liquid-rocket-engine-nozzles.pdf` (regen cooling is liquid-specific) | `solid-rocket-motor-nozzles.pdf` | +| `embedded-lab` | USB-to-serial interface chip on the schematic | `arduino-uno-r3-schematic.pdf` (ATmega16U2) | `raspberry-pi-pico-datasheet.pdf` | +| `embedded-lab` | RP2040 GPIO function table | `raspberry-pi-pico-datasheet.pdf` | `arduino-uno-r3-datasheet.pdf` | + +### Tenant-leak negatives — the matching content lives in a different tenant + +| Scoped to | Query | Pass condition | |---|---|---| -| `embedded-lab` | Raspberry Pi Pico pinout GP21 | Should land on a pinout/table page even when the visual label is abbreviated. | -| `embedded-lab` | where is the ATmega16U2 on the schematic? | Circuit schematic retrieval, not prose retrieval. 
| -| `ops-eng` | cloud native architecture diagram | Finds a visual architecture page or slide instead of relying on OCR text only. | -| `aerospace` | solid rocket motor nozzle design figure | Targets an engineering drawing or figure-heavy report page. | -| `ops-eng` | Raspberry Pi Pico pinout GP21 | Tenant filter: the query cannot leak embedded-lab pages when scoped to ops-eng. | +| `ops-eng` | Raspberry Pi Pico pinout GP21 | No embedded-lab pages return. | +| `ops-eng` | regeneratively cooled nozzle | No aerospace pages return. | +| `aerospace` | cloud native architecture diagram | No ops-eng pages return. | +| `embedded-lab` | PostgreSQL connection pool | No ops-eng pages return. | ## API diff --git a/examples/vision-doc-rag/python/search.py b/examples/vision-doc-rag/python/search.py index a4d6b3ca..74f86cd1 100644 --- a/examples/vision-doc-rag/python/search.py +++ b/examples/vision-doc-rag/python/search.py @@ -236,11 +236,20 @@ def main(): api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) demo = [ + # Visual signal — ranking is driven by the page image. ("Raspberry Pi Pico pinout GP21", "embedded-lab"), ("cloud native architecture diagram", "ops-eng"), ("solid rocket motor nozzle design figure", "aerospace"), # No tenant filter: shows the query routes across tenants. ("ATmega16U2 power tree diagram", None), + # Table / value lookup — DocVQA must return a specific value, not the title. + ("What is the operating voltage range of the Raspberry Pi Pico?", "embedded-lab"), + ("PostgreSQL default listening port", "ops-eng"), + # Disambiguation — two PDFs in one tenant; the right one must win. + ("solid propellant rocket nozzle cross-section", "aerospace"), + # Tenant-leak negative — the matching content lives in aerospace; scoping + # to ops-eng must return no aerospace pages. + ("regeneratively cooled nozzle", "ops-eng"), ] with SIEClient(cluster_url, api_key=api_key) as client: for query, tenant in demo:
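The tenant-leak rows in the README hunk pass only when a scoped query returns no pages from another tenant. Below is a minimal sketch of that check, assuming the tenant filter is applied before MaxSim scoring. The names `maxsim`, `rank_pages`, and `leaked_clients` are hypothetical and are not taken from search.py; the 2-dimensional toy vectors stand in for the [tokens, 128] multivectors.

```python
# Hypothetical sketch: none of these names come from search.py.
from typing import Optional


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for each query token, take its best dot
    product against any page token, then sum over query tokens."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def rank_pages(query_vecs, index, metadata, client_filter: Optional[str] = None):
    """Score only pages whose tenant matches client_filter (all pages when it
    is None), so another tenant's pages never enter the candidate set."""
    candidates = [m for m in metadata if client_filter in (None, m["client"])]
    scored = [(maxsim(query_vecs, index[m["page_id"]]), m) for m in candidates]
    # Sort on the score only; the metadata dicts are not comparable.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def leaked_clients(results, scoped_to: str):
    """Tenants other than scoped_to that appear in results; empty means pass."""
    return {m["client"] for _, m in results} - {scoped_to}


# Toy data: two pages in two tenants (assumed shapes, for illustration only).
index = {
    "ops-eng__a__p001": [[1.0, 0.0], [0.0, 1.0]],
    "aerospace__b__p001": [[0.9, 0.1]],
}
metadata = [
    {"page_id": "ops-eng__a__p001", "client": "ops-eng"},
    {"page_id": "aerospace__b__p001", "client": "aerospace"},
]
query = [[1.0, 0.0]]

scoped = rank_pages(query, index, metadata, client_filter="ops-eng")
unscoped = rank_pages(query, index, metadata)
```

Filtering before scoring is one possible design: it keeps other tenants' vectors out of the candidate set entirely instead of dropping them after ranking. `leaked_clients(scoped, "ops-eng")` returning an empty set corresponds to the "No aerospace pages return." pass condition.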