| Document | What it shows |
|---|---|
| README.md | Problem framing, approach, eval results, architecture |
| EVAL_REPORT.md | Full benchmark report — per-service precision/recall, error analysis, post-processing rules |
| eval/results/battery_results.txt | Raw terminal output for all 25 services — unedited, verifiable |
| eval/scan_full_battery.py | Full 25-service evaluation script — reproduces all results with a Gemini API key |
| eval/generate_eval_charts.py | Chart generation script — produces all 5 evaluation charts |
| server/postprocess.ts | Post-processing validation rules (D1–D7) |
| server/prompts.ts | Prompt engineering — ensemble prompts + Privacy Policy scan prompt |

Results are fully reproducible. Run `python -X utf8 eval/scan_full_battery.py` with a Gemini API key to verify.

Terms of Service and Privacy Policy documents average 5,000–20,000 words. 91% of users never read them. Yet these documents contain clauses that authorize AI training on personal data, third-party data selling, and forced arbitration — all with real legal consequences.
Business KPI: Reduce time to understand privacy risk from ~30 minutes (manual reading) to ~30 seconds (automated classification), with measurable precision and recall against ground truth labels from tosdr.org.
A simple keyword matcher (baseline) achieves ~55% recall — it misses violations expressed in indirect language ("trusted partners", "personalized content", "ecosystem partners"). Legal language is deliberately evasive.
A single gemini-2.5-flash call achieves ~80% recall but suffers from false positives — it hallucinates violations from ban clauses ("you may not use automated means...") and misclassifies feedback submission clauses as content ownership violations.
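
The gap between a keyword baseline and indirect phrasing can be illustrated with a minimal sketch. The keyword list, the sample clauses, and the `keywordFlag` helper are hypothetical, chosen only to show the failure mode; they are not the project's actual baseline.

```typescript
// Hypothetical keyword list for the data_selling pillar (not the project's real list).
const DATA_SELLING_KEYWORDS: string[] = [
  "sell your data",
  "data broker",
  "sold to third parties",
];

// Returns true when any keyword appears verbatim in the clause (case-insensitive).
function keywordFlag(clause: string, keywords: string[]): boolean {
  const text: string = clause.toLowerCase();
  return keywords.some((kw: string): boolean => text.indexOf(kw) !== -1);
}

// Made-up clauses: one direct, one using the indirect phrasing quoted above.
const directClause: string = "We may sell your data to advertisers.";
const indirectClause: string =
  "We share information with trusted partners to deliver personalized content.";

const directHit: boolean = keywordFlag(directClause, DATA_SELLING_KEYWORDS);
const indirectHit: boolean = keywordFlag(indirectClause, DATA_SELLING_KEYWORDS);
// directHit is true; indirectHit is false because no listed keyword appears.
```

The second clause describes the same kind of sharing but contains none of the listed phrases, which is the recall gap the model-based approach is meant to close.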

    Primary Model (Flash) ────┐
                              ├──► Ensemble Merge ──► Post-Processing (D1–D7) ──► Final Result
    Corroborator (Flash-Lite) ┘           ↑                      ↑
                                   HIGH confidence         Deterministic
                                    gate required          rule overrides

- Ensemble: Flash + Flash-Lite must agree at HIGH confidence for a violation to be flagged
- Post-processing rules (D1–D7): Deterministic code overrides model decisions for known failure modes
- Privacy Policy co-scan: Privacy Policy fetched separately for `data_selling`, because this information lives in the Privacy Policy, not the Terms of Service
- Null hypothesis: Default is no violation. The model must provide a verbatim citation as proof before a flag is accepted
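
The HIGH confidence gate and the null-hypothesis default can be sketched as a small merge function. This is a minimal illustration under stated assumptions: the `Verdict` shape, the `MEDIUM`/`LOW` confidence levels, and the function names are hypothetical and are not taken from the project's source.

```typescript
// Confidence levels are an assumption; the README only names HIGH.
type Confidence = "HIGH" | "MEDIUM" | "LOW";

// One model's verdict for a single pillar (hypothetical shape).
interface Verdict {
  violation: boolean;
  confidence: Confidence;
  citation: string; // verbatim quote from the document, "" if none
}

// Null hypothesis: no violation unless the verdict is HIGH confidence
// and its citation appears verbatim in the source text.
function isProven(v: Verdict, documentText: string): boolean {
  return v.violation
    && v.confidence === "HIGH"
    && v.citation.length > 0
    && documentText.indexOf(v.citation) !== -1;
}

// Flag only when both models independently clear the gate.
function mergeEnsemble(primary: Verdict, corroborator: Verdict, documentText: string): boolean {
  return isProven(primary, documentText) && isProven(corroborator, documentText);
}

// Made-up example text and verdicts.
const sampleText: string = "We may use your content to train our machine learning models.";
const flashVerdict: Verdict = {
  violation: true,
  confidence: "HIGH",
  citation: "use your content to train our machine learning models",
};
const liteAgrees: Verdict = { violation: true, confidence: "HIGH", citation: flashVerdict.citation };
const liteUnsure: Verdict = { violation: true, confidence: "MEDIUM", citation: flashVerdict.citation };

const flaggedWhenBothHigh: boolean = mergeEnsemble(flashVerdict, liteAgrees, sampleText);
const flaggedWhenOneUnsure: boolean = mergeEnsemble(flashVerdict, liteUnsure, sampleText);
```

Requiring agreement trades some recall on borderline clauses for fewer single-model hallucinations; the deterministic rules then run on whatever survives the merge.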
Benchmarked against 25 real services spanning tosdr.org grades A–F, with the tosdr.org grades as ground truth.
| Scan Mode | Rating Accuracy | Precision | Recall | Avg Latency |
|---|---|---|---|---|
| Basic (Flash only) | 22/25 | 89% | 79% | ~12s |
| Deep (Ensemble) | 25/25 | 94% | 93% | ~25s |
Ensemble gain over single model: +14 percentage points recall, +5 percentage points precision.
True Negative Rate: 6/6 — zero false positives on Grade A+B (clean) services.
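
For readers who want to recompute the metrics from the raw output, the following sketch shows the standard precision and recall definitions. The `Counts` shape and the example counts are placeholders for illustration only; they are not taken from `battery_results.txt`.

```typescript
// Pillar-level confusion counts (hypothetical shape).
interface Counts {
  tp: number; // violations flagged that match ground truth
  fp: number; // violations flagged that ground truth does not list
  fn: number; // ground-truth violations the scan missed
}

function precision(c: Counts): number {
  return c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp);
}

function recall(c: Counts): number {
  return c.tp + c.fn === 0 ? 0 : c.tp / (c.tp + c.fn);
}

// Placeholder counts, NOT the project's actual results.
const exampleCounts: Counts = { tp: 45, fp: 5, fn: 15 };
const examplePrecision: number = precision(exampleCounts); // 45 / 50
const exampleRecall: number = recall(exampleCounts);       // 45 / 60

// Gains between scan modes are differences in percentage points,
// using the Basic and Deep figures from the table above.
const recallGainPts: number = 93 - 79;
const precisionGainPts: number = 94 - 89;
```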
Figure 1 — BASIC vs DEEP aggregate metrics across 25 services
Figure 2 — Per-service Precision and Recall for DEEP scan
Figure 3 — False Negative and False Positive counts by privacy pillar
Figure 4 — Grade distribution and average recall per grade tier
Figure 5 — Per-service accuracy grid (green = correct, red = incorrect)
Full per-service results with precision/recall breakdowns in EVAL_REPORT.md.
| # | Pillar | What It Detects |
|---|---|---|
| 1 | AI Training | Service uses your data to train AI models without explicit consent |
| 2 | Data Selling | Data shared with third parties for their own commercial benefit |
| 3 | Transparency | Intentionally vague, evasive, or confusing language |
| 4 | Data Retention | No clear deletion path or excessive retention after account closure |
| 5 | Content Ownership | Broad sublicensable license to user-generated content |
| 6 | Dark Patterns | Forced arbitration, class action waivers, liability caps |
Structured error analysis across 25 services identified the root cause of every false positive and false negative. Deterministic rules (D1–D7) override model output for known failure modes:
| Rule | Type | Problem | Fix |
|---|---|---|---|
| D1 | False positive fix | `ai_training` flagged without "train"/"fine-tune" in the cited text | Require a training-related keyword in the citation |
| D2 | False positive fix | Ban clauses flagged as violations ("you may not use automated means") | Blocklist of prohibition-prefix patterns |
| D3 | False positive fix | `transparency` flagged on scoped policy subsections | Detect section-scoping language and clear |
| D4 | False positive fix | Feedback/submission clauses misclassified as `content_ownership` | Detect whether clause covers incoming feedback vs. published content |
| D5 | False positive fix | Privacy Policy scan fires on service-provider-only policies | Skip model call if Privacy Policy has zero commercial-sharing keywords |
| D6 | False positive fix | `data_retention` flagged on payment delinquency/suspension clauses | Detect delinquent-account language and clear |
| D7 | False positive fix | `dark_patterns` flagged on generic liability-limit boilerplate | Require explicit cap amount ("shall not exceed", "$X") before flagging |
Before D1–D7: Deep precision ~65%, multiple false positives per service.
After D1–D7: Deep precision 94%, false positives isolated to structural data_selling ambiguity.
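
Rules D1 and D7 both amount to requiring specific evidence in the cited text. The sketch below shows one way such checks could look. The `Finding` shape, the regular expressions, and the choice to scope D7 to citations that mention liability are assumptions made for illustration; the actual rules live in `server/postprocess.ts` and may differ.

```typescript
// A single pillar result after the ensemble merge (hypothetical shape).
interface Finding {
  pillar: string;     // e.g. "ai_training", "dark_patterns"
  violation: boolean;
  citation: string;   // verbatim text the model quoted
}

// D1 (assumed pattern): the citation must mention training or fine-tuning.
const TRAINING_PATTERN: RegExp = /\btrain|\bfine-tun/i;

// D7 (assumed patterns): a liability clause needs an explicit cap to count.
const LIABILITY_PATTERN: RegExp = /liab/i;
const CAP_PATTERN: RegExp = /shall not exceed|\$\s?\d/i;

function cleared(f: Finding): Finding {
  return { pillar: f.pillar, violation: false, citation: f.citation };
}

// Returns the finding unchanged, or a cleared copy when a rule fires.
function applyCitationRules(f: Finding): Finding {
  if (!f.violation) {
    return f;
  }
  if (f.pillar === "ai_training" && !TRAINING_PATTERN.test(f.citation)) {
    return cleared(f); // D1: no training keyword in the citation
  }
  if (
    f.pillar === "dark_patterns"
    && LIABILITY_PATTERN.test(f.citation)
    && !CAP_PATTERN.test(f.citation)
  ) {
    return cleared(f); // D7: liability boilerplate without a cap amount
  }
  return f;
}
```

Scoping D7 to liability language keeps arbitration and class-action findings untouched, since those fall under the same pillar but are not boilerplate liability limits.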
Three systematic failure modes required non-model solutions:
1. Ban clauses look like violations
"using automated means to access content from any of our services" — Google ToS
The model flags this as `ai_training`. A human reads it as a prohibition. D2 detects the context and overrides.
2. Feedback clauses look like content ownership
"Netflix is free to use any comments, information, ideas, concepts, feedback..." — Netflix ToS
The model flags this as `content_ownership`. D4 detects "feedback/comments" without published-content markers and clears it.
3. Data selling language lives in the Privacy Policy, not the Terms of Service
Terms of Service rarely mention data brokers. A separate Privacy Policy scan fetches and analyzes the Privacy Policy using a dedicated prompt tuned for commercial sharing language — catching indirect phrasing like "marketing partners", "advertising ecosystem".
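
The first two failure modes lend themselves to simple pattern checks. The following is a minimal sketch of how D2 and D4 might be expressed, assuming the rule receives both the citation and its surrounding sentence. All patterns, function names, and the surrounding-sentence example are hypothetical and are not copied from `server/postprocess.ts`.

```typescript
// D2 (assumed patterns): phrases signalling the clause restricts the user,
// not the service.
const PROHIBITION_PREFIXES: RegExp[] = [
  /\byou (may|must|shall|will) not\b/i,
  /\byou agree not to\b/i,
  /\bis (strictly )?prohibited\b/i,
];

function isBanClause(context: string): boolean {
  return PROHIBITION_PREFIXES.some((re: RegExp): boolean => re.test(context));
}

// D4 (assumed patterns): feedback wording with no sign of published content.
const FEEDBACK_MARKERS: RegExp = /\b(feedback|comments|suggestions|ideas)\b/i;
const PUBLISHED_CONTENT_MARKERS: RegExp = /\b(upload|post|publish)|\b(user|your) content\b/i;

function isFeedbackOnly(citation: string): boolean {
  return FEEDBACK_MARKERS.test(citation) && !PUBLISHED_CONTENT_MARKERS.test(citation);
}

// Returns true when a flagged finding should be overridden to "no violation".
function shouldClear(pillar: string, citation: string, context: string): boolean {
  if (isBanClause(context)) {
    return true; // D2: the clause is a prohibition on the user
  }
  if (pillar === "content_ownership" && isFeedbackOnly(citation)) {
    return true; // D4: unsolicited feedback, not published user content
  }
  return false;
}

// The citation is the quoted Google fragment; the surrounding sentence is invented.
const banCitation: string = "using automated means to access content from any of our services";
const banContext: string = "You may not misuse our services, for example by " + banCitation + ".";
const banCleared: boolean = shouldClear("ai_training", banCitation, banContext);

const feedbackCitation: string =
  "Netflix is free to use any comments, information, ideas, concepts, feedback";
const feedbackCleared: boolean = shouldClear("content_ownership", feedbackCitation, feedbackCitation);
```

Passing the surrounding sentence matters for D2 because the quoted fragment alone often omits the "you may not" prefix that marks it as a prohibition.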
┌────────────────────────── Browser (Chrome / Firefox) ──────────────────────────┐
│ │
│ content.js background.js (SW) popup.html / popup.js │
│ ┌────────────────┐ ┌──────────────────┐ ┌────────────────────────────┐ │
│ │ Detect T&C │ │ SSE stream reader │ │ Tier picker │ │
│ │ Extract text │───▶│ Auth token attach │ │ ELI5 / dark patterns │ │
│ │ Inject badge │◀───│ Credit error UI │ │ Sign-in / credits │ │
│ │ Highlight cite │ │ Keepalive pings │ │ GDPR email / batch scan │ │
│ └────────────────┘ └──────────────────┘ └────────────────────────────┘ │
└────────────────────────────────┬──┬──────────────────────────────────────────────┘
│ │ SSE
┌────────────▼──┴──────────────────────────────────┐
│ Express Backend (Google Cloud Run) │
│ │
│ 1. Firebase Auth token verify │
│ 2. Credit deduction (Firestore transaction) │
│ 3. L1 in-memory LRU cache lookup │
│ 4. L2 Firestore shared_cache lookup │
│ 5. Sentence-aware chunking (compromise NLP) │
│ 6. Privacy Policy co-scan (data_selling) │
│ 7. LLM inference — Flash primary │
│ 8. LLM corroboration — Flash-Lite ensemble │
│ 9. Ensemble merge (HIGH confidence gate) │
│ 10. Post-processing validation (D1–D7 rules) │
│ 11. Citation grounding + JSON extraction │
│ 12. Aggregation + score computation │
│ 13. Write to L1 + L2 cache │
│ 14. SSE stream result to extension │
└───────────────────────────────────────────────────┘
│
┌────────────────────▼──────────────────────────────┐
│ Google Gemini API (AI Studio) │
│ Primary: gemini-2.5-flash │
│ Corroborator: gemini-2.5-flash-lite │
└───────────────────────────────────────────────────┘
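
Step 5 of the backend pipeline splits documents on sentence boundaries before inference. The sketch below illustrates the idea with a naive punctuation-based splitter standing in for the `compromise` library, which the project actually uses; `splitSentences`, `chunkBySentence`, and the `maxChars` parameter are hypothetical names.

```typescript
// Naive sentence splitter: runs of text ending in ., ! or ?.
// This is a stand-in for real sentence segmentation and will mis-split
// abbreviations such as "U.S.".
function splitSentences(text: string): string[] {
  const matches: RegExpMatchArray | null = text.match(/[^.!?]+(?:[.!?]+|$)/g);
  if (matches === null) {
    return [];
  }
  return matches
    .map((s: string): string => s.trim())
    .filter((s: string): boolean => s.length > 0);
}

// Greedily pack whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars becomes its own oversized chunk
// rather than being cut mid-sentence.
function chunkBySentence(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current: string = "";
  splitSentences(text).forEach((sentence: string): void => {
    const candidate: string = current.length === 0 ? sentence : current + " " + sentence;
    if (candidate.length > maxChars && current.length > 0) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  });
  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}

const demoChunks: string[] = chunkBySentence("One two. Three four. Five six.", 20);
// demoChunks holds two chunks: "One two. Three four." and "Five six."
```

Keeping sentences intact means a verbatim citation is never split across two chunks, which matters for the citation grounding step later in the pipeline.
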
| Output | Description |
|---|---|
| Rating badge | SAFE / OKAY / RISKY injected into the page |
| Privacy score | 0–100 numerical score |
| Plain-English TL;DR | One-paragraph summary |
| Pillar breakdown | 6 categories with verbatim citations highlighted in the document |
| ELI5 mode | Legal jargon translated to plain English |
| Rating | Score Range | Condition |
|---|---|---|
| SAFE | 90–100 | No violations |
| OKAY | 50–89 | Minor issues only (e.g., vague transparency) |
| RISKY | 0–49 | One or more serious violations detected |
Penalty weights: Dark patterns −40 pts, AI training / data selling / data retention / content ownership −30 pts each, Transparency −20 pts.
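
The penalty weights can be read as a simple subtractive score. The sketch below assumes the score starts at 100, subtracts each violated pillar's weight once, and floors at 0; it then maps the score to a rating using the score ranges alone. Both assumptions are illustrative. The Condition column suggests the real logic may also consider which pillars fired, which this sketch does not attempt.

```typescript
// Penalty weights from the line above, keyed by pillar identifier.
const PENALTIES: Record<string, number> = {
  dark_patterns: 40,
  ai_training: 30,
  data_selling: 30,
  data_retention: 30,
  content_ownership: 30,
  transparency: 20,
};

// Assumed formula: start at 100, subtract each penalty, never go below 0.
function computeScore(violatedPillars: string[]): number {
  let total: number = 100;
  violatedPillars.forEach((pillar: string): void => {
    const penalty: number | undefined = PENALTIES[pillar];
    if (penalty !== undefined) {
      total -= penalty;
    }
  });
  return Math.max(0, total);
}

// Assumed mapping: uses only the Score Range column of the rating table.
function ratingFor(score: number): "SAFE" | "OKAY" | "RISKY" {
  if (score >= 90) {
    return "SAFE";
  }
  if (score >= 50) {
    return "OKAY";
  }
  return "RISKY";
}

const cleanScore: number = computeScore([]);                                   // 100
const vagueScore: number = computeScore(["transparency"]);                     // 100 - 20
const severeScore: number = computeScore(["dark_patterns", "ai_training"]);    // 100 - 40 - 30
```

Under these assumptions the three examples map to SAFE, OKAY, and RISKY respectively.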
| | Basic Scan | Deep Scan |
|---|---|---|
| Model | Flash only | Flash + Flash-Lite ensemble |
| Accuracy | 22/25 | 25/25 |
| Recall | 79% | 93% |
| Precision | 89% | 94% |
| Latency | ~12s | ~25s |
| Output | Rating + score + TL;DR | Full pillar breakdown + verbatim citations |
| Layer | Technology |
|---|---|
| Chrome Extension | Manifest V3, Vanilla JavaScript |
| Backend | Node.js, Express, TypeScript |
| AI Models | Google Gemini 2.5 Flash / Flash-Lite |
| NLP Chunking | compromise (sentence-aware splitting) |
| Auth and Database | Firebase Auth + Firestore |
| Cache | In-memory LRU (L1) + Firestore shared cache (L2) |
| Deployment | Google Cloud Run |
| Web App | React 19, Tailwind CSS 4 |
| Content Extraction | @mozilla/readability |

    git clone https://github.com/Jatin23K/TLDR-Shield.git
    cd TLDR-Shield
    npm install

Create a `.env` file:

    GEMINI_SCAN_KEY_1=AIza...
    GEMINI_SCAN_KEY_2=AIza...
    GEMINI_SCAN_KEY_3=AIza...

Available commands:

    npm run dev    # Express + Vite on :3000
    npm run build  # Production build
    npm run lint   # TypeScript type-check

Chrome Extension (unpacked):

- Open `chrome://extensions/`
- Enable Developer mode
- Click Load unpacked → select the `extension/` folder
- Enter your backend URL in the popup → Save

- data_selling precision gap: The Privacy Policy scan flags "marketing partners" language that sometimes refers to service providers rather than third-party data buyers. A supervised classifier trained on labeled examples of service-provider vs. data-broker language would reduce false positives.
- Document length cap: Documents above the chunk window are truncated. Multi-chunk scanning with semantic ranking would improve recall on very long policies (PayPal ToS: 120K chars, Apple ToS: 120K chars).
- Sample size: 25 services gives reliable directional estimates; precision/recall confidence intervals are ±8–10%. Expanding to 50+ services would tighten these estimates.
- Grade A/B coverage: All 25 services are Grade C–F (RISKY). The true-negative rate (6/6) was measured separately on Grade A+B services, but a larger clean-service benchmark would improve confidence.
Built with care for privacy.




