ProofLens

Visual damage claim verification — HackerRank Orchestrate June 2026

ProofLens decides whether submitted photos support a reported damage claim, contradict it, or provide too little information to judge. It processes claims about cars, laptops, and packages through a 10-component multi-agent pipeline that separates image understanding from verdict logic, so a vision model can never hallucinate a decision the way single-model approaches can.

How it works

Each claim passes through a strict sequence of deterministic and vision stages. The key constraint: no model ever outputs a verdict. Models only describe what they see. A pure-rules engine makes every decision.

claims.csv row
      │
      ▼
┌─────────────────────────────────────────────┐
│  Layer 1 — Deterministic pre-processing     │
│                                             │
│  Signal detector   →  injection / threat /  │
│                        language flags       │
│  Taxonomy normalizer  →  VLM vocab → schema │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
           Agent 1 — Hybrid claim parser
           Regex fast path for simple English;
           Gemini 2.5 Flash fallback for
           multilingual, multi-part, or
           ambiguous claims
                       │
                       ▼
           Agent 2 — Evidence requirement lookup
           Reads evidence_requirements.csv;
           sets the bar before any image is seen
                       │
             ┌─────────┴─────────┐
             │   Per image (parallel)   │
             ▼                   ▼
        Agent 3             Agent 4
    Vision evidence      Image quality
    "What is visible?"   "Is this usable?"
    Gemini 2.5 Flash     Gemini 2.5 Flash
    OpenCV gate: corrupt / extreme-blur →
    skip both agents entirely
             └─────────┬─────────┘
                       │
                       ▼
           Agent 5 — Deterministic fusion
           Aggregates findings across all images;
           computes evidence_coverage_score
                       │
                       ▼
           Agent 5b — Object-part validator
           Catches impossible combos
           e.g. (car, keyboard) → unknown
                       │
                       ▼
           Agent 6 — History risk
           Reads user_history.csv;
           adds risk_flags only —
           never touches claim_status
                       │
                       ▼
           Agent 7 — Decision engine (zero LLM)
           evidence not met   → not_enough_information
           part + damage match → supported
           part visible, no match → contradicted
                       │
                       ▼
           Agent 8 — Audit & recovery
           7 named consistency rules;
           targeted agent re-run on failure —
           never restarts the full pipeline
                       │
                       ▼
           Layer 5 — CSV formatter
           14-column schema enforcement;
           allowed-value hard gate
                       │
                       ▼
              output.csv row
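
The three Agent 7 rules in the diagram can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in `decision_engine.py`: `FusedEvidence`, its fields, and `decide` are hypothetical names, and the final branch (claimed part never visible) is an assumption, since the diagram does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusedEvidence:
    """Hypothetical summary of what the earlier agents established for one claim."""
    evidence_standard_met: bool  # the bar from evidence_requirements.csv was met
    part_visible: bool           # the claimed part appears in a usable image
    damage_matches: bool         # observed damage matches the claimed issue


def decide(ev: FusedEvidence) -> str:
    """Map fused evidence to a claim_status using rules only, with no model call."""
    if not ev.evidence_standard_met:
        return "not_enough_information"
    if ev.part_visible and ev.damage_matches:
        return "supported"
    if ev.part_visible:
        return "contradicted"
    # Assumption: a part that is never visible can neither confirm nor refute the claim.
    return "not_enough_information"
```

Because the function takes only booleans and returns a fixed label, every branch can be unit-tested without images or API access.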

Design decisions

Decision	Why
VLM asks only "what is visible?" — never "is this claim valid?"	Keeps verdict logic inside the deterministic rule engine, eliminating hallucinated decisions
Agent 1 is a hybrid: regex fast path + LLM fallback	Simple English claims never reach Gemini, cutting ~50% of Agent 1 API calls
OpenCV pre-checks gate every VLM call	Corrupt, too-small, and extreme-blur images are rejected before Gemini is invoked, saving 20–30% of vision calls
Agent 5 (fusion) is pure Python, not an LLM	Aggregation is logic, not reasoning — deterministic, cheap, and trivially testable
Agent 6 (history risk) writes only `risk_flags`	User history adds context but cannot reverse clear visual evidence; matches the problem spec exactly
Agent 8 re-runs individual agents, not the pipeline	Targeted recovery is faster and cheaper than a full restart
Confidence score on every agent output	Audit agent triggers re-run when decision confidence falls below 0.65
Temperature 0.1 on all LLM calls	Produces consistent, structured JSON; minimises output variance across runs
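
The hybrid design of Agent 1 might look roughly like the sketch below: try a regex fast path, and return `None` to signal that the caller should fall back to the LLM. The vocabularies, the ASCII check used as a stand-in for language detection, the `"regex"` path label, and the function name `parse_fast` are all assumptions made for illustration. The real lists and routing logic live in the project's own modules.

```python
import re
from typing import Optional

# Hypothetical vocabulary, for illustration only.
ISSUES = ("dent", "scratch", "crack", "stain")
PARTS = ("rear bumper", "front bumper", "screen", "keyboard")

ISSUE_RE = re.compile(r"\b(" + "|".join(ISSUES) + r")(?:s|es|ed)?\b", re.IGNORECASE)
PART_RE = re.compile(r"\b(" + "|".join(PARTS) + r")\b", re.IGNORECASE)


def parse_fast(claim: str) -> Optional[dict]:
    """Return a parsed claim if the regex fast path applies, else None."""
    if not claim.isascii():
        return None  # assumption: non-ASCII text is treated as likely multilingual
    issues = {m.lower() for m in ISSUE_RE.findall(claim)}
    parts = {m.lower() for m in PART_RE.findall(claim)}
    if len(issues) != 1 or len(parts) != 1:
        return None  # ambiguous or multi-part claim: leave it to the LLM fallback
    return {
        "claimed_issue": issues.pop(),
        "claimed_part": parts.pop().replace(" ", "_"),
        "path": "regex",
    }
```

A caller would invoke the Gemini fallback only when `parse_fast` returns `None`, which is what keeps simple English claims off the API.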

Repository layout

ProofLens/
├── context.md                     ← architecture log, updated after each phase
├── output.csv                     ← final predictions (44 rows, 14 columns)
├── dataset/
│   ├── claims.csv
│   ├── sample_claims.csv
│   ├── user_history.csv
│   ├── evidence_requirements.csv
│   └── images/
│       ├── sample/
│       └── test/
└── code/
    ├── main.py                    ← pipeline entry point
    ├── requirements.txt
    ├── .env.example
    ├── agents/
    │   ├── claim_parser.py        ← Agent 1: hybrid regex + Gemini
    │   ├── evidence_requirement.py← Agent 2: CSV lookup
    │   ├── vision_evidence.py     ← Agent 3: VLM per image
    │   ├── image_quality.py       ← Agent 4: VLM per image
    │   ├── cross_image_fusion.py  ← Agent 5: deterministic aggregation
    │   ├── object_part_validator.py← Agent 5b: schema guard
    │   ├── history_risk.py        ← Agent 6: risk flags only
    │   ├── decision_engine.py     ← Agent 7: pure rules
    │   ├── audit_recovery.py      ← Agent 8: consistency + targeted re-run
    │   └── csv_formatter.py       ← Layer 5
    ├── core/
    │   ├── config.py              ← constants, thresholds, allowed values
    │   ├── models.py              ← Pydantic schemas (confidence on every output)
    │   ├── loader.py
    │   ├── signal_detector.py     ← injection / threat / language detection
    │   ├── taxonomy.py            ← 59-entry VLM vocab → schema normaliser
    │   ├── openrouter.py          ← API wrapper with retry + semaphore
    │   └── precheck.py            ← OpenCV blur / brightness / corruption checks
    ├── tests/
    │   ├── test_core.py           ← 49 tests
    │   ├── test_agent1.py         ← 6 tests
    │   ├── test_agents_3_4.py     ← 10 tests
    │   ├── test_agents_5_6.py     ← 13 tests
    │   ├── test_agents_7_8.py     ← 6 tests
    │   ├── test_pipeline_e2e.py   ← 28 tests
    │   └── test_evaluation.py     ← 28 tests
    └── evaluation/
        ├── main.py
        ├── metrics.py
        └── evaluation_report.md

Setup

Requirements: Python 3.11+ and an OpenRouter API key with access to google/gemini-2.5-flash.

# Install dependencies
pip install -r code/requirements.txt

# Configure API key
cp code/.env.example code/.env
# Add OPENROUTER_API_KEY to code/.env

Running

Full pipeline — reads dataset/claims.csv, writes output.csv:

python -m code.main

Evaluation — runs the pipeline against sample_claims.csv and writes code/evaluation/evaluation_report.md:

python -m code.evaluation.main --real       # requires images + API key
python -m code.evaluation.main --synthetic  # offline, reproducible baseline

Tests — 140 tests total, all passing:

PYTHONPATH=code python -m pytest code/tests/ -v

Output schema

The 14 required output columns, in exact submission order:

Column	Values
`user_id`	Claimant identifier
`image_paths`	Semicolon-separated input paths
`user_claim`	Raw conversation transcript
`claim_object`	`car` · `laptop` · `package`
`evidence_standard_met`	`true` · `false`
`evidence_standard_met_reason`	One-sentence explanation
`risk_flags`	Semicolon-separated flags, or `none`
`issue_type`	`dent` · `scratch` · `crack` · `glass_shatter` · `broken_part` · `missing_part` · `torn_packaging` · `crushed_packaging` · `water_damage` · `stain` · `none` · `unknown`
`object_part`	Object-specific part name, or `unknown`
`claim_status`	`supported` · `contradicted` · `not_enough_information`
`claim_status_justification`	Image-grounded explanation, references image IDs
`supporting_image_ids`	Semicolon-separated image IDs, or `none`
`valid_image`	`true` · `false`
`severity`	`none` · `low` · `medium` · `high` · `unknown`
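
A schema gate over these columns could be sketched as follows. The column order and allowed values are copied from the table above; the function name `format_row` and the choice to raise `ValueError` on a violation (rather than coerce to a fallback value) are assumptions, not a description of `csv_formatter.py`.

```python
import csv
import io

COLUMNS = [
    "user_id", "image_paths", "user_claim", "claim_object",
    "evidence_standard_met", "evidence_standard_met_reason", "risk_flags",
    "issue_type", "object_part", "claim_status", "claim_status_justification",
    "supporting_image_ids", "valid_image", "severity",
]

ALLOWED = {
    "claim_object": {"car", "laptop", "package"},
    "evidence_standard_met": {"true", "false"},
    "issue_type": {
        "dent", "scratch", "crack", "glass_shatter", "broken_part",
        "missing_part", "torn_packaging", "crushed_packaging",
        "water_damage", "stain", "none", "unknown",
    },
    "claim_status": {"supported", "contradicted", "not_enough_information"},
    "valid_image": {"true", "false"},
    "severity": {"none", "low", "medium", "high", "unknown"},
}


def format_row(row: dict) -> str:
    """Validate one prediction and return it as a CSV line in submission order."""
    missing = [c for c in COLUMNS if c not in row]
    extra = [c for c in row if c not in COLUMNS]
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing} extra={extra}")
    for column, allowed in ALLOWED.items():
        if row[column] not in allowed:
            raise ValueError(f"{column}={row[column]!r} is not an allowed value")
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([row[c] for c in COLUMNS])
    return buffer.getvalue()
```

Using `csv.writer` rather than string joining means free-text fields such as `user_claim` are quoted correctly when they contain commas or newlines.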

Special signal handling

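The pipeline handles four categories of adversarial or unusual input that appear in the test set. The three text-based signals are detected by `SignalDetector` before any LLM call; instructions embedded inside an image can only be caught during vision analysis by Agents 3 and 4.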

Signal	Detection method	Handling
Prompt injection in `user_claim`	`SignalDetector` regex, runs pre-LLM	Sets `text_instruction_present` risk flag; real claim still extracted normally
Multilingual claims (Hindi, Spanish, Chinese, mixed)	Language keyword scoring in `SignalDetector`	Routes Agent 1 to Gemini fallback; no special handling otherwise
Escalation threats ("will keep reopening", "escalate publicly")	Threat-pattern regex	Sets `manual_review_required` flag; verdict is never biased
Instructions embedded inside an image	Agents 3 + 4 vision analysis	Flagged as `text_instruction_present`; instruction content ignored for evidence
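
The text-based rows of the table can be illustrated with a small regex detector. The patterns below are hypothetical examples (only the two quoted threat phrases come from the table), and the function names are assumptions. The flag names and the semicolon-separated `risk_flags` format follow the tables in this README.

```python
import re

# Hypothetical patterns for illustration; the project's real patterns may differ.
INJECTION_RE = re.compile(
    r"ignore (?:all |any )?(?:previous|prior|above) instructions"
    r"|mark (?:this|the) claim as (?:supported|approved)",
    re.IGNORECASE,
)
THREAT_RE = re.compile(r"keep reopening|escalate publicly", re.IGNORECASE)


def detect_text_flags(user_claim: str) -> list[str]:
    """Return risk flags raised by the claim text alone, before any LLM call."""
    flags = []
    if INJECTION_RE.search(user_claim):
        flags.append("text_instruction_present")
    if THREAT_RE.search(user_claim):
        flags.append("manual_review_required")
    return flags


def format_risk_flags(flags: list[str]) -> str:
    """Render flags in the risk_flags column format: semicolon-separated, or 'none'."""
    return ";".join(flags) if flags else "none"
```

Note that the detector only adds flags. It never edits the claim text or the verdict, which matches the "verdict is never biased" handling described above.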

Cost and latency

Metric	Sample — 20 rows	Full test — 44 rows
Agent 1 LLM calls (fallback only)	~5–10	~10–22
Vision API calls (Agents 3 + 4, parallel)	~30–40	~66–88
Images skipped by OpenCV gate	~20–30%	~20–30%
Estimated cost at Gemini 2.5 Flash pricing	~$0.009	~$0.020
Estimated wall-clock runtime	2–3 min	5–7 min

Rate-limit strategy: asyncio.Semaphore(5) caps concurrent OpenRouter requests. All calls retry up to 3 times with exponential backoff (2 s → 4 s → 8 s) on HTTP 429, 500, 502, and 503.
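
A minimal sketch of that strategy, using only the standard library, is shown below. `ApiError` and `call_with_retry` are hypothetical names introduced for the example, and the real wrapper in `core/openrouter.py` may be structured differently. The retryable status codes, retry count, and delays are taken from the paragraph above.

```python
import asyncio
from typing import Awaitable, Callable

RETRYABLE_STATUSES = {429, 500, 502, 503}
MAX_RETRIES = 3
BASE_DELAY_S = 2.0  # delays of 2 s, 4 s, 8 s


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed request."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


async def call_with_retry(
    request: Callable[[], Awaitable[str]],
    semaphore: asyncio.Semaphore,
    base_delay: float = BASE_DELAY_S,
) -> str:
    """Run one request under the shared semaphore, retrying retryable failures."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            async with semaphore:
                return await request()
        except ApiError as err:
            if err.status not in RETRYABLE_STATUSES or attempt == MAX_RETRIES:
                raise
        # Sleep outside the semaphore so a backing-off call does not hold a slot.
        await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In use, a single `asyncio.Semaphore(5)` would be created once inside the running event loop and passed to every call, so the cap applies across all agents rather than per agent.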

Evaluation results

Results on sample_claims.csv in offline mode (images not available locally):

Field	Accuracy	Note
`claim_status`	0.15	All rows default to `not_enough_information` without images
`issue_type`	0.15	Same — requires visual evidence
`object_part`	0.75	Agent 1 extracts the claimed part from text alone, regardless of images
`severity`	0.15	Defaults to `unknown` without visual input
`evidence_standard_met`	0.15	All `false` without images
`valid_image`	0.15	All `false` without images

Full accuracy figures are produced in the HackerRank sandbox where dataset/images/ is present and Gemini vision analysis runs against real images.
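
For reference, per-field accuracy of the kind reported above can be computed as the share of rows whose predicted value exactly matches the expected one. The sketch below assumes exact string matching and rows aligned by position; `field_accuracy` is a hypothetical name, and `code/evaluation/metrics.py` may define its metrics differently.

```python
def field_accuracy(
    predicted: list[dict], expected: list[dict], fields: list[str]
) -> dict[str, float]:
    """Return, for each field, the fraction of rows where prediction equals label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same number of rows")
    if not expected:
        raise ValueError("cannot compute accuracy over zero rows")
    total = len(expected)
    return {
        field: sum(p[field] == e[field] for p, e in zip(predicted, expected)) / total
        for field in fields
    }
```

Under this definition, a score of 0.15 on a 20-row sample corresponds to 3 exact matches.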

Pipeline visualiser (UI)

An interactive web interface that streams each agent's output in real time as a claim is processed. Built with Next.js 16 and FastAPI, deployed on Vercel and Render.

Each pipeline step emits a Server-Sent Event on completion:

{
  "type": "step_complete",
  "step": "claim_parser",
  "duration_ms": 1200,
  "data": {
    "claimed_issue": "dent",
    "claimed_part": "rear_bumper",
    "path": "llm_fallback"
  }
}
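
On the wire, a Server-Sent Event is a `data:` line followed by a blank line. The sketch below shows one way such a payload could be framed and read back. It assumes the event type travels inside the JSON body, as in the example above, rather than in a separate SSE `event:` field. `format_sse` and `parse_sse` are hypothetical helper names, not functions from the project's API.

```python
import json


def format_sse(event: dict) -> str:
    """Serialise one event as an SSE message: a single data line plus a blank line."""
    payload = json.dumps(event, separators=(",", ":"))  # compact, no embedded newlines
    return f"data: {payload}\n\n"


def parse_sse(message: str) -> dict:
    """Recover the JSON payload from one SSE message produced by format_sse."""
    data_lines = [
        line[len("data:"):].strip()
        for line in message.splitlines()
        if line.startswith("data:")
    ]
    return json.loads("\n".join(data_lines))
```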

Run locally:

# Backend
pip install -r code/requirements.txt -r api/requirements.txt
uvicorn api.main:app --reload --port 8000

# Frontend
cd ui && cp .env.local.example .env.local
# Set NEXT_PUBLIC_API_URL=http://localhost:8000
npm install && npm run dev

Deploy:

Service	Steps
Render (backend)	Connect repo → `render.yaml` is auto-detected → add `OPENROUTER_API_KEY` as a secret env var
Vercel (frontend)	Import repo → `vercel.json` sets `rootDirectory: ui` → add `NEXT_PUBLIC_API_URL` pointing to your Render URL → deploy

Submission

File	Description
`output.csv`	Predictions for all 44 rows in `claims.csv`
`code.zip`	Full runnable solution including `evaluation/` folder
`code/evaluation/evaluation_report.md`	Per-field accuracy, strategy comparison, full operational analysis
