An AI-powered backend that ingests freight carrier invoices, validates them against contracts and delivery records, and makes approval/dispute decisions with a confidence score. Ambiguous cases are paused for human review via a LangGraph interrupt/resume pattern.
cp .env.example .env
# Add your OPENAI_API_KEY to .env
docker compose up --build

The API will be at http://localhost:8000. Swagger UI at http://localhost:8000/docs. Seed data loads automatically on first startup.
curl -X POST http://localhost:8000/freight-bills \
-H "Content-Type: application/json" \
-d '{
"id": "FB-2025-101",
"carrier_name": "Safexpress Logistics",
"carrier_external_id": "CAR001",
"bill_number": "SFX/2025/00234",
"bill_date": "2025-02-15",
"shipment_reference": "SHP-2025-002",
"lane": "DEL-BLR",
"billed_weight_kg": 850,
"rate_per_kg": 15.00,
"base_charge": 12750.00,
"fuel_surcharge": 1020.00,
"gst_amount": 2479.00,
"total_amount": 16249.00,
"billing_unit": "kg"
}'

curl http://localhost:8000/freight-bills/FB-2025-101

curl http://localhost:8000/review-queue

curl -X POST http://localhost:8000/review/FB-2025-102 \
-H "Content-Type: application/json" \
-d '{"decision": "approve", "reviewer_id": "ops_team", "notes": "CC-2025-SFX-003 applies"}'

curl http://localhost:8000/health

# Activate venv
source .venv/bin/activate
# Unit tests only — no Docker, no API calls needed
pytest tests/ -m "not live and not integration" -v
# Live OpenAI tests — needs OPENAI_API_KEY in .env
pytest tests/ -m live -v
# Integration tests — needs docker compose up running
pytest tests/ -m integration -v
# Everything
pytest tests/ -v

Why this structure:
The relational schema separates entities that change at different rates. Carriers and contracts are relatively stable; shipments and BOLs are transactional. The billed_weight_ledger table is the key to detecting over-billing across multiple bills for the same shipment — it accumulates billed weight per shipment so any new bill can instantly check how much has already been invoiced.
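The cumulative check that the ledger enables can be sketched roughly as follows. This is an illustration only, not the project's code: the `WeightLedger` class, its method names, and the 5% tolerance are assumptions made for the example, and an in-memory dict stands in for the billed_weight_ledger table.

```python
from collections import defaultdict


class WeightLedger:
    """Hypothetical in-memory stand-in for the billed_weight_ledger table."""

    def __init__(self) -> None:
        # shipment reference -> total weight (kg) billed so far
        self._billed_kg = defaultdict(float)

    def already_billed(self, shipment_ref: str) -> float:
        return self._billed_kg.get(shipment_ref, 0.0)

    def would_overbill(self, shipment_ref: str, new_weight_kg: float,
                       expected_kg: float, tolerance: float = 0.05) -> bool:
        """True if this bill pushes the cumulative billed weight past the
        shipment's expected weight plus the (assumed) tolerance."""
        total = self.already_billed(shipment_ref) + new_weight_kg
        return total > expected_kg * (1 + tolerance)

    def record(self, shipment_ref: str, weight_kg: float) -> None:
        self._billed_kg[shipment_ref] += weight_kg


ledger = WeightLedger()
ledger.record("SHP-2025-002", 600.0)  # an earlier bill for the same shipment

# 600 + 250 = 850 kg, within 850 * 1.05 = 892.5 kg
fits = ledger.would_overbill("SHP-2025-002", 250.0, expected_kg=850.0)
# 600 + 400 = 1000 kg, above 892.5 kg
exceeds = ledger.would_overbill("SHP-2025-002", 400.0, expected_kg=850.0)
```

Because the running total is kept per shipment, each new bill needs one lookup rather than a scan over all earlier bills.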
freight_bills stores both raw input (what the carrier claimed) and resolved state (what the agent concluded), with evidence_chain as a JSONB blob for the full audit trail.
Tables:
- carriers: canonical carrier records with aliases for fuzzy name matching
- carrier_contracts: one row per lane within a contract (multi-lane contracts are split into sibling rows)
- shipments: delivery orders with expected total weight
- bills_of_lading: actual delivery confirmation with received weight
- freight_bills: incoming invoices, enriched with agent decisions
- billed_weight_ledger: running sum of approved/claimed weight per shipment
- review_events: immutable audit log of every state transition
Carrier names on real invoices are messy. The system resolves them in order:
- Carrier external ID provided directly: skip matching and look up by external_id. Instant.
- Fuzzy match against aliases: each carrier has a list of known alternate names (e.g. "SFX", "Safe Express", "Safexpress Pvt Ltd"). rapidfuzz scores the invoice name against all aliases. If score ≥ 85 → matched. No API call.
- LLM fallback: if fuzzy matching fails, GPT-4o-mini is given the raw name and the list of known carriers. It reasons about which carrier it could be.
- Human review: if the LLM also fails or returns no match, the bill goes to the human review queue. A reviewer manually resolves it.
The LLM can only return a name that exists in your carrier list; it cannot hallucinate a new carrier. So the worst case is human review, not a wrong match.
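The resolution order above could look roughly like this. It is a sketch under stated assumptions, not the project's implementation: `difflib.SequenceMatcher` from the standard library stands in for the rapidfuzz scorer so the example runs without extra dependencies, and `CARRIERS`, `similarity`, `resolve_carrier`, and the `llm_lookup` callback are hypothetical names. Checking the LLM's answer against the known IDs is one assumed way to enforce the "known carriers only" guarantee.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 85  # from the text: a score >= 85 counts as a match

# Hypothetical carrier records: external_id -> known aliases
CARRIERS = {
    "CAR001": ["Safexpress Logistics", "SFX", "Safe Express", "Safexpress Pvt Ltd"],
}


def similarity(a: str, b: str) -> float:
    """0-100 similarity score; a stdlib stand-in for a rapidfuzz scorer."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_carrier(name, external_id=None, llm_lookup=None):
    """Return (external_id or None, name of the step that resolved it)."""
    # 1. Direct ID lookup: no matching needed.
    if external_id in CARRIERS:
        return external_id, "external_id"

    # 2. Fuzzy match against every alias of every carrier.
    best_id, best_score = None, 0.0
    for carrier_id, aliases in CARRIERS.items():
        for alias in aliases:
            score = similarity(name, alias)
            if score > best_score:
                best_id, best_score = carrier_id, score
    if best_score >= FUZZY_THRESHOLD:
        return best_id, "fuzzy"

    # 3. LLM fallback: the answer is accepted only if it is a known carrier.
    if llm_lookup is not None:
        guess = llm_lookup(name, list(CARRIERS))
        if guess in CARRIERS:
            return guess, "llm"

    # 4. Nothing matched: route the bill to human review.
    return None, "human_review"


by_id = resolve_carrier("anything", external_id="CAR001")
by_alias = resolve_carrier("safe express")
# A fake LLM that invents an ID is ignored, so the bill falls through to review.
invented = resolve_carrier("Unknown Movers", llm_lookup=lambda name, ids: "CAR999")
```

Each step runs only when the cheaper one before it fails, so the paid API call is reserved for names the alias list cannot explain.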
Why NetworkX:
The relationships between carriers, contracts, lanes, shipments, and BOLs form a natural graph. The core operation is traversal: "given this carrier and lane, find all valid contracts for this bill date." NetworkX MultiDiGraph models multiple overlapping contracts between the same carrier and lane as parallel edges — which is exactly the ambiguous case that triggers human review.
The graph is built from the database on startup and held in memory as a singleton. It is fast (pure Python dict lookups), requires no additional infrastructure, and is trivially rebuilt when contracts change.
Node types: carrier:{external_id}, contract:{contract_number}, shipment:{shipment_ref}, bol:{bol_ref}
Edge types: HAS_CONTRACT, HANDLES_SHIPMENT, GOVERNS_SHIPMENT, HAS_BOL
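A small sketch of the traversal, using the node and edge naming above. The edge attributes (`type`, `lane`, `valid_from`, `valid_to`), the helper `valid_contracts`, the contract number CC-2025-SFX-001, the lane DEL-MUM, and all dates are assumptions for illustration; the real graph may store this information differently. Here a multi-lane contract becomes parallel HAS_CONTRACT edges between the same carrier and contract node, one per lane.

```python
from datetime import date

import networkx as nx

g = nx.MultiDiGraph()

# (contract_number, lane, valid_from, valid_to): hypothetical sample rows
rows = [
    ("CC-2025-SFX-001", "DEL-BLR", date(2025, 1, 1), date(2025, 6, 30)),
    ("CC-2025-SFX-001", "DEL-MUM", date(2025, 1, 1), date(2025, 6, 30)),
    ("CC-2025-SFX-003", "DEL-BLR", date(2025, 2, 1), date(2025, 12, 31)),
]
for number, lane, start, end in rows:
    # One edge per contract row; rows sharing a contract become parallel edges.
    g.add_edge("carrier:CAR001", f"contract:{number}", key=lane,
               type="HAS_CONTRACT", lane=lane, valid_from=start, valid_to=end)


def valid_contracts(graph, carrier_id, lane, bill_date):
    """Contract nodes reachable from the carrier that cover the lane and date."""
    node = f"carrier:{carrier_id}"
    if node not in graph:
        return []
    matches = []
    for _, target, data in graph.out_edges(node, data=True):
        if (data.get("type") == "HAS_CONTRACT"
                and data["lane"] == lane
                and data["valid_from"] <= bill_date <= data["valid_to"]):
            matches.append(target)
    return sorted(matches)


# Two contracts cover DEL-BLR on this date: the ambiguous case.
ambiguous = valid_contracts(g, "CAR001", "DEL-BLR", date(2025, 2, 15))
# Only one contract is active in mid-January.
unique = valid_contracts(g, "CAR001", "DEL-BLR", date(2025, 1, 15))
```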
The agent is a StateGraph with 11 nodes. State is persisted to PostgreSQL via langgraph-checkpoint-postgres, which enables the interrupt/resume pattern.
normalize_carrier → check_duplicate → resolve_entities → select_contract
→ run_validation → compute_confidence → make_decision → generate_explanation
→ [if low confidence or ambiguous] human_review_interrupt
→ process_human_decision → generate_explanation → persist_decision
Key design choices:
- Duplicate check before validation: avoids running expensive validation + LLM on a bill that will be rejected anyway
- select_contract is a decision gate: zero active contracts → dispute; multiple → interrupt; one → proceed
- LLM only at two points: carrier normalization (when fuzzy fails) and decision explanation (always). Everything else is deterministic.
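The select_contract gate can be pictured as a plain function over the candidate contracts. This is a minimal sketch, not the actual node: the `Contract` dataclass, the function signature, the returned step names, and the contract number CC-2025-SFX-001 are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contract:
    """Hypothetical, trimmed-down view of a carrier_contracts row."""
    number: str
    lane: str
    valid_from: date
    valid_to: date


def select_contract(contracts, lane, bill_date):
    """Return (next_step, chosen_contract_or_None)."""
    active = [
        c for c in contracts
        if c.lane == lane and c.valid_from <= bill_date <= c.valid_to
    ]
    if not active:
        return "dispute", None                 # zero active contracts
    if len(active) > 1:
        return "human_review_interrupt", None  # overlapping contracts
    return "run_validation", active[0]         # exactly one: proceed


contracts = [
    Contract("CC-2025-SFX-001", "DEL-BLR", date(2025, 1, 1), date(2025, 6, 30)),
    Contract("CC-2025-SFX-003", "DEL-BLR", date(2025, 2, 1), date(2025, 12, 31)),
]

single = select_contract(contracts, "DEL-BLR", date(2025, 1, 15))
overlap = select_contract(contracts, "DEL-BLR", date(2025, 2, 15))
missing = select_contract(contracts, "DEL-BLR", date(2026, 1, 1))
```

Keeping the gate deterministic means the LLM is never asked to choose between contracts; ambiguity goes to a person.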
The confidence score is a weighted sum of five components, which is then multiplied by penalty factors:
| Component | Weight | What it measures |
|---|---|---|
| Entity match | 35% | How many of carrier/contract/shipment/BOL were matched |
| Rate accuracy | 25% | How close billed rate is to contracted rate (5% tolerance) |
| Weight accuracy | 20% | Whether billed weight ratio to shipment is 0.95–1.05 |
| Date validity | 15% | Whether bill date falls within contract window |
| Duplicate risk | 5% | Whether a prior bill with same number exists |
Penalty multipliers (applied multiplicatively):
- Unit type mismatch: ×0.70
- Fuel surcharge wrong: ×0.85
- No shipment ref AND no BOL: ×0.80
- LLM normalization used: ×0.90
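As a worked example of the arithmetic, the sketch below applies the weights and multipliers listed above. It assumes each component is already scored in the range 0 to 1 (with 1.0 for duplicate risk meaning no prior bill was found); the dictionary keys and the `confidence` function are hypothetical names, and how each component is actually scored is not shown here.

```python
# Weights and multipliers taken from the table and list above.
WEIGHTS = {
    "entity_match": 0.35,
    "rate_accuracy": 0.25,
    "weight_accuracy": 0.20,
    "date_validity": 0.15,
    "duplicate_risk": 0.05,
}
PENALTIES = {
    "unit_mismatch": 0.70,
    "fuel_surcharge_wrong": 0.85,
    "no_shipment_ref_or_bol": 0.80,
    "llm_normalization_used": 0.90,
}


def confidence(components, flags=()):
    """Weighted sum of component scores, then one multiplier per raised flag."""
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    for flag in flags:
        score *= PENALTIES[flag]
    return score


# Three of four entities matched (0.75); every other component is perfect.
components = {
    "entity_match": 0.75,
    "rate_accuracy": 1.0,
    "weight_accuracy": 1.0,
    "date_validity": 1.0,
    "duplicate_risk": 1.0,
}
base = confidence(components)                                  # about 0.9125
penalized = confidence(components, ["llm_normalization_used"])  # about 0.82125
```

In this example the weighted sum is 0.35 × 0.75 + 0.25 + 0.20 + 0.15 + 0.05 = 0.9125, and the LLM-normalization multiplier brings it to 0.9125 × 0.90 = 0.82125. Because the penalties multiply, several flags on one bill compound: two flags at ×0.85 and ×0.80 together scale the score by 0.68.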
Decision thresholds:
- ≥ 0.85, no warnings → auto_approve
- < 0.85, or warnings present → human_review_interrupt (agent pauses)
- Any blocker rule → dispute
- Unknown carrier or no contract → human_review_interrupt
- Duplicate detected → reject
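The thresholds can be expressed as a single function. This is a sketch only: the list above does not say which rule wins when several apply, so the order of the checks below (duplicate first, then unresolved entities, then blockers, then the score) is an assumption, as are the function name `decide` and its parameters.

```python
APPROVE_THRESHOLD = 0.85  # from the list above


def decide(score, *, duplicate=False, carrier_known=True, contract_found=True,
           blockers=(), warnings=()):
    """Map a confidence score and rule outcomes to a decision.

    The precedence of the checks is assumed for illustration.
    """
    if duplicate:
        return "reject"
    if not carrier_known or not contract_found:
        return "human_review_interrupt"
    if blockers:
        return "dispute"
    if score >= APPROVE_THRESHOLD and not warnings:
        return "auto_approve"
    # Low score, or a high score with warnings: pause for a reviewer.
    return "human_review_interrupt"


clean = decide(0.91)
warned = decide(0.91, warnings=["fuel surcharge differs"])
low = decide(0.82)
blocked = decide(0.95, blockers=["rate above contract"])
repeat = decide(0.95, duplicate=True)
```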
When the agent reaches human_review_interrupt, it calls langgraph.interrupt() which:
- Updates freight_bills.status = 'human_review_pending' in the DB
- Logs a ReviewEvent with reason and confidence score
- Serializes the full graph state to PostgreSQL (via the checkpointer)
- Pauses execution: the bill is now in the review queue
POST /review/{id} calls runner.resume_agent() which:
- Looks up the langgraph_thread_id from the DB
- Calls graph.ainvoke(Command(resume=payload, update=payload), config={"configurable": {"thread_id": thread_id}})
- The graph resumes at process_human_decision with the reviewer's decision in state
- Flows through generate_explanation → persist_decision → final status written to DB
This is a real interrupt/resume — the agent state is genuinely suspended between calls, not simulated with a status flag.
Note on auto_approved status: Both agent-approved and human-approved bills end up with status: "auto_approved". This just means "cleared for payment." The audit trail (reviewer_id, reviewer_decision, reviewed_at) tells you whether a human was involved.
- Contract disambiguation heuristic: currently interrupts for human review when multiple contracts overlap. A smarter approach: prefer the contract referenced by the shipment's own contract_id, fall back to human only if still ambiguous.
- Task queue: the agent runs as a FastAPI background task, meaning in-flight runs don't survive server restarts. Celery or Cloud Tasks would decouple ingestion from processing.
- Graph hot-reload: the in-memory graph is built once on startup. A POST /admin/rebuild-graph endpoint would let contract updates propagate without a restart.
- Structured audit trail: evidence_chain is a free-form JSONB blob. A normalized audit table with typed columns would make analytics easier (e.g. "how many bills had rate drift > 5% this month?").
- GCP deployment: Cloud Run for the API, Cloud SQL for Postgres, Secret Manager for the API key. The docker-compose setup makes this straightforward.