Gaur-Ayush-AI-Engineer/freight-bill-processor

Freight Bill Processing System

An AI-powered backend that ingests freight carrier invoices, validates them against contracts and delivery records, and makes approval/dispute decisions with a confidence score. Ambiguous cases are paused for human review via a LangGraph interrupt/resume pattern.


Quick Start

cp .env.example .env
# Add your OPENAI_API_KEY to .env

docker compose up --build

The API will be at http://localhost:8000. Swagger UI at http://localhost:8000/docs. Seed data loads automatically on first startup.


API Endpoints

Submit a freight bill

curl -X POST http://localhost:8000/freight-bills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "FB-2025-101",
    "carrier_name": "Safexpress Logistics",
    "carrier_external_id": "CAR001",
    "bill_number": "SFX/2025/00234",
    "bill_date": "2025-02-15",
    "shipment_reference": "SHP-2025-002",
    "lane": "DEL-BLR",
    "billed_weight_kg": 850,
    "rate_per_kg": 15.00,
    "base_charge": 12750.00,
    "fuel_surcharge": 1020.00,
    "gst_amount": 2479.00,
    "total_amount": 16249.00,
    "billing_unit": "kg"
  }'

Check bill status and decision

curl http://localhost:8000/freight-bills/FB-2025-101

View review queue

curl http://localhost:8000/review-queue

Submit human review decision

curl -X POST http://localhost:8000/review/FB-2025-102 \
  -H "Content-Type: application/json" \
  -d '{"decision": "approve", "reviewer_id": "ops_team", "notes": "CC-2025-SFX-003 applies"}'

Health check

curl http://localhost:8000/health

Running Tests

# Activate venv
source .venv/bin/activate

# Unit tests only — no Docker, no API calls needed
pytest tests/ -m "not live and not integration" -v

# Live OpenAI tests — needs OPENAI_API_KEY in .env
pytest tests/ -m live -v

# Integration tests — needs docker compose up running
pytest tests/ -m integration -v

# Everything
pytest tests/ -v

Schema Design

Why this structure:

The relational schema separates entities that change at different rates. Carriers and contracts are relatively stable; shipments and BOLs are transactional. The billed_weight_ledger table is the key to detecting over-billing across multiple bills for the same shipment — it accumulates billed weight per shipment so any new bill can instantly check how much has already been invoiced.

freight_bills stores both raw input (what the carrier claimed) and resolved state (what the agent concluded), with evidence_chain as a JSONB blob for the full audit trail.

Tables:

  • carriers — canonical carrier records with aliases for fuzzy name matching
  • carrier_contracts — one row per lane within a contract (multi-lane contracts are split into sibling rows)
  • shipments — delivery orders with expected total weight
  • bills_of_lading — actual delivery confirmation with received weight
  • freight_bills — incoming invoices, enriched with agent decisions
  • billed_weight_ledger — running sum of approved/claimed weight per shipment
  • review_events — immutable audit log of every state transition
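The over-billing check that billed_weight_ledger makes possible can be sketched in a few lines. This is a minimal in-memory illustration, not the repository's code: the class, its method names, and the 5% tolerance are all assumptions made for the example.

```python
from collections import defaultdict


class BilledWeightLedger:
    """In-memory stand-in for the billed_weight_ledger table (hypothetical)."""

    def __init__(self) -> None:
        # shipment_reference -> total weight in kg invoiced so far
        self._billed_kg = defaultdict(float)

    def already_billed(self, shipment_ref: str) -> float:
        """Weight already invoiced against this shipment (0.0 if none)."""
        return self._billed_kg.get(shipment_ref, 0.0)

    def would_overbill(
        self,
        shipment_ref: str,
        new_weight_kg: float,
        expected_kg: float,
        tolerance: float = 0.05,  # assumed tolerance, not taken from the repo
    ) -> bool:
        """True if adding this bill pushes the running total past the expected weight."""
        total = self.already_billed(shipment_ref) + new_weight_kg
        return total > expected_kg * (1 + tolerance)

    def record(self, shipment_ref: str, weight_kg: float) -> None:
        """Add a bill's weight to the running total for its shipment."""
        self._billed_kg[shipment_ref] += weight_kg


ledger = BilledWeightLedger()
ledger.record("SHP-2025-002", 850)  # first bill for the shipment
second_bill_overbills = ledger.would_overbill("SHP-2025-002", 300, expected_kg=1000)
```

In the real system the running total would come from a SQL aggregate over the ledger table rather than a dictionary, but the comparison is the same.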

Carrier Name Matching

Carrier names on real invoices are messy. The system resolves them in order:

  1. Carrier external ID provided directly — skip matching, look up by external_id. Instant.
  2. Fuzzy match against aliases — each carrier has a list of known alternate names (e.g. "SFX", "Safe Express", "Safexpress Pvt Ltd"). rapidfuzz scores the invoice name against all aliases. If score ≥ 85 → matched. No API call.
  3. LLM fallback — if fuzzy fails, GPT-4o-mini is given the raw name and the list of known carriers. It reasons about which carrier it could be.
  4. Human review — if LLM also fails or returns no match, the bill goes to the human review queue. A reviewer manually resolves it.

The LLM can only return a name that exists in your carrier list, so it cannot introduce a carrier that is not already known. The worst case is therefore human review, not a match to a nonexistent carrier.
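The four-step cascade above can be sketched as a single function. This is an illustration under stated assumptions, not the project's implementation: difflib from the standard library stands in for rapidfuzz so the example has no dependencies, the CARRIERS mapping is hypothetical, and llm_fallback is a hypothetical callable that represents the GPT-4o-mini call.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

FUZZY_THRESHOLD = 85  # the cut-off quoted above for fuzzy scores

# Hypothetical carrier records: external_id -> known aliases
CARRIERS = {
    "CAR001": ["Safexpress Logistics", "SFX", "Safe Express", "Safexpress Pvt Ltd"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive 0-100 similarity; difflib stands in for rapidfuzz."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_carrier(
    name: str,
    external_id: Optional[str] = None,
    llm_fallback: Optional[Callable[[str, list], Optional[str]]] = None,
):
    """Return (carrier_id, method); carrier_id is None when a human must decide."""
    # 1. Direct lookup by external ID, no matching needed
    if external_id in CARRIERS:
        return external_id, "external_id"

    # 2. Best fuzzy score across every alias of every carrier
    best_id, best_score = None, 0.0
    for carrier_id, aliases in CARRIERS.items():
        for alias in aliases:
            score = similarity(name, alias)
            if score > best_score:
                best_id, best_score = carrier_id, score
    if best_score >= FUZZY_THRESHOLD:
        return best_id, "fuzzy"

    # 3. LLM fallback: accept the answer only if it is a known carrier ID
    if llm_fallback is not None:
        guess = llm_fallback(name, list(CARRIERS))
        if guess in CARRIERS:
            return guess, "llm"

    # 4. Nothing matched, so the bill goes to the review queue
    return None, "human_review"
```

Checking the LLM's answer against the known IDs in step 3 is one way to enforce the guarantee that an invented carrier never gets matched.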


Graph Model

Why NetworkX:

The relationships between carriers, contracts, lanes, shipments, and BOLs form a natural graph. The core operation is traversal: "given this carrier and lane, find all valid contracts for this bill date." NetworkX MultiDiGraph models multiple overlapping contracts between the same carrier and lane as parallel edges — which is exactly the ambiguous case that triggers human review.

The graph is built from the database on startup and held in memory as a singleton. It is fast (pure Python dict lookups), requires no additional infrastructure, and is trivially rebuilt when contracts change.

Node types: carrier:{external_id}, contract:{contract_number}, shipment:{shipment_ref}, bol:{bol_ref}

Edge types: HAS_CONTRACT, HANDLES_SHIPMENT, GOVERNS_SHIPMENT, HAS_BOL


Agent Design (LangGraph)

The agent is a StateGraph with 11 nodes. State is persisted to PostgreSQL via langgraph-checkpoint-postgres, which enables the interrupt/resume pattern.

normalize_carrier → check_duplicate → resolve_entities → select_contract
  → run_validation → compute_confidence → make_decision → generate_explanation
  → [if low confidence or ambiguous] human_review_interrupt
  → process_human_decision → generate_explanation → persist_decision

Key design choices:

  • Duplicate check before validation — avoids running expensive validation + LLM on a bill that will be rejected anyway
  • select_contract is a decision gate — zero active contracts → dispute; multiple → interrupt; one → proceed
  • LLM only at two points — carrier normalization (when fuzzy fails) and decision explanation (always). Everything else is deterministic.
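The select_contract gate described above can be sketched as a small pure function. The Contract fields, the function signature, and the contract numbers used in the example are assumptions for illustration; the repository's node operates on graph state rather than a plain list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Contract:
    """Hypothetical shape of one carrier_contracts row."""
    contract_number: str
    carrier_id: str
    lane: str
    valid_from: date
    valid_to: date


def select_contract(contracts, carrier_id: str, lane: str, bill_date: date):
    """Return (route, contract): 'dispute', 'interrupt', or 'proceed'."""
    active = [
        c for c in contracts
        if c.carrier_id == carrier_id
        and c.lane == lane
        and c.valid_from <= bill_date <= c.valid_to
    ]
    if not active:
        return "dispute", None    # zero active contracts
    if len(active) > 1:
        return "interrupt", None  # overlapping contracts: a human picks one
    return "proceed", active[0]   # exactly one: continue to validation


# Two hypothetical contracts on the same lane that overlap during March 2025
contracts = [
    Contract("CC-A", "CAR001", "DEL-BLR", date(2025, 1, 1), date(2025, 3, 31)),
    Contract("CC-B", "CAR001", "DEL-BLR", date(2025, 3, 1), date(2025, 12, 31)),
]
route, chosen = select_contract(contracts, "CAR001", "DEL-BLR", date(2025, 2, 15))
```

A bill dated 2025-02-15 falls inside only the first contract window, so it proceeds; a bill dated inside March would hit both windows and be interrupted.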

Confidence Scoring

The score is a weighted sum of five components, then multiplied by penalty factors:

Component       | Weight | What it measures
--------------- | ------ | ---------------------------------------------------------
Entity match    | 35%    | How many of carrier/contract/shipment/BOL were matched
Rate accuracy   | 25%    | How close billed rate is to contracted rate (5% tolerance)
Weight accuracy | 20%    | Whether billed weight ratio to shipment is 0.95–1.05
Date validity   | 15%    | Whether bill date falls within contract window
Duplicate risk  | 5%     | Whether a prior bill with same number exists

Penalty multipliers (applied multiplicatively):

  • Unit type mismatch: ×0.70
  • Fuel surcharge wrong: ×0.85
  • No shipment ref AND no BOL: ×0.80
  • LLM normalization used: ×0.90
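The arithmetic can be sketched as follows. The weights and multipliers are the ones listed above; the dictionary keys, the function name, and the convention that every component is a value between 0 and 1 (with 1.0 meaning "no problem found") are assumptions for the example.

```python
WEIGHTS = {
    "entity_match": 0.35,
    "rate_accuracy": 0.25,
    "weight_accuracy": 0.20,
    "date_validity": 0.15,
    "duplicate_risk": 0.05,
}

PENALTIES = {
    "unit_mismatch": 0.70,
    "fuel_surcharge_wrong": 0.85,
    "no_shipment_ref_and_no_bol": 0.80,
    "llm_normalization_used": 0.90,
}


def confidence(components: dict, flags=()) -> float:
    """Weighted sum of component scores, then each triggered penalty multiplied in."""
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    for flag in flags:
        score *= PENALTIES[flag]
    return round(score, 4)


perfect = {name: 1.0 for name in WEIGHTS}
clean_score = confidence(perfect)
penalized_score = confidence(perfect, ["llm_normalization_used", "fuel_surcharge_wrong"])
```

Because the penalties multiply, a bill that is otherwise perfect but needed LLM normalization and has a wrong fuel surcharge scores 1.0 × 0.90 × 0.85 = 0.765, which is already below the auto-approve threshold.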

Decision thresholds:

  • ≥ 0.85, no warnings → auto_approve
  • < 0.85, or warnings present → human_review_interrupt (agent pauses)
  • Any blocker rule → dispute
  • Unknown carrier or no contract → human_review_interrupt
  • Duplicate detected → reject
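These rules can be sketched as one function. The list above does not say which rule wins when several apply, so the order of the checks here (duplicate, then blockers, then missing carrier or contract, then the threshold) is an assumption, as are the function and parameter names.

```python
AUTO_APPROVE_THRESHOLD = 0.85


def make_decision(
    score: float,
    *,
    warnings=(),
    blockers=(),
    duplicate: bool = False,
    carrier_known: bool = True,
    contract_found: bool = True,
) -> str:
    """Map a confidence score and rule outcomes to one of the listed decisions."""
    if duplicate:
        return "reject"
    if blockers:
        return "dispute"
    if not carrier_known or not contract_found:
        return "human_review_interrupt"
    if score >= AUTO_APPROVE_THRESHOLD and not warnings:
        return "auto_approve"
    # Low score, or a high score with warnings attached
    return "human_review_interrupt"
```

For example, a score of 0.92 with no warnings is auto-approved, while the same score with a single warning pauses for review.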

Human-in-the-Loop

When the agent reaches human_review_interrupt, the node calls LangGraph's interrupt() (imported from langgraph.types). Around that call, the following happens:

  1. Updates freight_bills.status = 'human_review_pending' in the DB
  2. Logs a ReviewEvent with reason and confidence score
  3. Serializes the full graph state to PostgreSQL (via the checkpointer)
  4. Pauses execution — the bill is now in the review queue

POST /review/{id} calls runner.resume_agent() which:

  1. Looks up the langgraph_thread_id from the DB
  2. Calls graph.ainvoke(Command(resume=payload, update=payload), config={"configurable": {"thread_id": thread_id}})
  3. The graph resumes at process_human_decision with the reviewer's decision in state
  4. Flows through generate_explanation → persist_decision → final status written to DB

This is a real interrupt/resume — the agent state is genuinely suspended between calls, not simulated with a status flag.

Note on auto_approved status: Both agent-approved and human-approved bills end up with status: "auto_approved". This just means "cleared for payment." The audit trail (reviewer_id, reviewer_decision, reviewed_at) tells you whether a human was involved.


What I'd Do Differently With More Time

  1. Contract disambiguation heuristic — currently interrupts for human review when multiple contracts overlap. A smarter approach: prefer the contract referenced by the shipment's own contract_id, fall back to human only if still ambiguous.

  2. Task queue — the agent runs as a FastAPI background task, meaning in-flight runs don't survive server restarts. Celery or Cloud Tasks would decouple ingestion from processing.

  3. Graph hot-reload — the in-memory graph is built once on startup. A POST /admin/rebuild-graph endpoint would let contract updates propagate without a restart.

  4. Structured audit trail: evidence_chain is a free-form JSONB blob. A normalized audit table with typed columns would make analytics easier (e.g. "how many bills had rate drift > 5% this month?").

  5. GCP deployment — Cloud Run for the API, Cloud SQL for Postgres, Secret Manager for the API key. The docker-compose setup makes this straightforward.

About

AI-powered freight bill processing backend built with FastAPI, PostgreSQL, LangGraph, and NetworkX. It validates carrier invoices against contracts, shipments, and bills of lading, auto-approves clean cases, flags disputes, and supports human-in-the-loop review for ambiguous scenarios.
