Freight Bill Processing System

An AI-powered backend that ingests freight carrier invoices, validates them against contracts and delivery records, and makes approval/dispute decisions with a confidence score. Ambiguous cases are paused for human review via a LangGraph interrupt/resume pattern.

Quick Start

cp .env.example .env
# Add your OPENAI_API_KEY to .env

docker compose up --build

The API will be at http://localhost:8000. Swagger UI at http://localhost:8000/docs. Seed data loads automatically on first startup.

API Endpoints

Submit a freight bill

curl -X POST http://localhost:8000/freight-bills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "FB-2025-101",
    "carrier_name": "Safexpress Logistics",
    "carrier_external_id": "CAR001",
    "bill_number": "SFX/2025/00234",
    "bill_date": "2025-02-15",
    "shipment_reference": "SHP-2025-002",
    "lane": "DEL-BLR",
    "billed_weight_kg": 850,
    "rate_per_kg": 15.00,
    "base_charge": 12750.00,
    "fuel_surcharge": 1020.00,
    "gst_amount": 2479.00,
    "total_amount": 16249.00,
    "billing_unit": "kg"
  }'
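The amounts in the sample payload are internally consistent, which helps when hand-crafting test bills. The following is a minimal Python sketch of that arithmetic check. The helper name `check_bill_arithmetic` is hypothetical and not part of the API, and it verifies only the two relations visible in the payload (base charge = weight × rate, total = base + fuel surcharge + GST); the service's own validation rules may differ.

```python
from decimal import Decimal

# Field values copied from the sample request above.
bill = {
    "billed_weight_kg": Decimal("850"),
    "rate_per_kg": Decimal("15.00"),
    "base_charge": Decimal("12750.00"),
    "fuel_surcharge": Decimal("1020.00"),
    "gst_amount": Decimal("2479.00"),
    "total_amount": Decimal("16249.00"),
}


def check_bill_arithmetic(b: dict) -> list[str]:
    """Return a list of arithmetic problems found in the bill (empty if none)."""
    problems = []
    # Base charge should equal billed weight times the per-kg rate.
    if b["billed_weight_kg"] * b["rate_per_kg"] != b["base_charge"]:
        problems.append("base_charge != billed_weight_kg * rate_per_kg")
    # Total should equal the sum of its three parts.
    expected_total = b["base_charge"] + b["fuel_surcharge"] + b["gst_amount"]
    if expected_total != b["total_amount"]:
        problems.append("total_amount != base_charge + fuel_surcharge + gst_amount")
    return problems


print(check_bill_arithmetic(bill))
```

Decimal is used instead of float so that currency comparisons are exact.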

Check bill status and decision

curl http://localhost:8000/freight-bills/FB-2025-101

View review queue

curl http://localhost:8000/review-queue

Submit human review decision

curl -X POST http://localhost:8000/review/FB-2025-102 \
  -H "Content-Type: application/json" \
  -d '{"decision": "approve", "reviewer_id": "ops_team", "notes": "CC-2025-SFX-003 applies"}'

Health check

curl http://localhost:8000/health

Running Tests

# Activate venv
source .venv/bin/activate

# Unit tests only — no Docker, no API calls needed
pytest tests/ -m "not live and not integration" -v

# Live OpenAI tests — needs OPENAI_API_KEY in .env
pytest tests/ -m live -v

# Integration tests — needs docker compose up running
pytest tests/ -m integration -v

# Everything
pytest tests/ -v

Schema Design

Why this structure:

The relational schema separates entities that change at different rates. Carriers and contracts are relatively stable; shipments and BOLs are transactional. The billed_weight_ledger table is the key to detecting over-billing across multiple bills for the same shipment — it accumulates billed weight per shipment so any new bill can instantly check how much has already been invoiced.

freight_bills stores both raw input (what the carrier claimed) and resolved state (what the agent concluded), with evidence_chain as a JSONB blob for the full audit trail.

Tables:

carriers — canonical carrier records with aliases for fuzzy name matching
carrier_contracts — one row per lane within a contract (multi-lane contracts are split into sibling rows)
shipments — delivery orders with expected total weight
bills_of_lading — actual delivery confirmation with received weight
freight_bills — incoming invoices, enriched with agent decisions
billed_weight_ledger — running sum of approved/claimed weight per shipment
review_events — immutable audit log of every state transition
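
To illustrate how a running ledger catches cumulative over-billing, here is a minimal in-memory sketch. The function name `check_and_record`, the dict-based ledger, and the 5% tolerance are assumptions for illustration only; the actual system keeps this state in the billed_weight_ledger table and may apply different limits.

```python
from collections import defaultdict

# shipment_reference -> total weight (kg) already invoiced across all bills
ledger: dict[str, float] = defaultdict(float)


def check_and_record(shipment_ref: str, expected_kg: float, billed_kg: float,
                     tolerance: float = 0.05) -> bool:
    """Record the bill and return True if the cumulative billed weight stays
    within expected_kg * (1 + tolerance); otherwise record nothing and
    return False."""
    cumulative = ledger[shipment_ref] + billed_kg
    if cumulative > expected_kg * (1 + tolerance):
        return False
    ledger[shipment_ref] = cumulative
    return True


# First partial bill for a 1000 kg shipment is accepted.
print(check_and_record("SHP-2025-002", 1000, 850))
# A second bill for 300 kg would bring the total to 1150 kg and is refused.
print(check_and_record("SHP-2025-002", 1000, 300))
```

Each bill looks fine on its own; only the accumulated total reveals the excess, which is why the check reads the ledger rather than the individual invoice.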

Carrier Name Matching

Carrier names on real invoices are messy. The system resolves them in order:

Carrier external ID provided directly — skip matching, look up by external_id. Instant.
Fuzzy match against aliases — each carrier has a list of known alternate names (e.g. "SFX", "Safe Express", "Safexpress Pvt Ltd"). rapidfuzz scores the invoice name against all aliases. If score ≥ 85 → matched. No API call.
LLM fallback — if fuzzy fails, GPT-4o-mini is given the raw name and the list of known carriers. It reasons about which carrier it could be.
Human review — if LLM also fails or returns no match, the bill goes to the human review queue. A reviewer manually resolves it.

The LLM's answer is accepted only if it names a carrier that already exists in the carrier list, so it cannot introduce a hallucinated carrier. When it finds no match, the bill falls through to human review.
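
The resolution order above can be sketched as a single function. This is a simplified illustration, not the project's code: it uses difflib.SequenceMatcher from the standard library as a stand-in for rapidfuzz so the example runs without extra dependencies, and the names `CARRIER_ALIASES`, `fuzzy_score`, and `resolve_carrier`, along with the `llm` callable, are hypothetical. Scores from difflib will not match rapidfuzz scores exactly.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

# Hypothetical carrier table: external_id -> known names (examples from the text).
CARRIER_ALIASES = {
    "CAR001": ["Safexpress Logistics", "SFX", "Safe Express", "Safexpress Pvt Ltd"],
}
FUZZY_THRESHOLD = 85  # cut-off on a 0-100 scale, as described above


def fuzzy_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_carrier(
    name: str,
    external_id: Optional[str] = None,
    llm: Optional[Callable[[str, list[str]], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (carrier_id, method); carrier_id is None when a human must decide."""
    # 1. A known external ID short-circuits everything else.
    if external_id in CARRIER_ALIASES:
        return external_id, "external_id"
    # 2. Best fuzzy score across every alias of every carrier.
    best_id, best = None, 0.0
    for cid, aliases in CARRIER_ALIASES.items():
        for alias in aliases:
            score = fuzzy_score(name, alias)
            if score > best:
                best_id, best = cid, score
    if best >= FUZZY_THRESHOLD:
        return best_id, "fuzzy"
    # 3. LLM fallback; its answer is accepted only if it names a known carrier.
    if llm is not None:
        guess = llm(name, list(CARRIER_ALIASES))
        if guess in CARRIER_ALIASES:
            return guess, "llm"
    # 4. Nothing matched: route to the human review queue.
    return None, "human_review"


print(resolve_carrier("safe express"))
```

The membership check in step 3 is what keeps an invented carrier ID from being accepted: any answer outside the known list is treated the same as no answer.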

Graph Model

Why NetworkX:

The relationships between carriers, contracts, lanes, shipments, and BOLs form a natural graph. The core operation is traversal: "given this carrier and lane, find all valid contracts for this bill date." NetworkX MultiDiGraph models multiple overlapping contracts between the same carrier and lane as parallel edges — which is exactly the ambiguous case that triggers human review.

The graph is built from the database on startup and held in memory as a singleton. It is fast (pure Python dict lookups), requires no additional infrastructure, and is cheap to rebuild when contracts change (currently by restarting the service).

Node types: carrier:{external_id}, contract:{contract_number}, shipment:{shipment_ref}, bol:{bol_ref}

Edge types: HAS_CONTRACT, HANDLES_SHIPMENT, GOVERNS_SHIPMENT, HAS_BOL
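
The core traversal ("given this carrier and lane, find all valid contracts for this bill date") can be sketched with plain Python structures. This is a stand-in for the NetworkX MultiDiGraph, not the project's graph code: a list of edges per source node plays the role of parallel edges. The contract numbers and the edge attribute names (`lane`, `valid_from`, `valid_to`) are hypothetical.

```python
from datetime import date

# Adjacency: source node -> list of (target node, edge attributes).
# A list, rather than a dict keyed by target, permits parallel edges,
# which is the property a MultiDiGraph provides.
edges = {
    "carrier:CAR001": [
        ("contract:CC-A", {"type": "HAS_CONTRACT", "lane": "DEL-BLR",
                           "valid_from": date(2024, 4, 1),
                           "valid_to": date(2025, 3, 31)}),
        ("contract:CC-B", {"type": "HAS_CONTRACT", "lane": "DEL-BLR",
                           "valid_from": date(2025, 1, 1),
                           "valid_to": date(2025, 12, 31)}),
        ("contract:CC-C", {"type": "HAS_CONTRACT", "lane": "DEL-BOM",
                           "valid_from": date(2025, 1, 1),
                           "valid_to": date(2025, 12, 31)}),
    ],
}


def valid_contracts(carrier_id: str, lane: str, bill_date: date) -> list[str]:
    """Return contract nodes reachable from the carrier whose HAS_CONTRACT
    edge covers the given lane and bill date."""
    matches = []
    for target, attrs in edges.get(f"carrier:{carrier_id}", []):
        if (attrs["type"] == "HAS_CONTRACT"
                and attrs["lane"] == lane
                and attrs["valid_from"] <= bill_date <= attrs["valid_to"]):
            matches.append(target)
    return matches


# Two contracts overlap on DEL-BLR in February 2025: the ambiguous case.
print(valid_contracts("CAR001", "DEL-BLR", date(2025, 2, 15)))
```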

Agent Design (LangGraph)

The agent is a StateGraph with 11 nodes. State is persisted to PostgreSQL via langgraph-checkpoint-postgres, which enables the interrupt/resume pattern.

normalize_carrier → check_duplicate → resolve_entities → select_contract
  → run_validation → compute_confidence → make_decision → generate_explanation
  → [if low confidence or ambiguous] human_review_interrupt
  → process_human_decision → generate_explanation → persist_decision

Key design choices:

Duplicate check before validation — avoids running expensive validation + LLM on a bill that will be rejected anyway
select_contract is a decision gate — zero active contracts → dispute; multiple → interrupt; one → proceed
LLM only at two points — carrier normalization (when fuzzy fails) and decision explanation (always). Everything else is deterministic.
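
The select_contract gate described above reduces to a three-way branch on the number of active contracts. A minimal sketch, assuming the candidate contracts arrive as a list; the function name `route_after_select_contract` is hypothetical, and the returned strings are taken from the node names and outcomes in this section.

```python
def route_after_select_contract(active_contracts: list[str]) -> str:
    """Pick the next step based on how many active contracts matched the bill."""
    if not active_contracts:
        return "dispute"                 # zero active contracts
    if len(active_contracts) > 1:
        return "human_review_interrupt"  # overlapping contracts: ambiguous
    return "run_validation"              # exactly one: proceed


print(route_after_select_contract(["CC-A"]))
print(route_after_select_contract(["CC-A", "CC-B"]))
print(route_after_select_contract([]))
```

In a LangGraph StateGraph, a router of this shape, adapted to read the contract list from the graph state, is the kind of function that could be passed to add_conditional_edges.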

Confidence Scoring

The score is a weighted sum of five components, then multiplied by penalty factors:

Component	Weight	What it measures
Entity match	35%	How many of carrier/contract/shipment/BOL were matched
Rate accuracy	25%	How close billed rate is to contracted rate (5% tolerance)
Weight accuracy	20%	Whether the ratio of billed weight to shipment weight falls within 0.95–1.05
Date validity	15%	Whether bill date falls within contract window
Duplicate risk	5%	Whether a prior bill with same number exists

Penalty multipliers (applied multiplicatively):

Unit type mismatch: ×0.70
Fuel surcharge wrong: ×0.85
No shipment ref AND no BOL: ×0.80
LLM normalization used: ×0.90

Decision thresholds:

≥ 0.85, no warnings → auto_approve
< 0.85, or warnings present → human_review_interrupt (agent pauses)
Any blocker rule → dispute
Unknown carrier or no contract → human_review_interrupt
Duplicate detected → reject
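
The scoring and threshold rules above can be sketched as two small functions. This is an illustrative sketch, not the project's implementation: it assumes each component is already normalized to the range 0 to 1, and the dictionary keys, function names, and the precedence order inside `decide` (duplicate, then blocker, then review conditions) are assumptions.

```python
# Weights and penalty factors as listed in the tables above.
WEIGHTS = {
    "entity_match": 0.35,
    "rate_accuracy": 0.25,
    "weight_accuracy": 0.20,
    "date_validity": 0.15,
    "duplicate_risk": 0.05,
}
PENALTIES = {
    "unit_mismatch": 0.70,
    "fuel_surcharge_wrong": 0.85,
    "no_shipment_ref_and_no_bol": 0.80,
    "llm_normalization_used": 0.90,
}
AUTO_APPROVE_THRESHOLD = 0.85


def confidence(components: dict[str, float], flags: set[str]) -> float:
    """Weighted sum of component scores, multiplied by every triggered penalty."""
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    for flag in flags:
        score *= PENALTIES[flag]
    return score


def decide(score: float, *, duplicate: bool = False, blocker: bool = False,
           unresolved: bool = False, warnings: bool = False) -> str:
    """Map a score and rule outcomes to a decision (precedence is assumed)."""
    if duplicate:
        return "reject"
    if blocker:
        return "dispute"
    if unresolved or warnings or score < AUTO_APPROVE_THRESHOLD:
        return "human_review_interrupt"
    return "auto_approve"


perfect = {name: 1.0 for name in WEIGHTS}
score = confidence(perfect, {"llm_normalization_used"})
print(round(score, 2), decide(score))
```

Because penalties multiply, two of them compound quickly: an otherwise perfect bill with both a fuel surcharge error and LLM normalization scores 1.0 × 0.85 × 0.90 = 0.765, which is below the auto-approve threshold.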

Human-in-the-Loop

When the agent reaches human_review_interrupt, the node calls LangGraph's interrupt() (imported from langgraph.types). The sequence is:

Updates freight_bills.status = 'human_review_pending' in the DB
Logs a ReviewEvent with reason and confidence score
Serializes the full graph state to PostgreSQL (via the checkpointer)
Pauses execution — the bill is now in the review queue

POST /review/{id} calls runner.resume_agent() which:

Looks up the langgraph_thread_id from the DB
Calls graph.ainvoke(Command(resume=payload, update=payload), config={"configurable": {"thread_id": thread_id}})
The graph resumes at process_human_decision with the reviewer's decision in state
Flows through generate_explanation → persist_decision → final status written to DB

This is a real interrupt/resume — the agent state is genuinely suspended between calls, not simulated with a status flag.

Note on auto_approved status: Both agent-approved and human-approved bills end up with status: "auto_approved". This just means "cleared for payment." The audit trail (reviewer_id, reviewer_decision, reviewed_at) tells you whether a human was involved.

What I'd Do Differently With More Time

Contract disambiguation heuristic — currently interrupts for human review when multiple contracts overlap. A smarter approach: prefer the contract referenced by the shipment's own contract_id, fall back to human only if still ambiguous.
Task queue — the agent runs as a FastAPI background task, meaning in-flight runs don't survive server restarts. Celery or Cloud Tasks would decouple ingestion from processing.
Graph hot-reload — the in-memory graph is built once on startup. A POST /admin/rebuild-graph endpoint would let contract updates propagate without a restart.
Structured audit trail — evidence_chain is a free-form JSONB blob. A normalized audit table with typed columns would make analytics easier (e.g. "how many bills had rate drift > 5% this month?").
GCP deployment — Cloud Run for the API, Cloud SQL for Postgres, Secret Manager for the API key. The docker-compose setup makes this straightforward.
