JMADIL/HackAI-2026-Project

 
 

⚖️ Mizan — AI Legal Assistant for Morocco

Voice-first, Darija-native legal guidance for underserved Moroccan communities
Free-tier stack · Runs on MacBook M4 · Works on 3G and offline


What makes this different from a chatbot

Most legal AI products are RAG wrappers: user asks a question, the system retrieves chunks, an LLM writes an answer. Mizan has three architectural properties that separate it from that pattern.

1. Structured tool-use output — the LLM cannot produce a free-text answer. It is forced, by tool schema, to emit a typed JSON object containing the answer in Darija, an array of article citations with grounding claims, a confidence score, and a boolean flag for whether a real lawyer is recommended. No regex. No post-hoc parsing. Structured by design.

2. Multi-agent confidence debate — every answer passes through three sequential LLM calls: a primary agent that drafts, a devil's advocate that scores each claim as grounded / hedged / not_in_context, and a synthesis agent that removes unsupported claims and produces a final confidence score shown to the user as a badge.

3. User mental model — a lightweight profile stored per user tracks literacy level, wilaya, topics asked, and low-confidence interaction count. The answer formatter reads this profile and adjusts Darija register, sentence length, and vocabulary complexity. The system gets better at talking to each person individually over time.


Free Stack — Zero Cost, Zero GPU

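Most components run on free tiers or locally on Apple Silicon. The two exceptions in the table below are the legal reasoning LLM (OpenAI, paid API) and the optional ElevenLabs voice, which falls back to the free edge-tts.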

| Layer | Technology | Cost | Why |
|---|---|---|---|
| Mobile | React Native + Expo | Free | Single codebase, Arabic RTL, fast iteration |
| Backend | FastAPI + Uvicorn | Free | Async WebSocket, lightweight |
| Orchestration | Plain Python state machine | Free | Debuggable, no framework lock-in |
| Extraction LLM | Llama 3.3 (via Groq) | Free | Fast intent extraction & structured function calling |
| Legal Reasoning LLM | GPT-5-mini (via OpenAI) | API cost | Primary legal reasoning, counter-arguments, and synthesis |
| Embeddings | multilingual-e5-base | Free | Local execution via SentenceTransformers, handles Darija code-switching |
| Reranker | ms-marco-MiniLM-L-6-v2 | Free | Local execution via CrossEncoder, highly accurate Arabic ranking |
| Vector store | ChromaDB (file-based) | Free | Zero infra, persists to disk |
| Keyword search | rank_bm25 | Free | Catches exact article number citations |
| STT | Whisper Large v3 (via Groq) | Free | Cloud offload for maximum speed and accuracy on Moroccan Arabic |
| TTS | ElevenLabs / edge-tts (ar-MA-JamalNeural) | API / Free | High-fidelity voice synthesis with fallback |
| Audio compression | Opus 16kbps (expo-av) | Free | ~80% smaller than WAV on upload |
| User profile | SQLite | Free | Literacy score, wilaya, feedback log |

What runs on M4: Local Embedding + Reranking (SentenceTransformers) + FastAPI process + ChromaDB reads. Heavy LLM reasoning and STT are offloaded to fast cloud APIs (Groq & OpenAI).

Get your free keys


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        CLIENT (React Native)                      │
│                                                                   │
│  ┌──────────┐   ┌─────────────────┐   ┌──────────────────────┐   │
│  │ Mic btn  │──▶│ Opus 16 kbps    │──▶│  WebSocket / HTTPS   │   │
│  └──────────┘   └─────────────────┘   └──────────────────────┘   │
│                                                  │                │
│  ┌────────────────────────────────────────────────▼────────────┐  │
│  │            Offline cache  (SQLite + RapidFuzz)              │  │
│  │   Hit  → return cached answer with "محفوظ" badge            │  │
│  │   Miss → forward to backend                                 │  │
│  └─────────────────────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │              User mental model  (SQLite)                    │  │
│  │   literacy_score · wilaya · topics_asked · feedback_log     │  │
│  └─────────────────────────────────────────────────────────────┘  │
└─────────────────────────────┬────────────────────────────────────┘
                              │
                              ▼  WebSocket (Opus audio stream)
┌──────────────────────────────────────────────────────────────────┐
│                         BACKEND  (FastAPI)                        │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  STT   Whisper Large v3 (via Groq API)                   │    │
│  │        ~1s for 10s clip · maximum Arabic accuracy        │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            │ Darija transcript                    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Step 1 — Intent classifier  (Llama 3.3 function call)      │    │
│  │    → extracts intent, context, and checks if complete    │    │
│  │    → if missing info: generates clarifying question      │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Step 2 — Hybrid retriever                               │    │
│  │    Filter to domain namespace in ChromaDB                │    │
│  │    BM25 top-20 + multilingual-e5-base top-20             │    │
│  │    CrossEncoder ms-marco reranking → top 6 chunks        │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Step 3 — Multi-agent confidence debate  ★               │    │
│  │                                                          │    │
│  │  Call A — Primary agent  (GPT-5-mini function call #2)      │    │
│  │    Tool: submit_legal_answer                             │    │
│  │    Output: answer_darija · citations[] · confidence      │    │
│  │                     │                                    │    │
│  │                     ▼                                    │    │
│  │  Call B — Devil's advocate  (GPT-5-mini function call #3)   │    │
│  │    Tool: score_claims                                    │    │
│  │    Output: grounded | hedged | not_in_context per claim  │    │
│  │                     │                                    │    │
│  │                     ▼                                    │    │
│  │  Call C — Synthesis agent  (GPT-5-mini function call #4)    │    │
│  │    Removes not_in_context claims                         │    │
│  │    Softens hedged claims → final confidence score        │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Step 4 — Answer formatter                               │    │
│  │    Reads literacy_score from user mental model           │    │
│  │    Adjusts Darija register + sentence complexity         │    │
│  │    If recommend_lawyer → adds lawyer referral banner     │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  TTS   ElevenLabs / edge-tts ar-MA (streamed)            │    │
│  │        Degrades gracefully to text-only if offline       │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Feedback loop                                           │    │
│  │    Thumbs up/down → update literacy_score                │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
         │                    │                     │
         ▼                    ▼                     ▼
  ┌─────────────┐    ┌──────────────┐    ┌───────────────┐
  │ OpenAI/Groq │    │   ChromaDB   │    │    SQLite     │
  │ (4 calls    │    │ (namespaced, │    │ sessions,     │
  │  per query) │    │  file-based) │    │ review queue  │
  └─────────────┘    └──────────────┘    └───────────────┘

Connectivity Tiers — Rural Reliability

Mizan degrades gracefully instead of failing silently. Connectivity is checked before every Llama 3.3 call.

import time

import requests

# Endpoint probed to estimate round-trip latency
PROBE_URL = "https://generativelanguage.googleapis.com"

def check_connectivity() -> str:
    """Returns 'fast', 'slow', or 'offline'."""
    try:
        start = time.monotonic()
        requests.head(PROBE_URL, timeout=4)
        latency = time.monotonic() - start
        return "fast" if latency < 1.5 else "slow"
    except requests.exceptions.RequestException:
        # Covers DNS failures, refused connections, and timeouts
        return "offline"

| Result | Pipeline behaviour |
|---|---|
| fast | Full pipeline: Whisper → LLM pipeline (4 calls) → TTS → voice answer |
| slow | Whisper → LLM pipeline → text answer (TTS skipped to save a round-trip) |
| offline | RapidFuzz fuzzy match against the SQLite cache → "محفوظ" ("saved") badge |
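
As a rough sketch of how the tier could drive the pipeline, the snippet below maps each connectivity result to an ordered list of steps. `PIPELINE_STEPS` and `plan_pipeline` are illustrative names, not part of the codebase, and the fallback for unknown tiers is an assumption.

```python
# Hypothetical mapping from connectivity tier to pipeline steps (names are illustrative).
PIPELINE_STEPS: dict[str, list[str]] = {
    "fast": ["stt", "llm", "tts"],
    "slow": ["stt", "llm"],        # TTS skipped to save a round-trip
    "offline": ["cache_lookup"],   # fuzzy match against the local cache
}

def plan_pipeline(tier: str) -> list[str]:
    """Return the ordered steps for a tier; unknown tiers fall back to offline."""
    return PIPELINE_STEPS.get(tier, PIPELINE_STEPS["offline"])

print(plan_pipeline("slow"))
```

Keeping the mapping in one table means a new tier (for example, a "very slow" tier that also skips reranking) would be a one-line change.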

The Three AI Innovations — Deep Dive

1. Function calling as a hard architectural constraint

The answer generator does not write free text. It is required to call a function:

ANSWER_TOOL = {
    "name": "submit_legal_answer",
    "description": "Submit a grounded legal answer in Darija. You must call this function. Do not produce free text.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer_darija": {
                "type": "string",
                "description": "The full answer in Moroccan Darija"
            },
            "citations": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "article_number":  {"type": "string"},
                        "law_name":        {"type": "string"},
                        "law_code":        {"type": "string"},
                        "claim_supported": {
                            "type": "string",
                            "description": "The specific claim in the answer this article supports"
                        }
                    },
                    "required": ["article_number", "law_name", "claim_supported"]
                }
            },
            "confidence":       {"type": "number"},
            "recommend_lawyer": {"type": "boolean"},
            "answer_register":  {"type": "string", "enum": ["simple", "standard", "technical"]}
        },
        "required": ["answer_darija", "citations", "confidence", "recommend_lawyer"]
    }
}

The same pattern applies to the devil's advocate (score_claims) and the synthesis agent (submit_synthesis). Every LLM call in the pipeline is structured. The backend never parses free text.
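
The schema guarantees that a citations array is present, but it does not by itself prove that each cited article was among the retrieved chunks. One possible post-call check is sketched below. `RetrievedChunk` and `unretrieved_citations` are hypothetical names, and matching on `law_name` plus `article_number` is an assumption based on the required citation fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievedChunk:
    # Hypothetical stand-in for the project's Chunk type
    law_name: str
    article_number: str

def unretrieved_citations(citations: list[dict], chunks: list[RetrievedChunk]) -> list[dict]:
    """Return the citations whose (law_name, article_number) matches no retrieved chunk."""
    retrieved = {(c.law_name, c.article_number) for c in chunks}
    return [
        cit for cit in citations
        if (cit["law_name"], cit["article_number"]) not in retrieved
    ]

chunks = [RetrievedChunk("مدونة الأسرة", "54")]
citations = [
    {"article_number": "54", "law_name": "مدونة الأسرة", "claim_supported": "..."},
    {"article_number": "999", "law_name": "مدونة الأسرة", "claim_supported": "..."},
]
# Only the citation to article 999 is reported as unretrieved
print(unretrieved_citations(citations, chunks))
```

A non-empty result could trigger a retry of the function call or remove the offending citation before the debate step.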

Hackathon demo moment 1: Show the raw function call output in a terminal. Point to the citations array. Say: "The model cannot answer without citing its sources — that constraint is in the API call, not the prompt."


2. Multi-agent confidence debate

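import os

import google.generativeai as genai

# The key must belong to the provider whose SDK is configured here (Gemini), not Groq.
# GOOGLE_API_KEY is an assumed variable name; match it to .env.example.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")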

class DebateLoop:

    def run(self, transcript: str, chunks: list[Chunk], user: UserProfile) -> FinalAnswer:
        primary = self._call_primary(transcript, chunks, user)
        scores  = self._call_devil(primary.answer_darija, chunks)
        final   = self._call_synthesis(primary, scores)
        return final

    def _call_primary(self, transcript, chunks, user) -> PrimaryAnswer:
        response = model.generate_content(
            contents=[{"role": "user", "parts": [build_primary_prompt(transcript, chunks, user.literacy_score)]}],
            tools=[{"function_declarations": [ANSWER_TOOL]}],
            tool_config={"function_calling_config": {"mode": "ANY"}}
        )
        call = response.candidates[0].content.parts[0].function_call
        return PrimaryAnswer(**dict(call.args))

    def _call_devil(self, answer: str, chunks: list[Chunk]) -> ClaimScores:
        response = model.generate_content(
            contents=[{"role": "user", "parts": [build_devil_prompt(answer, chunks)]}],
            tools=[{"function_declarations": [SCORE_CLAIMS_TOOL]}],
            tool_config={"function_calling_config": {"mode": "ANY"}}
        )
        call = response.candidates[0].content.parts[0].function_call
        return ClaimScores(**dict(call.args))

    def _call_synthesis(self, primary: PrimaryAnswer, scores: ClaimScores) -> FinalAnswer:
        response = model.generate_content(
            contents=[{"role": "user", "parts": [build_synthesis_prompt(primary, scores)]}],
            tools=[{"function_declarations": [SYNTHESIS_TOOL]}],
            tool_config={"function_calling_config": {"mode": "ANY"}}
        )
        call = response.candidates[0].content.parts[0].function_call
        return FinalAnswer(**dict(call.args))

The devil's advocate receives the primary answer and the raw retrieved chunks and classifies every factual claim as grounded (directly traceable to a chunk), hedged (plausible but not explicitly stated), or not_in_context (no support in the provided text). Claims classified not_in_context are deleted by the synthesis agent. Claims classified hedged are softened: "ممكن يكون..." rather than stated as fact.

Confidence score formula: (grounded_count / total_claims) × 0.9, capped at 0.6 if any claim was not_in_context. recommend_lawyer is set to true if confidence < 0.65 or total grounded claims < 2.
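
The formula above can be sketched as a small pure function. `score_answer` is an illustrative name, and treating an answer with zero claims as confidence 0.0 is an assumption the text does not cover.

```python
def score_answer(labels: list[str]) -> tuple[float, bool]:
    """Compute (confidence, recommend_lawyer) from per-claim labels."""
    total = len(labels)
    grounded = labels.count("grounded")
    # (grounded_count / total_claims) x 0.9; an answer with no claims gets 0.0 (assumption)
    confidence = (grounded / total) * 0.9 if total else 0.0
    # Any unsupported claim caps the score at 0.6
    if "not_in_context" in labels:
        confidence = min(confidence, 0.6)
    recommend_lawyer = confidence < 0.65 or grounded < 2
    return round(confidence, 2), recommend_lawyer

print(score_answer(["grounded"] * 4))
print(score_answer(["grounded"] * 9 + ["not_in_context"]))
```

Because the score tops out at 0.9 and the referral threshold is 0.65, a lawyer is recommended whenever fewer than roughly 72% of claims are grounded (0.65 / 0.9 ≈ 0.72).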

Hackathon demo moment 2: Show a live question where the devil's advocate flags one claim. Show it disappear from the synthesis output. Say: "Two AI agents argue about every sentence before the user hears it."


3. User mental model and adaptive register

from dataclasses import dataclass, field

@dataclass
class UserProfile:
    user_id:        str
    wilaya:         str
    literacy_score: float = 0.5   # 0 = very simple Darija, 1 = technical register
    topics_asked:   list  = field(default_factory=list)
    low_conf_count: int   = 0
    feedback_log:   list  = field(default_factory=list)

    def update_from_feedback(self, thumbs_up: bool, answer_confidence: float):
        self.feedback_log.append({"up": thumbs_up, "conf": answer_confidence})
        if not thumbs_up and answer_confidence < 0.6:
            self.low_conf_count += 1
        self._recalculate_literacy()

    def _recalculate_literacy(self):
        recent = self.feedback_log[-10:]
        positive_rate = sum(1 for f in recent if f["up"]) / max(len(recent), 1)
        self.literacy_score = 0.8 * self.literacy_score + 0.2 * positive_rate

The answer formatter maps literacy_score to register:

| Score | Register | Style |
|---|---|---|
| 0.0 – 0.35 | Simple | Short sentences, everyday Darija, article numbers not spoken aloud |
| 0.35 – 0.65 | Standard | Normal Darija, article numbers mentioned once, brief steps |
| 0.65 – 1.0 | Technical | Article numbers prominent, legal terms with brief in-line definitions |
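
A minimal sketch of that mapping follows. `register_for` is a hypothetical helper, and because the bands share their endpoints, it assumes that exactly 0.35 and exactly 0.65 belong to the higher band.

```python
def register_for(literacy_score: float) -> str:
    """Map a literacy score in [0, 1] to an answer register."""
    if literacy_score < 0.35:
        return "simple"
    if literacy_score < 0.65:
        return "standard"
    return "technical"

# The two demo profiles: rural farmer (0.2) and Casablanca paralegal (0.8)
print(register_for(0.2), register_for(0.8))
```

The returned strings match the `answer_register` enum in the `submit_legal_answer` schema.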

Hackathon demo moment 3: Demo the same question with two saved profiles (rural farmer vs Casablanca paralegal). The judge sees the system producing genuinely different answers. Then show a 🟡 0.51 confidence answer with the "نصحك تمشي للمحامي" banner. Say: "The system knows what it doesn't know. That's rare in AI products, and in a legal context it matters enormously."


Retrieval Pipeline

Embeddings & Reranking

from sentence_transformers import SentenceTransformer, CrossEncoder

# Run locally on CPU/M4
embedder = SentenceTransformer("intfloat/multilingual-e5-base")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def embed_query(text: str) -> list[float]:
    # e5 requires prefix for queries
    return embedder.encode([f"query: {text}"]).tolist()[0]

def rerank(query: str, chunks: list[Chunk], top_n: int = 5) -> list[Chunk]:
    pairs = [[query, c.text] for c in chunks]
    scores = reranker.predict(pairs)
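    # Sort chunks by descending relevance score and keep the top_n
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]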

Both models run locally through the sentence-transformers library (built on Hugging Face Transformers). No API keys are needed.

Hybrid retriever flow

query
  │
  ├── BM25 (rank_bm25) ──── top 20 (exact article number recall)
  │
  ├── multilingual-e5-base ─ top 20 (semantic Darija similarity)
  │
  └── deduplicate → ms-marco CrossEncoder → top 6 Chunk objects

Domain namespace filtering (one Chroma collection per domain) is applied before vector search, using the classifier's domain output.
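
The deduplication step in the flow above can be sketched without any retrieval library. `merge_candidates` is an illustrative name, and identifying chunks by a string ID is an assumption.

```python
def merge_candidates(bm25_ids: list[str], dense_ids: list[str]) -> list[str]:
    """Union two ranked ID lists, keeping the first occurrence of each ID."""
    seen: set[str] = set()
    merged: list[str] = []
    for chunk_id in bm25_ids + dense_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged

# A chunk found by both retrievers appears only once in the rerank input
print(merge_candidates(["art-54", "art-55"], ["art-55", "art-70"]))
```

The order of the merged list matters little, since the CrossEncoder rescores every candidate; with top-20 from each side the reranker sees at most 40 pairs per query.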


STT and TTS

Speech-to-text — Whisper Large v3 on Groq

from groq import Groq

client = Groq()

def transcribe(audio_path: str) -> str:
    with open(audio_path, "rb") as file:
        transcription = client.audio.transcriptions.create(
            file=(audio_path, file.read()),
            model="whisper-large-v3",
            language="ar",
            response_format="json"
        )
        return transcription.text

Offloaded to Groq's LPU infrastructure: roughly one second for a 10-second clip, using the largest Whisper model for the best available accuracy on Moroccan Darija accents.

Text-to-speech — edge-tts (no key required)

import edge_tts, asyncio

async def speak(text: str, output_path: str):
    communicate = edge_tts.Communicate(text, voice="ar-MA-JamalNeural")
    await communicate.save(output_path)

async def stream_speak(text: str):
    communicate = edge_tts.Communicate(text, voice="ar-MA-JamalNeural")
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            yield chunk["data"]

TTS requires connectivity. If offline, the backend sends text only and the app renders it with a "no audio" indicator.


System Prompts

Primary agent

أنت "ميزان"، مساعد قانوني مغربي.
دورك تساعد المواطنين المغاربة يفهمو حقوقهم القانونية.

القواعد الأساسية:
1. جاوب دايما بالدارجة المغربية.
2. ما تعطيش معلومة قانونية إلا إذا كانت موجودة في النصوص اللي عطيتيلها.
3. استعمل دايما الأداة submit_legal_answer — ما تكتبش جواب حر أبدا.
4. كل ادعاء في جوابك خاصه يكون مرتبط بفصل أو مادة من النصوص.
5. ما تتصوريش أنك محامي — أنت كتعطي معلومات، مش مشورة قانونية.

Devil's advocate

You are a strict legal fact-checker. You receive a Darija answer and the
source chunks it was supposed to be grounded in.

For every factual claim in the answer, classify it as:
- grounded: directly and explicitly supported by one of the provided chunks
- hedged: plausible from the chunks but not directly stated
- not_in_context: not present in any of the provided chunks

Be strict. A claim is only "grounded" if it can be traced word-for-word to
a chunk. Use the score_claims function. Do not produce free text.

Synthesis agent

You are a legal answer editor. You receive a primary Darija answer and
claim scores from a fact-checker.

Rules:
1. Remove all claims classified as not_in_context entirely.
2. Soften hedged claims: prefix with "ممكن يكون..." or "على الأغلب..."
3. Keep all grounded claims unchanged.
4. Compute confidence: (grounded_count / total_claims) × 0.9
   If any claim was not_in_context, cap confidence at 0.6.
5. Set recommend_lawyer to true if confidence < 0.65 or grounded_count < 2.
6. Use the submit_synthesis function only.

Knowledge Base

Sources

| Domain | Source | Language |
|---|---|---|
| Family law | Moudawana 2004 + 2025 proposed amendments | Arabic |
| Land and property | Dahir Foncier 1913 + amendments, Collective Land Law, Habous/Waqf | Arabic / French |
| Labour | Code du Travail Law 65-99, CNSS rights, seasonal worker regulations | Arabic / French |
| Civil and debt | Code des Obligations et Contrats, micro-loan agreements | French → translated |
| Procedure | Tribunal locations by wilaya, fee schedules, legal aid contacts | Arabic |

Chunking strategy

One chunk = one article. Legal text has a natural semantic unit — the article — and preserving it keeps citation grounding traceable. Include the article title and first sentence of adjacent articles as prefix/suffix for cross-article context.
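
A minimal sketch of article-level chunking is shown below. It assumes the cleaned plain text marks each article with a line of the form `Article N`; that marker format, the `chunk_articles` name, and the context field names are all assumptions. For brevity it attaches only the neighbours' first sentences and omits article titles.

```python
import re

# Assumed marker format: each article starts with a line such as "Article 54"
ARTICLE_RE = re.compile(r"^Article[ \t]+(\d+)[ \t]*$", re.MULTILINE)

def first_sentence(text: str) -> str:
    """Return the text up to and including the first full stop."""
    head, sep, _ = text.partition(".")
    return (head + sep).strip()

def chunk_articles(raw: str) -> list[dict]:
    """Split on article markers; attach neighbours' first sentences as context."""
    parts = ARTICLE_RE.split(raw)
    # parts = [preamble, number_1, body_1, number_2, body_2, ...]
    articles = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]
    chunks = []
    for idx, (number, body) in enumerate(articles):
        chunks.append({
            "article_number": number,
            "text": body,
            "prev_context": first_sentence(articles[idx - 1][1]) if idx > 0 else "",
            "next_context": first_sentence(articles[idx + 1][1]) if idx + 1 < len(articles) else "",
        })
    return chunks

sample = "Article 1\nFirst rule. More detail.\nArticle 2\nSecond rule. Extra.\n"
for chunk in chunk_articles(sample):
    print(chunk)
```

Keeping the neighbour context in separate fields, rather than concatenating it into `text`, leaves the cited article's own wording intact for the devil's advocate to check against.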

Metadata per chunk:

{
  "article_number": "54",
  "law_name": "مدونة الأسرة",
  "law_code": "moudawana",
  "domain": "family_law",
  "topic_tags": ["divorce", "khul", "talaq"],
  "language": "ar",
  "publication_date": "2004-02-05",
  "source_url": "https://..."
}

publication_date powers a staleness warning: chunks older than 12 months append "هاد القانون ممكن يكون تبدل — شوف أحدث نسخة" to the answer.

Running ingestion

python scripts/ingest_legal_data.py --domain family_law
python scripts/ingest_legal_data.py --domain land
python scripts/ingest_legal_data.py --domain labour
python scripts/ingest_legal_data.py --domain civil_debt

# Or all at once
python scripts/ingest_legal_data.py --all

# Verify counts
python scripts/ingest_legal_data.py --stats

Project Structure

mizan/
├── backend/
│   ├── main.py                     # FastAPI app, REST + WebSocket
│   ├── database.py                 # SQLAlchemy engine and session
│   ├── schemas.py                  # Pydantic models for all components
│   ├── models/                     # SQLAlchemy models (User, Dossier, etc.)
│   ├── routes/                     # API endpoints (Auth, Intake, Dossiers)
│   ├── agent/
│   │   ├── loop.py                 # Outer orchestration (STT -> Intent -> Debate)
│   │   ├── classifier.py           # Intent classification (Llama 3.3)
│   │   ├── formatter.py            # Literacy-aware register adaptation
│   │   └── debate/
│   │       ├── primary.py          # Primary agent drafter
│   │       ├── devil.py            # Devil's advocate scorer
│   │       └── synthesis.py        # Final synthesis agent
│   ├── knowledge/
│   │   ├── ingest.py               # Document chunking pipeline
│   │   ├── retriever.py            # Hybrid BM25 + Vector search
│   │   └── vector_store.py         # ChromaDB integration
│   ├── speech/
│   │   ├── stt.py                  # Whisper Large v3 (via Groq API)
│   │   └── tts.py                  # edge-tts streaming
│   └── profile/
│       └── model.py                # UserProfile persistence (SQLite)
├── mobile/
│   ├── App.tsx                     # Main entry
│   ├── src/
│   │   ├── features/               # Auth, Intake, Dossier features
│   │   └── shared/                 # API client, components
├── scripts/
│   └── ingest_legal_data.py        # CLI for knowledge base ingestion
├── tests/
│   ├── test_debate_loop.py         # Testing the agentic logic
│   └── test_user_routes.py         # Testing API endpoints
├── .env.example
├── Dockerfile
└── docker-compose.yaml

Setup & Installation

Mizan is optimized for native execution so that the local embedding and reranking models can use the MacBook M4's hardware acceleration; speech and LLM reasoning are offloaded to cloud APIs.

1. Prerequisites

  • Python 3.11+
  • Node.js 20+ (npm)
  • API Keys: Groq, Cohere, and Google (see .env.example)

2. Quick Start

Use the provided Makefile for a streamlined setup:

# Install all dependencies
make setup

# Run both Backend & Mobile
make dev

3. Manual Startup

If you prefer separate terminals:

Backend (FastAPI):

python3 -m backend.main

Mobile (Expo):

cd mobile
npx expo start

🛠️ Development Tools

  • make clean: Removes caches and local database.
  • make clean-ports: Force-kills processes on common dev ports (8000, 8081).
  • make backend: Runs only the API.
  • make mobile: Runs only the Expo server.

Evaluation

# Retrieval quality — target Precision@5 ≥ 0.75
python tests/test_retrieval.py --domain family_law

# Debate loop — injects known hallucinated claims, verifies they are removed
python tests/test_debate_loop.py

# Tool compliance — Llama 3.3 must always call the function, never produce free text
python tests/test_tool_schemas.py

# Register adaptation — verifies simpler output for literacy_score < 0.35
python tests/test_register_adapter.py

# End-to-end — 20 scripted conversations, checks classification + citations + register
python tests/test_agent_loop.py

Known Limitations

| Limitation | Mitigation |
|---|---|
| Darija orthography is not standardised | Character-level normalisation at ingestion; Cohere Embed handles variant spellings acceptably |
| Legal text is in MSA Arabic and French | French texts translated to Arabic at ingestion; MSA-heavy answers flagged for review |
| Llama 3.3 free tier: 15 req/min | Sufficient for demo and small-scale use; rate-limit error returns cached answer if available |
| TTS requires connectivity | Text always shown alongside audio; offline mode returns text only |
| No real lawyer validation yet | Answers with confidence < 0.6 pushed to async SQLite review queue for partnered pro-bono lawyers |
| 2025 Moudawana reform still proposed | Chunks labelled status: proposed; answers using them carry "هاد القانون ما زال مشروع" warning |
| Whisper medium Arabic WER ~8–12% | Acceptable for legal queries; clarifier asks follow-up if classifier confidence < 0.7 |

Hackathon Pitch — Three Demo Moments

Problem (30 sec): 60% of Moroccans live in areas with fewer than 1 lawyer per 10,000 people. Legal aid is urban, expensive, and delivered in French or MSA, so rural citizens are effectively shut out. Mizan changes that: voice-first, Darija-native, and honest about its own confidence. And it runs almost entirely on free tiers.

Moment 1 — Structured function calling: Ask a question live. Pause and show the raw submit_legal_answer JSON in a terminal. Point to the citations array. Say: "The model cannot answer without citing its sources. That constraint is in the API call, not the prompt."

Moment 2 — Debate in action: Show a question where the devil's advocate flags one claim as not_in_context. Show it disappear from the synthesis output. Say: "Two AI agents argue about every sentence before the user hears it."

Moment 3 — Confidence badge + register: Ask the same question with two profiles: rural farmer (literacy 0.2, Khénifra) and Casablanca paralegal (literacy 0.8). Show the judge two completely different answers — different vocabulary, different sentence structure, different use of article numbers. Then show a 🟡 0.51 confidence answer with the "نصحك تمشي للمحامي" banner. Say: "The system knows what it doesn't know. That's rare in AI products, and in a legal context it matters enormously."

Target demo metrics:

  • Latency: < 8 seconds end-to-end on 4G (~4 s for Whisper including upload + ~3 s for the four LLM calls combined)
  • Retrieval Precision@5: ≥ 0.75 on labeled test set
  • Citation accuracy: 100% — architecturally impossible to hallucinate citations not in retrieved chunks
  • Debate loop false-negative rate: < 10%

Task Split — 20-Hour Build Plan

Four developers total: three AI engineers splitting the backend brain, one app developer owning everything the judge sees and touches. All four can work in parallel from hour one with minimal blocking.

Before anyone writes a single function, spend 30 minutes writing backend/types.py together — the shared Chunk, UserProfile, FinalAnswer, and WebSocketMessage dataclasses. Every interface contract below depends on these types being agreed upfront.


AI Dev 1 — Multi-Agent Debate Loop 🤖

Owns: backend/agent/debate/, backend/tools/, backend/prompts/

| Task | Description | Est. |
|---|---|---|
| Function schemas | Write and validate all three schemas: submit_legal_answer, score_claims, submit_synthesis. Add a test confirming Llama 3.3 always calls the function and never emits free text; run this before touching anything else. | 1.5 h |
| System prompts | Write and iterate the three prompts: primary agent in Darija, devil's advocate in English, synthesis mixing both. Tune devil's advocate strictness until the flag rate on test questions is realistic (not 0%, not 80%). | 1.5 h |
| Primary agent | debate/primary.py: Llama 3.3 function call #2. Accepts transcript + chunks + literacy_score, returns PrimaryAnswer. | 1 h |
| Devil's advocate | debate/devil.py: Llama 3.3 function call #3. Receives primary answer + raw chunks. Be strict: grounded means word-for-word traceable only. | 1.5 h |
| Synthesis agent | debate/synthesis.py: Llama 3.3 function call #4. Deletes not_in_context claims, softens hedged ones, computes confidence and recommend_lawyer. | 1 h |
| Debate orchestrator | debate/loop.py: wires calls A → B → C, handles retries if a function call fails, returns FinalAnswer. | 0.5 h |

Total: ~7 h


AI Dev 2 — Knowledge Base & Retrieval 📚

Owns: backend/knowledge/, backend/cache/, scripts/

| Task | Description | Est. |
|---|---|---|
| Source collection | Download and clean Moudawana, Code du Travail, Dahir Foncier, Code des Obligations. Export to structured plain-text with article boundaries marked. Start immediately; this runs in the background. | 1.5 h |
| Article-level chunker | ingest.py: split at article boundaries, attach full metadata. One chunk = one article. | 2 h |
| Cohere embedder | embedder.py: Cohere Embed v3 multilingual wrapper. Batch-embed articles, upsert into ChromaDB with one collection per domain. | 1 h |
| BM25 index | One index per domain using rank_bm25. Persist to disk. Load on FastAPI startup. | 1 h |
| Hybrid retriever | retriever.py: BM25 top-20 + Cohere Embed top-20 → deduplicate → Cohere Rerank → top 5 Chunks. Public interface: retriever.retrieve(domain, query). | 1.5 h |
| Offline cache | offline_cache.py: SQLite with top-50 Q&A pairs per domain, RapidFuzz fuzzy matcher, returns answer + is_cached: true. | 1 h |

Total: ~8 h


AI Dev 3 — Orchestration, Voice Pipeline & User Model 🔗

Owns: backend/main.py, backend/agent/loop.py, backend/agent/classifier.py, backend/speech/, backend/profile/

| Task | Description | Est. |
|---|---|---|
| FastAPI server | main.py: WebSocket endpoint, startup events (load BM25, warm ChromaDB), error handling, CORS. Write this first; App Dev needs something to connect to. | 1.5 h |
| STT adapter | speech/stt.py: mlx-whisper wrapper. Accepts audio binary, emits Darija transcript. | 1 h |
| TTS adapter | speech/tts.py: edge-tts ar-MA-JamalNeural. Stream first sentence back before full answer is assembled. Degrade to text if offline. | 1 h |
| Intent classifier | classifier.py: Llama 3.3 function call #1. Outputs domain, intent, confidence, missing_context. | 1 h |
| Clarifier | clarifier.py: if confidence < 0.7, generate a Darija follow-up question, wait for second input, re-run classifier with enriched context. | 0.5 h |
| Main agent loop | loop.py: state machine: transcript → classify → maybe_clarify → retrieve → debate → format → TTS → send FinalAnswer over WebSocket. | 1 h |
| Answer formatter | formatter.py: reads user.literacy_score, maps to register, adjusts Darija sentence complexity and article number prominence. | 1 h |
| User mental model | profile/model.py: UserProfile dataclass, SQLite persistence, EMA literacy recalculation. Expose GET /profile/{id} and POST /profile/{id}/feedback. | 0.5 h |

Total: ~7.5 h


App Dev — Mobile UI & Demo Polish 📱

Owns: mobile/

| Task | Description | Est. |
|---|---|---|
| Project setup | Expo init, Arabic RTL config (I18nManager.forceRTL(true)), NativeWind setup, navigation skeleton. | 1 h |
| Audio recording | services/audio.ts: tap-to-record with expo-av, encode to Opus 16kbps, stream binary chunks over WebSocket. Show live waveform while recording. | 1.5 h |
| WebSocket client | services/ws.ts: connect on launch, handle reconnect, parse incoming FinalAnswer JSON and debate step events. | 1 h |
| HomeScreen | Mic button centred on screen, waveform animation, Darija label "اضغط وتكلم", spinner "كيفكر…" while waiting. | 1 h |
| DebatingScreen | The pitch moment. Animate three-step timeline in real time: "الوكيل الأول كيكتب الجواب… المحكم كيراجع… التوليف…" with real elapsed-time stamps streamed from backend. Make this screen visually memorable. | 2 h |
| AnswerScreen | Full answer in Darija (RTL, large readable font). Confidence badge (🟢 ≥ 0.75 / 🟡 0.5–0.75 / 🔴 < 0.5). Collapsible "المصادر". "نصحك تمشي للمحامي" banner when flagged. Thumbs up/down. | 2 h |
| Offline mode | NetInfo check on launch. Route query to cached answers and show "محفوظ" badge if offline. | 0.5 h |
| Two-profile toggle | Hidden dev button (triple-tap on logo) switching between "rural farmer" (literacy 0.2, Khénifra) and "Casablanca paralegal" (literacy 0.8). This toggle must exist before rehearsal. | 0.5 h |
| Demo rehearsal | Run the three pitch moments end-to-end on the demo device 10 times. | 1 h |

Total: ~10.5 h


Synchronisation Points

| Time | What must be true | Who |
|---|---|---|
| Hour 0 | backend/types.py written and committed: Chunk, UserProfile, FinalAnswer, WebSocketMessage. | All 4 |
| Hour 2 | AI Dev 2 has family_law domain ingested, retriever.retrieve() returning real chunks. AI Dev 3 has FastAPI running, App Dev can send audio and receive a mock response. | AI 2, AI 3, App |
| Hour 5 | AI Dev 1 has debate loop producing FinalAnswer with real confidence scores. First end-to-end text test: type a query in Python → get a FinalAnswer with citations. | AI 1, AI 3 |
| Hour 7 | First full voice-in → Darija-audio-out call working on the demo device, even if rough. | All 4 |
| Hour 9 | Demo rehearsal. All three moments in sequence, timed. | All 4 |

Addons — If Core Is Done Early ✨

Ranked by demo impact. Do in order, not in parallel.

| Priority | Addon | Owner | Description |
|---|---|---|---|
| 1 | Debate timeline with real timestamps | App Dev | DebatingScreen shows actual elapsed milliseconds per step streamed from backend. Turns a loading screen into a window into the AI's reasoning. |
| 2 | Wilaya-aware tribunal lookup | AI Dev 2 | SQLite table mapping Morocco's 12 regions to tribunal address, phone, and hours. Formatter appends the right one based on user.wilaya. |
| 3 | Staleness warning | AI Dev 2 + App Dev | Surface "هاد القانون ممكن يكون تبدل" when chunk publication_date > 12 months. One metadata check, one UI label. |
| 4 | BM25 Darija normalisation | AI Dev 2 | Normalise orthographic variants before BM25 indexing (ة → ه, أ/إ/آ → ا). Measurably improves keyword recall on real queries. |
| 5 | Pro-bono review queue | AI Dev 3 + App Dev | Backend pushes low-confidence answers to SQLite queue. Minimal admin screen lists flagged answers for a volunteer lawyer to review. |
| 6 | Confidence history sparkline | App Dev | Tiny chart in profile screen showing confidence trend across last 10 queries. |
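
The BM25 normalisation addon (priority 4) is small enough to sketch in full. The character folding comes from the addon description; `normalise_for_bm25` is a hypothetical name, and whitespace tokenisation is an assumption. The same function would need to run on both the indexed articles and the incoming query.

```python
# Character folding from the addon description: ة → ه and أ/إ/آ → ا
_FOLD = str.maketrans({"ة": "ه", "أ": "ا", "إ": "ا", "آ": "ا"})

def normalise_for_bm25(text: str) -> list[str]:
    """Fold orthographic variants, then split on whitespace for BM25 indexing."""
    return text.translate(_FOLD).split()

print(normalise_for_bm25("مدونة الأسرة"))
```

The resulting token lists are the shape rank_bm25 expects for both its corpus and its queries.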

License

MIT. Legal texts are public domain (official Moroccan legislation published by the government).


Contributing

Priority areas:

  1. Additional legal domains — criminal procedure, commercial law, tenancy
  2. Darija normalisation dictionary for orthographic variants
  3. Pro-bono lawyer review queue UI
  4. Whisper medium fine-tune on Moroccan Arabic court recordings
  5. Offline cache expansion beyond 50 pairs per domain

About

Mizan is an AI voice assistant that turns complex written law into simple spoken Darija, giving citizens free legal clarity while delivering structured, ready-to-go case briefs to subscribed lawyers.
