A multimodal reasoning agent that manages your personal finances over WhatsApp, the web, and Azure AI Foundry — built on Microsoft Foundry.
Users organize money into mini-boxes (envelope budgeting) with percentage-based allocation. They talk to the agent in natural language — text, voice notes, photos of receipts, or documents (PDF / Excel / Word) — through WhatsApp or a web chat, and the agent reasons over their real data with audited, server-guarded tools. Those same tools are now exposed to Azure AI Foundry agents through a remote MCP server.
🏆 Agents League Hackathon — Reasoning Agents track (Microsoft Foundry / Azure OpenAI)
| | Capability | Channels |
|---|---|---|
| 🎙️ | Voice notes → transcribed with `gpt-4o-transcribe`, then reasoned over | WhatsApp · Web |
| 📸 | Photos / receipts → understood by `gpt-4o` vision; the agent extracts amount, merchant and date and proposes a transaction | WhatsApp · Web |
| 📄 | Documents → PDF, Word (DOCX), Excel (XLSX) and CSV parsed server-side and read by the agent (bank statements, expense sheets) | WhatsApp · Web |
| 🔌 | MCP server → Azure AI Foundry agents call the finance tools over the Model Context Protocol, with no direct database access | Azure AI Foundry |
Every modality flows through the same audited tool layer and the same server-side guardrails — user text, receipts and documents are treated as data, never as instructions.
| Pattern | Where it lives |
|---|---|
| Planner-Executor | The agent decides which tools to call and in what order (streamText + tool loop, hard cap 8 steps) |
| Adaptive clarification loop | Ambiguity is never discarded: the agent asks short, targeted questions and resumes with conversation memory |
| Critic / Verifier | Expenses ≥ S/100 and voice-transcribed amounts require explicit user confirmation — enforced server-side, not by prompt |
| Role-based specialization | Parser (fast-path / gpt-4o-mini) · Consultant (read tools) · Registrar (write tools with confirmation) |
| Multimodal grounding | Images → vision parts; documents → server-side text extraction injected as context; both are bounded and stripped from history to control token cost |
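The Critic / Verifier row can be illustrated with a small guard. This is a minimal sketch, not the project's actual code: the `ProposedExpense` shape, the `guardWrite` name and the `confirmation_required` reason string are hypothetical, and only the S/100 threshold and the voice rule come from the table above.

```typescript
// Hypothetical shapes for illustration; the real contracts live in packages/contracts.
type Source = "text" | "voice" | "image" | "document";

interface ProposedExpense {
  amount: number;      // amount in soles, e.g. 120.5
  source: Source;      // where the amount came from
  confirmed: boolean;  // true only after an explicit user confirmation
}

const CONFIRMATION_THRESHOLD = 100; // S/100, as stated in the table

// Expenses at or above the threshold, and any voice-transcribed amount,
// need an explicit confirmation before a write tool may run.
function requiresConfirmation(e: ProposedExpense): boolean {
  return e.amount >= CONFIRMATION_THRESHOLD || e.source === "voice";
}

// Runs on the server, so the rule holds even if the model ignores its prompt.
function guardWrite(e: ProposedExpense): { allowed: boolean; reason?: string } {
  if (requiresConfirmation(e) && !e.confirmed) {
    return { allowed: false, reason: "confirmation_required" };
  }
  return { allowed: true };
}
```

Because the check lives in the tool layer rather than in the prompt, a prompt-injected "skip confirmation" instruction would have no effect on the outcome.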
| Input | How it's handled | Limits |
|---|---|---|
| Audio | Transcribed via `gpt-4o-transcribe`, then fed as text (user can review before sending on web) | 5 MB |
| Image | Passed inline to `gpt-4o` vision; ephemeral (only metadata persisted, never the binary) | 2 images, 4 MB each |
| Document | Extracted to text server-side (`pdf-parse`, `mammoth`, `xlsx`, RFC-4180 CSV) and injected as a text part | 1 doc, 8 MB, PDF ≤ 30 pages |
Scanned/image-only PDFs (no text layer) are detected and reported instead of feeding garbage to the model. Image and document binaries are never stored — conversation history keeps only a [image: …] / [document: …] placeholder.
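The limits in the table above could be enforced with a check along these lines. It is a sketch under assumptions: the `Attachment` shape, the error codes and the choice of 1 MB = 1024 × 1024 bytes are illustrative and not taken from the codebase.

```typescript
const MB = 1024 * 1024; // assumption: binary megabytes

// Hypothetical attachment descriptor; only sizes and counts matter here.
interface Attachment {
  kind: "audio" | "image" | "document";
  bytes: number;
  pdfPages?: number; // set only for PDFs, once the page count is known
}

// Limits copied from the table above.
const LIMITS = {
  audio: { maxBytes: 5 * MB },
  image: { maxBytes: 4 * MB, maxCount: 2 },
  document: { maxBytes: 8 * MB, maxCount: 1, maxPdfPages: 30 },
};

// Returns a list of error codes; an empty list means the message is acceptable.
function validateAttachments(items: Attachment[]): string[] {
  const errors: string[] = [];
  const images = items.filter((a) => a.kind === "image").length;
  const docs = items.filter((a) => a.kind === "document").length;
  if (images > LIMITS.image.maxCount) errors.push("too_many_images");
  if (docs > LIMITS.document.maxCount) errors.push("too_many_documents");
  for (const a of items) {
    if (a.bytes > LIMITS[a.kind].maxBytes) errors.push(`${a.kind}_too_large`);
    if (
      a.kind === "document" &&
      a.pdfPages !== undefined &&
      a.pdfPages > LIMITS.document.maxPdfPages
    ) {
      errors.push("pdf_too_many_pages");
    }
  }
  return errors;
}
```

Rejecting oversized media before transcription or extraction keeps both memory use and token cost bounded.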
The agent's finance tools are exposed to external Foundry agents through a standalone remote MCP server (apps/mcp-server) that never touches the database — it only validates auth and forwards to a secure internal REST layer on the backend.
```mermaid
flowchart LR
  F[Azure AI Foundry Agent] -->|Authorization: Bearer MCP_AUTH_TOKEN| MCP[apps/mcp-server · POST /mcp · Streamable HTTP]
  MCP -->|x-agent-tool-key| API[NestJS · /api/agent-tools/*]
  API --> EX[AgentToolExecutorService · shared + audited]
  EX --> SVC[Boxes · Transactions · Users services]
  SVC --> DB[(PostgreSQL)]
```
- Two auth layers: Foundry → MCP (`Bearer`) and MCP → backend (`x-agent-tool-key`). Both fail closed.
- Identity is server-resolved: `userId` is never accepted from Foundry, the MCP server, or any request body; it is resolved inside the backend. No tool exposes a `userId` argument.
- One source of truth: the in-app agent and the MCP/REST path share the same `AgentToolExecutorService`, so confirmation thresholds, financial rules and the `tool_audits` trail are never duplicated.
- Sanitized: no stack traces, secrets or raw errors are ever returned to the agent.
- MVP exposes 3 of the 12 tools (`getBoxBalances`, `queryTransactions`, `registerTransaction`); the structure is ready for the rest.

See `apps/mcp-server/README.md` for the full deploy + Foundry connection guide.
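One reading of "fail closed" for the Foundry → MCP layer is that a missing or empty server-side token denies every request instead of disabling the check. The sketch below illustrates that reading only; `checkBearer`, the `AuthResult` type and the status codes are hypothetical and not taken from `apps/mcp-server`.

```typescript
// Hypothetical result type for illustration.
type AuthResult = { ok: true } | { ok: false; status: 401 | 503 };

// header:   raw Authorization header from the incoming request
// expected: the configured token (e.g. the value of MCP_AUTH_TOKEN)
function checkBearer(header: string | undefined, expected: string | undefined): AuthResult {
  // Fail closed: a server without a configured token rejects everything.
  if (!expected) return { ok: false, status: 503 };

  const prefix = "Bearer ";
  if (!header || !header.startsWith(prefix)) return { ok: false, status: 401 };

  const token = header.slice(prefix.length);
  // Plain equality keeps the sketch short; a constant-time comparison such as
  // Node's crypto.timingSafeEqual would be preferable in production.
  return token === expected ? { ok: true } : { ok: false, status: 401 };
}
```

The same pattern would apply to the `x-agent-tool-key` check on the MCP → backend hop.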
- User isolation: `userId` is injected by the backend (session / phone number / server-side env), never by the model or a remote caller. Every tool is scoped.
- Zero invented figures: the agent answers only with tool results; user text, receipts and documents are data, not instructions (prompt-injection resistant).
- Reasoning trail: every tool call (name, args, result) is audited in `tool_audits` and visible in the dashboard (`/agente`), giving full replayability across all channels and the MCP path.
- Financial integrity: amounts in `numeric(12,2)`, splits in integer cents (largest-remainder, exact sums), balances derived via `SUM()` (never stored). Deletes are soft (`voided`).
- Hard iteration cap (8 steps), webhook idempotency by WhatsApp message id, and bounded media sizes to protect memory and token budget.
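The largest-remainder split mentioned under financial integrity can be sketched as follows. This assumes integer percentages that sum to 100 and breaks ties in favour of the earlier box; the function name and those two choices are illustrative, not confirmed by the source.

```typescript
// Splits an integer amount of cents across boxes by percentage so that the
// parts always add up to the total exactly (largest-remainder method).
function splitCents(totalCents: number, percentages: number[]): number[] {
  if (!Number.isInteger(totalCents) || totalCents < 0) {
    throw new RangeError("totalCents must be a non-negative integer");
  }
  if (!percentages.every((p) => Number.isInteger(p) && p >= 0)) {
    throw new RangeError("percentages must be non-negative integers");
  }
  if (percentages.reduce((a, b) => a + b, 0) !== 100) {
    throw new RangeError("percentages must sum to 100");
  }

  // Integer arithmetic only: floor share plus the remainder out of 100.
  const shares = percentages.map((p, i) => {
    const numerator = totalCents * p;
    return { i, cents: Math.floor(numerator / 100), rem: numerator % 100 };
  });

  // Cents lost to flooring; always fewer than the number of boxes.
  let leftover = totalCents - shares.reduce((sum, s) => sum + s.cents, 0);

  // Hand out one cent each, largest remainder first (earlier box wins ties).
  const order = [...shares].sort((a, b) => b.rem - a.rem || a.i - b.i);
  for (const s of order) {
    if (leftover === 0) break;
    s.cents += 1;
    leftover -= 1;
  }
  return shares.map((s) => s.cents);
}

splitCents(101, [50, 30, 20]); // → [51, 30, 20]
splitCents(1, [33, 33, 34]);   // → [0, 0, 1]
```

Working in integer cents avoids floating-point drift, and the leftover step is what guarantees that the parts sum exactly to the original amount.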
```mermaid
flowchart LR
  U[User · WhatsApp] -->|text · voice · image · document| EV[Evolution API]
  EV -->|webhook| BE[NestJS API]
  PWA[React PWA · streaming chat] <-->|REST + UI message stream| BE
  F[Azure AI Foundry Agent] -->|MCP · Bearer| MCP[apps/mcp-server]
  MCP -->|x-agent-tool-key| BE
  BE -->|voice OGG| TR[gpt-4o-transcribe]
  BE -->|fast-path regex| FP[No-AI path · <10ms]
  BE -->|reasoning + vision| AG[Reasoning Agent · AI SDK]
  AG -->|tool calls| FOUNDRY[Microsoft Foundry · gpt-4o]
  AG -->|scoped tools + audit| DB[(PostgreSQL)]
  BE --> EV
```
One agent, three surfaces: WhatsApp, the web chat, and Azure AI Foundry share the same brain, the same tools, and the same audited execution. The WhatsApp thread is pinned in the web UI — you can continue the conversation from either side.
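On the WhatsApp path, the same webhook event may arrive more than once, which is why processing is keyed by the WhatsApp message id. A minimal in-memory sketch of that idea follows; the `SeenMessages` class and its TTL are assumptions for illustration, and a persistent store (for example a unique constraint in PostgreSQL) would likely be needed across restarts or multiple instances.

```typescript
// Remembers message ids for a limited time so a redelivered webhook is ignored.
class SeenMessages {
  private readonly seen = new Map<string, number>(); // id -> first-seen time (ms)
  private readonly ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an id is seen, false for a duplicate.
  firstTime(id: string, now: number = Date.now()): boolean {
    // Drop expired entries so the map stays bounded.
    for (const [key, seenAt] of this.seen) {
      if (now - seenAt > this.ttlMs) this.seen.delete(key);
    }
    if (this.seen.has(id)) return false;
    this.seen.set(id, now);
    return true;
  }
}
```

A webhook handler would call `firstTime(messageId)` before doing any work and acknowledge duplicates without re-running the agent, so a retried delivery cannot register the same transaction twice.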
| Path | Package | What |
|---|---|---|
| `apps/api` | `@app/api` | NestJS 11 backend: agent, tools, internal tool API, channels |
| `apps/web` | `@app/web` | React 19 + Vite 7 PWA: streaming chat, dashboard, boxes |
| `apps/mcp-server` | `@app/mcp-server` | Remote MCP server bridging Azure AI Foundry to the tool API |
| `packages/contracts` | `@app/contracts` | Shared Zod schemas / types (single source of truth) |
| `packages/i18n` | `@app/i18n` | i18n resources + money formatting |
| `packages/tsconfig` | `@app/tsconfig` | Shared TypeScript configs |
NestJS 11 · React 19 + Vite · TailwindCSS v4 · PostgreSQL 16 (TypeORM, migration-first) · AI SDK + @ai-sdk/azure (Microsoft Foundry) · @modelcontextprotocol/sdk (remote MCP) · Evolution API (WhatsApp) · pnpm monorepo with shared Zod contracts.
```bash
cp .env.example .env                              # defaults work out of the box
pnpm install
pnpm --filter "./packages/**" build               # build shared packages first (contracts + i18n)
pnpm db:up && pnpm migration:run && pnpm seed     # seeds demo data
pnpm dev                                          # API :3000 · Web :5173
```

Log in with the seeded demo account (`ADMIN_EMAIL` / `ADMIN_PASSWORD` from `.env`). The app boots without AI credentials: dashboard, boxes and transactions work fully, and the chat gracefully asks for an Azure OpenAI key.
To enable the agent, deploy `gpt-4o`, `gpt-4o-mini` and `gpt-4o-transcribe` in Microsoft Foundry and set:
```env
AZURE_RESOURCE_NAME=your-resource
AZURE_API_KEY=your-key
```

WhatsApp is optional and plugs in with `EVOLUTION_*` vars. The MCP server and the Foundry integration have their own env (`AGENT_TOOL_INTERNAL_KEY`, `FOUNDRY_DEMO_USER_ID`, `MCP_AUTH_TOKEN`); see `.env.example` and `apps/mcp-server/README.md`.
Each app deploys independently on Coolify from its own Dockerfile (`apps/*/Dockerfile`), auto-deployed on push to `main`:

- API → https://api.mayordomoai.xyz
- MCP server → https://mcp.mayordomoai.xyz (Foundry endpoint: `…/mcp`)
📹 (link — ≤ 5 min)
Joao Souza — Microsoft Learn: (username)