---
title: Memex AML Investigation Environment
emoji: 🧠
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
---
A POMDP environment where language models manage Virtual Memory, handle Interrupts, and self-update their Kernel — trained via outcome-grounded world feedback from a procedurally generated OS-agent benchmark.
Built for the Meta / Hugging Face OpenEnv Hackathon
Live Environment · Blog Post · Colab Notebook · Training Guide
```bash
# 1. Verify the environment is live
curl https://muaztpm-aml-investigation-env.hf.space/health
# → {"status": "healthy"}

# 2. Run a test episode
curl -X POST https://muaztpm-aml-investigation-env.hf.space/reset \
  -H "Content-Type: application/json" -d '{"task_id": "easy"}'

# 3. View the Glass Box Visualizer
# Visit: https://muaztpm-aml-investigation-env.hf.space/

# 4. Trained model checkpoint
# https://huggingface.co/MuazTPM/defender-model
```

Memex is an OpenEnv-compatible RL environment that tests whether an LLM can operate — not just answer questions. It layers three OS subsystems on top of an AML (Anti-Money Laundering) investigation task:
| OS Concept | What the Agent Must Do | Penalty for Failure |
|---|---|---|
| Virtual Memory | Save critical evidence to disk before it's evicted from the 2-slot context window | Page Fault (−0.05) |
| Interrupts | Launch async wire traces, continue working, collect results after ETA | Async Timeout (−0.10) |
| Kernel Updates | Search the compliance manual and inject relevant rules into its own system prompt | Missed compliance rules → wrong verdicts |
The agent has 18 tools across three categories, 3 AML typologies (structuring, layering, trade-based ML), 3 difficulty levels, and a procedural generator that creates unique scenarios on every `reset()`, making memorization impossible.
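The same HTTP contract shown in the quickstart can drive a full episode programmatically. Below is a minimal Python sketch of a reset/step loop against the live Space; the step payload and response fields are assumptions based on the OpenEnv-style interface, and the repo's client.py provides the real typed wrappers.

```python
import requests

BASE = "https://muaztpm-aml-investigation-env.hf.space"

# Start a new procedurally generated episode (same payload as the curl example).
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
print(obs)

# Take a few illustrative steps. The action shape below is an assumption;
# see models.py / client.py for the authoritative Pydantic types.
for tool in ["review_alert", "get_customer_profile", "query_transactions"]:
    result = requests.post(f"{BASE}/step", json={"action": {"tool": tool, "args": {}}}).json()
    print(result.get("reward"), result.get("done"))
```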
We train Qwen2.5-7B-Instruct (4-bit via Unsloth) using TRL's GRPOTrainer with 4 decomposed reward functions:
| Reward | What It Scores |
|---|---|
| R1 Format Compliance | Is the output valid JSON with a known tool name? |
| R2 Investigation Quality | Does the agent use diverse tools across categories? |
| R3 Environment Execution | Multi-step env.step() against a deterministically-seeded scenario |
| R4 OS Mechanics | Does the agent use disk writes, async traces, and kernel updates? |
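For reference, TRL's GRPOTrainer takes a list of reward callables, each scoring a batch of completions. Here is a minimal sketch of what R1 (format compliance) could look like; the tool subset and score values are illustrative, not the ones used in train_grpo.py.

```python
import json

# Illustrative subset of the 18 tools; not the registry used by train_grpo.py.
KNOWN_TOOLS = {"review_alert", "write_to_case_file", "request_wire_trace", "file_sar"}

def r1_format_compliance(completions, **kwargs):
    """R1: full credit iff the completion is valid JSON naming a known tool.

    Assumes the standard (non-conversational) TRL format, where each
    completion is a plain string.
    """
    rewards = []
    for completion in completions:
        try:
            call = json.loads(completion)
            rewards.append(1.0 if call.get("tool") in KNOWN_TOOLS else 0.2)
        except (json.JSONDecodeError, AttributeError):
            rewards.append(0.0)  # not JSON, or JSON that isn't an object
    return rewards

# Wired into training roughly as:
# trainer = GRPOTrainer(model=model, reward_funcs=[r1_format_compliance, r2, r3, r4], ...)
```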
```bash
# Dry-run (4 prompts, 1 epoch)
python train_grpo.py --dry-run

# Full training (v2 hyperparameters)
python train_grpo.py --num-prompts 250 --epochs 2 --lr 5e-6 --beta 0.04 \
  --output-dir checkpoints/defender-grpo-v2
```

See TRAINING.md for copy-paste Colab cells, full CLI reference, and hyperparameter details.
150 steps on an A100, 3h 44m. The agent went from producing random single-tool outputs to running full multi-step investigations with all three OS mechanics.
Total reward trending from ~0 → ~4.5. R3 (environment execution) shows the strongest learning signal.
Healthy cosine LR decay. KL divergence stays bounded. frac_reward_zero_std drops to 0, meaning every GRPO group has reward variance (a group whose completions all score identically has zero advantage and contributes no gradient).
Completion lengths grow from ~200 → ~800 tokens as the agent learns longer investigation chains.
| Metric | Step 0 | Step 150 |
|---|---|---|
| Total reward | ~0 | ~4.5 |
| R1 (format) | Mixed | 1.00 |
| R2 (investigation) | ~0.2 | 0.60 |
| R3 (env execution) | ~0 | 1.79 |
| R4 (OS mechanics) | 0.0 | 1.10 |
| Completion length | ~200 tok | ~800 tok |
| Behavior | Before Training | After Training |
|---|---|---|
| Memory management | References evicted data → page faults | Writes evidence to disk before eviction |
| Async handling | Retrieves prematurely → timeouts | Interleaves work while waiting |
| Kernel updates | Ignores compliance rules | Searches the manual, injects the relevant kernel mode |
| Investigation depth | 1-2 tool calls | 7-12 step investigation chains |
| Terminal decision | Always files SAR (lazy) | Correctly distinguishes TP vs TN |
| Domain Investigation (11) | OS Mechanic (5) | Terminal (2) |
|---|---|---|
| `review_alert` | `write_to_case_file` — Page to disk | `file_sar` |
| `get_customer_profile` | `request_wire_trace` — Async job | `close_alert` |
| `query_transactions` | `retrieve_async_result` — Fetch result | |
| `check_watchlist` | `search_compliance_manual` — Find rules | |
| `trace_network` | `update_system_prompt` — Kernel inject | |
| `check_source_of_funds` | | |
| `check_market_price` | | |
| `assess_risk` | | |
| `check_device_overlap` | | |
| `verify_customs_invoice` | | |
| `query_beneficial_ownership` | | |
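A trained agent's turn sequence typically exercises all three OS mechanics. The trace below is a hypothetical example of such a chain; the argument names are placeholders, and the real schema is the Action type in models.py.

```python
# Hypothetical episode trace (arguments are placeholders, not the real schema):
trace = [
    {"tool": "review_alert", "args": {}},
    {"tool": "request_wire_trace", "args": {"txn_id": "T-104"}},        # async job, result after ETA
    {"tool": "write_to_case_file", "args": {"note": "..."}},            # page evidence to disk pre-eviction
    {"tool": "search_compliance_manual", "args": {"query": "structuring"}},
    {"tool": "update_system_prompt", "args": {"mode": "structuring"}},  # kernel inject
    {"tool": "retrieve_async_result", "args": {"job_id": "J-1"}},       # collect after ETA
    {"tool": "file_sar", "args": {}},                                   # terminal decision
]
```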
Per-step (dense signal):
| Event | Reward |
|---|---|
| Action cost | −0.02 |
| Redundant call | −0.03 |
| Page fault | −0.05 |
| Async timeout | −0.10 |
| Disk write | +0.10 (cap 3/ep) |
| Kernel injection | +0.15 (cap 2/ep) |
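The caps matter: shaping bonuses stop paying out once the per-episode limit is hit. A hypothetical accounting snippet (the real logic lives in graders/grader.py):

```python
# Hypothetical tracker for the capped per-step bonuses above.
CAPS = {"disk_write": 3, "kernel_injection": 2}
BONUS = {"disk_write": 0.10, "kernel_injection": 0.15}

class ShapingTracker:
    def __init__(self):
        self.counts = {event: 0 for event in CAPS}

    def bonus(self, event: str) -> float:
        """Pay the bonus only while under the per-episode cap."""
        if self.counts[event] >= CAPS[event]:
            return 0.0  # capped: spamming disk writes earns nothing extra
        self.counts[event] += 1
        return BONUS[event]

tracker = ShapingTracker()
print([tracker.bonus("disk_write") for _ in range(4)])  # [0.1, 0.1, 0.1, 0.0]
```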
Terminal (composite):
| Component | Weight |
|---|---|
| Detection (TP/TN/FP/FN) | 1.0 |
| Entity F1 + Findings | 0.5 |
| Typology accuracy | 0.3 |
| Efficiency | 0.2 |
| OS mechanics | 0.2 |
Anti-gaming: 6 measures including hard caps, closed kernel modes, redundancy penalties, action costs, unique procedural IDs, and a formally proven "always SAR" trap (E[R_always_SAR] = 0.475 < E[R_reasonable] ≈ 0.68).
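As a worked illustration, the terminal composite is a weighted sum of the component scores above (maximum 2.2). The sketch below assumes each component is scored in [0, 1]; the actual scoring lives in graders/grader.py, and the example values are invented.

```python
# Weights from the terminal composite table above.
WEIGHTS = {"detection": 1.0, "entity_f1_findings": 0.5,
           "typology": 0.3, "efficiency": 0.2, "os_mechanics": 0.2}

def terminal_reward(components: dict) -> float:
    """Weighted sum of per-component scores in [0, 1]; max possible is 2.2."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Example: correct SAR on a true positive with decent findings (values invented).
print(terminal_reward({"detection": 1.0, "entity_f1_findings": 0.8,
                       "typology": 1.0, "efficiency": 0.5, "os_mechanics": 1.0}))
# → 2.0
```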
```bash
git clone https://github.com/razancodes/Meta-Pytorch-Hackathon.git
cd Meta-Pytorch-Hackathon
pip install -r requirements.txt

# Start server
uvicorn openenv_server:app --host 0.0.0.0 --port 8000

# Smoke tests (8/8)
python tests/test_smoke.py

# 1MDB demo
python demo_eval.py --dry-run

# Inference (any OpenAI-compatible LLM)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
python inference.py
```

```bash
# Docker
docker build -t memex . && docker run -p 7860:7860 memex

# HF Spaces
openenv push --ignore-file .hfignore
# → https://huggingface.co/spaces/MuazTPM/aml_investigation_env
```
```
├── openenv_server.py            # ★ OpenEnv FastAPI entrypoint
├── openenv.yaml                 # OpenEnv contract
├── models.py                    # Pydantic types (Action, Observation, State)
├── state_manager.py             # OS mechanics engine (RAM, Disk, Async, Kernel)
├── client.py                    # HTTP client (18 tool wrappers)
├── inference.py                 # ReAct inference agent
│
├── train_grpo.py                # ★ GRPO training (TRL + Unsloth)
├── self_play.py                 # Two-agent PPO self-play orchestrator
├── eval_harness.py              # Multi-typology evaluation harness
├── demo_eval.py                 # 1MDB demo + AGUI replay
│
├── server/
│   ├── aml_environment.py       # Core env (18 tools + OS mechanics)
│   ├── launderer_env.py         # Launderer single-step MDP
│   └── app.py                   # Standalone FastAPI server
├── scenarios/
│   ├── procedural_generator.py  # POMDP scenario builder
│   ├── adversary_agent.py       # Evasive scenario generator
│   ├── compliance_manual.py     # Searchable AML rule corpus
│   └── base.py                  # Scenario ABC
├── graders/
│   └── grader.py                # Dense reward engine
├── curriculum/
│   ├── plr_engine.py            # Prioritized Level Replay
│   └── oracle.py                # Proxy regret oracle
│
├── agent_os_core/               # ★ AgentOS-Kernel (production inference)
│   ├── src/lib.rs               # Rust Tokio runtime (PyO3 bindings)
│   ├── Cargo.toml               # Rust deps (pyo3, tokio, reqwest)
│   ├── agent_os.py              # Orchestrator (vLLM + Qwen2.5-72B-AWQ)
│   ├── memory_manager.py        # L1/L2 cognitive cache (Qwen2.5-1.5B)
│   ├── l3_index.py              # L3 LanceDB index (BGE embeddings)
│   ├── test_integration.py      # 6-test end-to-end suite
│   ├── test_memory.py           # 5-test L1/L2 suite
│   ├── test_l3.py               # 5-test L3 suite
│   └── test_runtime.py          # Rust runtime unit tests
│
├── frontend/                    # Next.js Glass Box Visualizer
├── assets/                      # WandB training curve screenshots
├── archive/                     # Legacy scripts (PPO, DPO, hotswap, validators)
├── tests/
│   ├── test_smoke.py            # 8 end-to-end tests
│   └── test_plr.py              # PLR engine unit tests
├── Dockerfile                   # HF Spaces deployment
├── requirements.txt             # Runtime dependencies
└── .hfignore                    # HF push exclusions
```
Production inference middleware for long-context agentic reasoning on a bare-metal GPU (A100 80GB). It solves context starvation — the "Lost in the Middle" problem, where evidence buried mid-context falls into the attention dead zone.
| Component | Model | VRAM | Purpose |
|---|---|---|---|
| Reasoning Engine | Qwen2.5-72B-Instruct-AWQ (vLLM) | ~38 GB | JSON-constrained tool call generation |
| Compaction Engine | Qwen2.5-1.5B-Instruct | ~3 GB | L1→L2 structured fact extraction |
| Embedder | BAAI/bge-base-en-v1.5 | ~0.4 GB | L3 LanceDB vector indexing |
| Reranker | BAAI/bge-reranker-v2-m3 | ~1.1 GB | Cross-encoder relevance gating |
| Tool Runtime | Rust/Tokio via PyO3 | 0 | GIL-bypass concurrent tool execution |
- L1 (6K tokens) — Sliding window of raw conversation turns
- L2 (2K tokens) — Structured scratchpad, compacted by LLM — injected at prompt start (high attention)
- L3 (unbounded) — LanceDB vector archive — gated retrieval injected at prompt end (high attention)
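A rough sketch of how these tiers might be stitched into one prompt, with L2 and L3 pinned to the high-attention ends; the real assembly logic lives in memory_manager.py, and `build_prompt` below is purely illustrative.

```python
def build_prompt(l2_scratchpad: str, l1_turns: list[str], l3_hits: list[str]) -> str:
    """Illustrative tier layout: L2 first, L1 window in the middle, L3 last.

    Attention is strongest at the start and end of the context, so the
    compacted facts (L2) and gated retrievals (L3) bracket the raw
    sliding window (L1), keeping evidence out of the mid-context dead zone.
    """
    return "\n\n".join([
        "## Case scratchpad (L2, ~2K tokens)\n" + l2_scratchpad,
        "## Recent turns (L1, ~6K-token sliding window)\n" + "\n".join(l1_turns),
        "## Retrieved archive facts (L3, reranker-gated)\n" + "\n".join(l3_hits),
    ])
```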
```bash
cd agent_os_core
python -m venv .venv && source .venv/bin/activate
pip install tiktoken lancedb numpy pyarrow maturin
maturin develop --release   # Build Rust runtime
python test_integration.py  # 16/16 tests pass (mock mode, no GPU)
```

- BLOG.md — Deep-dive: how we built the OS-agent concept, debugging zero-gradient GRPO, anti-gaming reward design, and the 1MDB demo walkthrough
- TRAINING.md — Copy-paste Colab cells, full CLI reference, hyperparameter tables, WandB monitoring guide
MIT
