---
title: Memex AML Investigation Environment
emoji: 🧠
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
---
A POMDP environment where language models manage Virtual Memory, handle Interrupts, and self-update their Kernel — trained via outcome-grounded world feedback from a procedurally generated OS-agent benchmark.
Built for the Meta / Hugging Face OpenEnv Hackathon
Live Environment · Blog Post · Colab Notebook · Training Guide
```bash
# 1. Verify the environment is live
curl https://muaztpm-aml-investigation-env.hf.space/health
# → {"status": "healthy"}

# 2. Run a test episode
curl -X POST https://muaztpm-aml-investigation-env.hf.space/reset \
  -H "Content-Type: application/json" -d '{"task_id": "easy"}'

# 3. View the Glass Box Visualizer
# Visit: https://muaztpm-aml-investigation-env.hf.space/

# 4. Trained model checkpoint
# https://huggingface.co/MuazTPM/defender-model
```

Memex is an OpenEnv-compatible RL environment that tests whether an LLM can operate — not just answer questions. It layers three OS subsystems on top of an AML (Anti-Money Laundering) investigation task:
| OS Concept | What the Agent Must Do | Penalty for Failure |
|---|---|---|
| Virtual Memory | Save critical evidence to disk before it's evicted from the 2-slot context window | Page Fault (−0.05) |
| Interrupts | Launch async wire traces, continue working, collect results after ETA | Async Timeout (−0.10) |
| Kernel Updates | Search the compliance manual and inject relevant rules into its own system prompt | Missed compliance rules → wrong verdicts |
The agent has 18 tools across three categories, 3 AML typologies (structuring, layering, trade-based ML), 3 difficulty levels, and a procedural generator that creates unique scenarios on every `reset()`, making memorization impossible.
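The same HTTP contract shown in the quickstart can drive a full episode programmatically. Below is a minimal Python sketch of a reset/step loop against the live Space; the step payload and response fields are assumptions based on the OpenEnv-style interface, and the repo's client.py provides the real typed wrappers.

```python
import requests

BASE = "https://muaztpm-aml-investigation-env.hf.space"

# Start a new procedurally generated episode (same payload as the curl example).
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
print(obs)

# Take a few illustrative steps. The action shape below is an assumption;
# see models.py / client.py for the authoritative Pydantic types.
for tool in ["review_alert", "get_customer_profile", "query_transactions"]:
    result = requests.post(f"{BASE}/step", json={"action": {"tool": tool, "args": {}}}).json()
    print(result.get("reward"), result.get("done"))
```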
We train Qwen2.5-7B-Instruct (4-bit via Unsloth) using TRL's GRPOTrainer with 4 decomposed reward functions:
| Reward | What It Scores |
|---|---|
| R1 Format Compliance | Is the output valid JSON with a known tool name? |
| R2 Investigation Quality | Does the agent use diverse tools across categories? |
| R3 Environment Execution | Multi-step env.step() against a deterministically-seeded scenario |
| R4 OS Mechanics | Does the agent use disk writes, async traces, and kernel updates? |
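For reference, TRL's GRPOTrainer takes a list of reward callables, each scoring a batch of completions. Here is a minimal sketch of what R1 (format compliance) could look like; the tool subset and score values are illustrative, not the ones used in train_grpo.py.

```python
import json

# Illustrative subset of the 18 tools; not the registry used by train_grpo.py.
KNOWN_TOOLS = {"review_alert", "write_to_case_file", "request_wire_trace", "file_sar"}

def r1_format_compliance(completions, **kwargs):
    """R1: full credit iff the completion is valid JSON naming a known tool.

    Assumes the standard (non-conversational) TRL format, where each
    completion is a plain string.
    """
    rewards = []
    for completion in completions:
        try:
            call = json.loads(completion)
            rewards.append(1.0 if call.get("tool") in KNOWN_TOOLS else 0.2)
        except (json.JSONDecodeError, AttributeError):
            rewards.append(0.0)  # not JSON, or JSON that isn't an object
    return rewards

# Wired into training roughly as:
# trainer = GRPOTrainer(model=model, reward_funcs=[r1_format_compliance, r2, r3, r4], ...)
```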
```bash
# Dry-run (4 prompts, 1 epoch)
python train_grpo.py --dry-run

# Full training (v2 hyperparameters)
python train_grpo.py --num-prompts 250 --epochs 2 --lr 5e-6 --beta 0.04 \
  --output-dir checkpoints/defender-grpo-v2
```

See TRAINING.md for copy-paste Colab cells, full CLI reference, and hyperparameter details.
150 steps on an A100, 3h 44m. The agent went from producing random single-tool outputs to running full multi-step investigations with all three OS mechanics.
Total reward trending from ~0 → ~4.5. R3 (environment execution) shows the strongest learning signal.
Healthy cosine LR decay. KL divergence stays bounded. frac_reward_zero_std drops to 0, meaning every GRPO group has reward variance (a group whose completions all score identically has zero advantage and contributes no gradient).
Completion lengths grow from ~200 → ~800 tokens as the agent learns longer investigation chains.
| Metric | Step 0 | Step 150 |
|---|---|---|
| Total reward | ~0 | ~4.5 |
| R1 (format) | Mixed | 1.00 |
| R2 (investigation) | ~0.2 | 0.60 |
| R3 (env execution) | ~0 | 1.79 |
| R4 (OS mechanics) | 0.0 | 1.10 |
| Completion length | ~200 tok | ~800 tok |
| Behavior | Before Training | After Training |
|---|---|---|
| Memory management | References evicted data → page faults | Writes evidence to disk before eviction |
| Async handling | Retrieves prematurely → timeouts | Interleaves work while waiting |
| Kernel updates | Ignores compliance rules | Searches the manual, injects the relevant kernel mode |
| Investigation depth | 1-2 tool calls | 7-12 step investigation chains |
| Terminal decision | Always files SAR (lazy) | Correctly distinguishes TP vs TN |
| Domain Investigation (11) | OS Mechanic (5) | Terminal (2) |
|---|---|---|
| `review_alert` | `write_to_case_file` — Page to disk | `file_sar` |
| `get_customer_profile` | `request_wire_trace` — Async job | `close_alert` |
| `query_transactions` | `retrieve_async_result` — Fetch result | |
| `check_watchlist` | `search_compliance_manual` — Find rules | |
| `trace_network` | `update_system_prompt` — Kernel inject | |
| `check_source_of_funds` | | |
| `check_market_price` | | |
| `assess_risk` | | |
| `check_device_overlap` | | |
| `verify_customs_invoice` | | |
| `query_beneficial_ownership` | | |
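A trained agent's turn sequence typically exercises all three OS mechanics. The trace below is a hypothetical example of such a chain; the argument names are placeholders, and the real schema is the Action type in models.py.

```python
# Hypothetical episode trace (arguments are placeholders, not the real schema):
trace = [
    {"tool": "review_alert", "args": {}},
    {"tool": "request_wire_trace", "args": {"txn_id": "T-104"}},        # async job, result after ETA
    {"tool": "write_to_case_file", "args": {"note": "..."}},            # page evidence to disk pre-eviction
    {"tool": "search_compliance_manual", "args": {"query": "structuring"}},
    {"tool": "update_system_prompt", "args": {"mode": "structuring"}},  # kernel inject
    {"tool": "retrieve_async_result", "args": {"job_id": "J-1"}},       # collect after ETA
    {"tool": "file_sar", "args": {}},                                   # terminal decision
]
```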
Per-step (dense signal):
| Event | Reward |
|---|---|
| Action cost | −0.02 |
| Redundant call | −0.03 |
| Page fault | −0.05 |
| Async timeout | −0.10 |
| Disk write | +0.10 (cap 3/ep) |
| Kernel injection | +0.15 (cap 2/ep) |
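The caps matter: shaping bonuses stop paying out once the per-episode limit is hit. A hypothetical accounting snippet (the real logic lives in graders/grader.py):

```python
# Hypothetical tracker for the capped per-step bonuses above.
CAPS = {"disk_write": 3, "kernel_injection": 2}
BONUS = {"disk_write": 0.10, "kernel_injection": 0.15}

class ShapingTracker:
    def __init__(self):
        self.counts = {event: 0 for event in CAPS}

    def bonus(self, event: str) -> float:
        """Pay the bonus only while under the per-episode cap."""
        if self.counts[event] >= CAPS[event]:
            return 0.0  # capped: spamming disk writes earns nothing extra
        self.counts[event] += 1
        return BONUS[event]

tracker = ShapingTracker()
print([tracker.bonus("disk_write") for _ in range(4)])  # [0.1, 0.1, 0.1, 0.0]
```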
Terminal (composite):
| Component | Weight |
|---|---|
| Detection (TP/TN/FP/FN) | 1.0 |
| Entity F1 + Findings | 0.5 |
| Typology accuracy | 0.3 |
| Efficiency | 0.2 |
| OS mechanics | 0.2 |
Anti-gaming: 6 measures including hard caps, closed kernel modes, redundancy penalties, action costs, unique procedural IDs, and a formally proven "always SAR" trap (E[R_always_SAR] = 0.475 < E[R_reasonable] ≈ 0.68).
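As a worked illustration, the terminal composite is a weighted sum of the component scores above (maximum 2.2). The sketch below assumes each component is scored in [0, 1]; the actual scoring lives in graders/grader.py, and the example values are invented.

```python
# Weights from the terminal composite table above.
WEIGHTS = {"detection": 1.0, "entity_f1_findings": 0.5,
           "typology": 0.3, "efficiency": 0.2, "os_mechanics": 0.2}

def terminal_reward(components: dict) -> float:
    """Weighted sum of per-component scores in [0, 1]; max possible is 2.2."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Example: correct SAR on a true positive with decent findings (values invented).
print(terminal_reward({"detection": 1.0, "entity_f1_findings": 0.8,
                       "typology": 1.0, "efficiency": 0.5, "os_mechanics": 1.0}))
# → 2.0
```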
```bash
git clone https://github.com/razancodes/Meta-Pytorch-Hackathon.git
cd Meta-Pytorch-Hackathon
pip install -r requirements.txt

# Start server
uvicorn openenv_server:app --host 0.0.0.0 --port 8000

# Smoke tests (8/8)
python tests/test_smoke.py

# 1MDB demo
python demo_eval.py --dry-run

# Inference (any OpenAI-compatible LLM)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
python inference.py
```

```bash
# Docker
docker build -t memex . && docker run -p 7860:7860 memex

# HF Spaces
openenv push --ignore-file .hfignore
# → https://huggingface.co/spaces/MuazTPM/aml_investigation_env
```
```
├── openenv_server.py            # ★ OpenEnv FastAPI entrypoint
├── openenv.yaml                 # OpenEnv contract
├── models.py                    # Pydantic types (Action, Observation, State)
├── state_manager.py             # OS mechanics engine (RAM, Disk, Async, Kernel)
├── client.py                    # HTTP client (18 tool wrappers)
├── inference.py                 # ReAct inference agent
│
├── train_grpo.py                # ★ GRPO training (TRL + Unsloth)
├── self_play.py                 # Two-agent PPO self-play orchestrator
├── eval_harness.py              # Multi-typology evaluation harness
├── demo_eval.py                 # 1MDB demo + AGUI replay
│
├── server/
│   ├── aml_environment.py       # Core env (18 tools + OS mechanics)
│   ├── launderer_env.py         # Launderer single-step MDP
│   └── app.py                   # Standalone FastAPI server
├── scenarios/
│   ├── procedural_generator.py  # POMDP scenario builder
│   ├── adversary_agent.py       # Evasive scenario generator
│   ├── compliance_manual.py     # Searchable AML rule corpus
│   └── base.py                  # Scenario ABC
├── graders/
│   └── grader.py                # Dense reward engine
├── curriculum/
│   ├── plr_engine.py            # Prioritized Level Replay
│   └── oracle.py                # Proxy regret oracle
│
├── agent_os_core/               # ★ AgentOS-Kernel (production inference)
│   ├── src/lib.rs               # Rust Tokio runtime (PyO3 bindings)
│   ├── Cargo.toml               # Rust deps (pyo3, tokio, reqwest)
│   ├── agent_os.py              # Orchestrator (vLLM + Qwen2.5-72B-AWQ)
│   ├── memory_manager.py        # L1/L2 cognitive cache (Qwen2.5-1.5B)
│   ├── l3_index.py              # L3 LanceDB index (BGE embeddings)
│   ├── test_integration.py      # 6-test end-to-end suite
│   ├── test_memory.py           # 5-test L1/L2 suite
│   ├── test_l3.py               # 5-test L3 suite
│   └── test_runtime.py          # Rust runtime unit tests
│
├── frontend/                    # Next.js Glass Box Visualizer
├── assets/                      # WandB training curve screenshots
├── archive/                     # Legacy scripts (PPO, DPO, hotswap, validators)
├── tests/
│   ├── test_smoke.py            # 8 end-to-end tests
│   └── test_plr.py              # PLR engine unit tests
├── Dockerfile                   # HF Spaces deployment
├── requirements.txt             # Runtime dependencies
└── .hfignore                    # HF push exclusions
```
Production inference middleware for long-context agentic reasoning on a bare-metal GPU (A100 80GB). It solves context starvation — the "Lost in the Middle" problem, where evidence buried mid-context falls into the attention dead zone.
| Component | Model | VRAM | Purpose |
|---|---|---|---|
| Reasoning Engine | Qwen2.5-72B-Instruct-AWQ (vLLM) | ~38 GB | JSON-constrained tool call generation |
| Compaction Engine | Qwen2.5-1.5B-Instruct | ~3 GB | L1→L2 structured fact extraction |
| Embedder | BAAI/bge-base-en-v1.5 | ~0.4 GB | L3 LanceDB vector indexing |
| Reranker | BAAI/bge-reranker-v2-m3 | ~1.1 GB | Cross-encoder relevance gating |
| Tool Runtime | Rust/Tokio via PyO3 | 0 | GIL-bypass concurrent tool execution |
- L1 (6K tokens) — Sliding window of raw conversation turns
- L2 (2K tokens) — Structured scratchpad, compacted by LLM — injected at prompt start (high attention)
- L3 (unbounded) — LanceDB vector archive — gated retrieval injected at prompt end (high attention)
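A rough sketch of how these tiers might be stitched into one prompt, with L2 and L3 pinned to the high-attention ends; the real assembly logic lives in memory_manager.py, and `build_prompt` below is purely illustrative.

```python
def build_prompt(l2_scratchpad: str, l1_turns: list[str], l3_hits: list[str]) -> str:
    """Illustrative tier layout: L2 first, L1 window in the middle, L3 last.

    Attention is strongest at the start and end of the context, so the
    compacted facts (L2) and gated retrievals (L3) bracket the raw
    sliding window (L1), keeping evidence out of the mid-context dead zone.
    """
    return "\n\n".join([
        "## Case scratchpad (L2, ~2K tokens)\n" + l2_scratchpad,
        "## Recent turns (L1, ~6K-token sliding window)\n" + "\n".join(l1_turns),
        "## Retrieved archive facts (L3, reranker-gated)\n" + "\n".join(l3_hits),
    ])
```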
```bash
cd agent_os_core
python -m venv .venv && source .venv/bin/activate
pip install tiktoken lancedb numpy pyarrow maturin
maturin develop --release   # Build Rust runtime
python test_integration.py  # 16/16 tests pass (mock mode, no GPU)
```

- BLOG.md — Deep-dive: how we built the OS-agent concept, debugging zero-gradient GRPO, anti-gaming reward design, and the 1MDB demo walkthrough
- TRAINING.md — Copy-paste Colab cells, full CLI reference, hyperparameter tables, WandB monitoring guide
MIT
