
---
title: Memex AML Investigation Environment
emoji: 🧠
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
---

# 🧠 Memex: World Feedback for Financial Crime Detection

**Outcome-Grounded Reward Signals for a Long-Horizon AML Investigation Agent**

A POMDP environment where language models manage Virtual Memory, handle Interrupts, and self-update their Kernel — trained via outcome-grounded world feedback from a procedurally generated OS-agent benchmark.

OpenEnv HF Space Smoke Tests Python 3.10+ License: MIT

Built for the Meta / Hugging Face OpenEnv Hackathon

Live Environment · Blog Post · Colab Notebook · Training Guide


## Quick Start

```bash
# 1. Verify the environment is live
curl https://muaztpm-aml-investigation-env.hf.space/health
# → {"status": "healthy"}

# 2. Run a test episode
curl -X POST https://muaztpm-aml-investigation-env.hf.space/reset \
  -H "Content-Type: application/json" -d '{"task_id": "easy"}'

# 3. View the Glass Box Visualizer
# Visit: https://muaztpm-aml-investigation-env.hf.space/

# 4. Trained model checkpoint
# https://huggingface.co/MuazTPM/defender-model
```

## What Is Memex?

Memex is an OpenEnv-compatible RL environment that tests whether an LLM can operate — not just answer questions. It layers three OS subsystems on top of an AML (Anti-Money Laundering) investigation task:

| OS Concept | What the Agent Must Do | Penalty for Failure |
|---|---|---|
| Virtual Memory | Save critical evidence to disk before it's evicted from the 2-slot context window | Page Fault (−0.05) |
| Interrupts | Launch async wire traces, continue working, collect results after ETA | Async Timeout (−0.10) |
| Kernel Updates | Search the compliance manual and inject relevant rules into its own system prompt | Missed compliance rules → wrong verdicts |

The agent has 18 tools across three categories, 3 AML typologies (structuring, layering, trade-based ML), 3 difficulty levels, and a procedural generator that creates unique scenarios on every reset() — making memorization impossible.
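In practice an episode is one `reset()` followed by repeated `step()` calls until the agent uses a terminal tool. A minimal sketch of that loop shape, with stubs standing in for the live HTTP environment (the field names and stub rewards below are illustrative, not the contract from `openenv.yaml`):

```python
# Stubs standing in for the live environment; real payload shapes come from
# the OpenEnv contract. Rewards mirror two entries of the per-step table:
# a −0.02 action cost and a +0.10 disk-write bonus.
def reset(task_id="easy"):
    return {"observation": "ALERT-001: structuring pattern flagged", "done": False}

def step(action):
    reward = -0.02                              # per-action cost
    if action["tool"] == "write_to_case_file":
        reward += 0.10                          # disk-write bonus (capped 3/episode)
    return {"reward": reward, "done": action["tool"] in {"file_sar", "close_alert"}}

obs, total = reset(), 0.0
for tool in ["review_alert", "write_to_case_file", "file_sar"]:
    result = step({"tool": tool, "args": {}})
    total += result["reward"]
    if result["done"]:
        break
```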


## Training

We fine-tune Qwen2.5-7B-Instruct (4-bit via Unsloth) using TRL's `GRPOTrainer` with 4 decomposed reward functions:

| Reward | What It Scores |
|---|---|
| R1 Format Compliance | Is the output valid JSON with a known tool name? |
| R2 Investigation Quality | Does the agent use diverse tools across categories? |
| R3 Environment Execution | Multi-step `env.step()` against a deterministically-seeded scenario |
| R4 OS Mechanics | Does the agent use disk writes, async traces, and kernel updates? |

```bash
# Dry-run (4 prompts, 1 epoch)
python train_grpo.py --dry-run

# Full training (v2 hyperparameters)
python train_grpo.py --num-prompts 250 --epochs 2 --lr 5e-6 --beta 0.04 \
    --output-dir checkpoints/defender-grpo-v2
```

See TRAINING.md for copy-paste Colab cells, full CLI reference, and hyperparameter details.
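For a sense of how the decomposed rewards plug into TRL, here is a sketch of R1 (format compliance) in the batched signature `GRPOTrainer` expects. The real implementation lives in `graders/grader.py` and `train_grpo.py`; the tool subset below is an illustrative slice of the 18-tool roster:

```python
import json

# Illustrative subset of the tool roster; the env defines the full list.
KNOWN_TOOLS = {"review_alert", "query_transactions", "write_to_case_file",
               "file_sar", "close_alert"}

def r1_format_compliance(completion: str) -> float:
    """1.0 if the completion is valid JSON naming a known tool, else 0.0."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if isinstance(action, dict) and action.get("tool") in KNOWN_TOOLS else 0.0

def r1_reward(completions, **kwargs):
    # GRPOTrainer calls each reward function with a batch of completions
    # and expects a list of floats back.
    return [r1_format_compliance(c) for c in completions]
```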


## Results

150 steps on an A100, 3h 44m. The agent went from producing random single-tool outputs to running full multi-step investigations with all three OS mechanics.

### Training Curves

*Figure: reward and environment-execution curves over 150 steps.* Total reward trends from ~0 → ~4.5; R3 (environment execution) shows the strongest learning signal.

*Figure: learning-rate decay, KL divergence, and gradient norm.* Healthy cosine LR decay; KL divergence stays bounded; `frac_reward_zero_std` drops to 0, so every GRPO group has reward variance.

*Figure: completion-length growth.* Completion lengths grow from ~200 → ~800 tokens as the agent learns longer investigation chains.

### Quantitative Improvement

| Metric | Step 0 | Step 150 |
|---|---|---|
| Total reward | ~0 | ~4.5 |
| R1 (format) | Mixed | 1.00 |
| R2 (investigation) | ~0.2 | 0.60 |
| R3 (env execution) | ~0 | 1.79 |
| R4 (OS mechanics) | 0.0 | 1.10 |
| Completion length | ~200 tok | ~800 tok |

### Behavioral Change

| Behavior | Before Training | After Training |
|---|---|---|
| Memory management | References evicted data → page faults | Writes evidence to disk before eviction |
| Async handling | Retrieves prematurely → timeouts | Interleaves work while waiting |
| Kernel updates | Ignores compliance rules | Searches the manual, injects the relevant mode |
| Investigation depth | 1–2 tool calls | 7–12-step investigation chains |
| Terminal decision | Always files SAR (lazy) | Correctly distinguishes TP vs. TN |

## Architecture

*Figure: Memex system architecture.*


## Tool Roster (18 Tools)

| Domain Investigation (11) | OS Mechanic (5) | Terminal (2) |
|---|---|---|
| review_alert | write_to_case_file (page to disk) | file_sar |
| get_customer_profile | request_wire_trace (async job) | close_alert |
| query_transactions | retrieve_async_result (fetch result) | |
| check_watchlist | search_compliance_manual (find rules) | |
| trace_network | update_system_prompt (kernel inject) | |
| check_source_of_funds | | |
| check_market_price | | |
| assess_risk | | |
| check_device_overlap | | |
| verify_customs_invoice | | |
| query_beneficial_ownership | | |
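The interrupt tools are meant to be interleaved: launch the trace, keep investigating, then collect after the ETA (retrieving prematurely is what the async-timeout penalty punishes). A hypothetical action sequence, with illustrative field names rather than the exact schema from `models.py`:

```json
[
  {"tool": "request_wire_trace",    "args": {"transaction_id": "TXN-0042"}},
  {"tool": "query_transactions",    "args": {"customer_id": "CUST-7"}},
  {"tool": "retrieve_async_result", "args": {"job_id": "JOB-1"}}
]
```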

## Reward Design

**Per-step (dense signal):**

| Event | Reward |
|---|---|
| Action cost | −0.02 |
| Redundant call | −0.03 |
| Page fault | −0.05 |
| Async timeout | −0.10 |
| Disk write | +0.10 (cap 3/ep) |
| Kernel injection | +0.15 (cap 2/ep) |

**Terminal (composite):**

| Component | Weight |
|---|---|
| Detection (TP/TN/FP/FN) | 1.0 |
| Entity F1 + Findings | 0.5 |
| Typology accuracy | 0.3 |
| Efficiency | 0.2 |
| OS mechanics | 0.2 |

Anti-gaming: 6 measures including hard caps, closed kernel modes, redundancy penalties, action costs, unique procedural IDs, and a formally proven "always SAR" trap (E[R_always_SAR] = 0.475 < E[R_reasonable] ≈ 0.68).
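Numerically, the terminal composite is a weighted sum of component scores. A sketch under the weights listed above (the component scoring itself lives in `graders/grader.py`; the score values here are hypothetical):

```python
# Weights from the terminal-reward table; component scores assumed in [0, 1].
WEIGHTS = {
    "detection":    1.0,  # TP/TN/FP/FN outcome
    "entity_f1":    0.5,  # entity F1 + findings
    "typology":     0.3,
    "efficiency":   0.2,
    "os_mechanics": 0.2,
}

def terminal_reward(scores: dict) -> float:
    """Weighted sum over component scores; a perfect episode totals 2.2."""
    return sum(w * scores.get(k, 0.0) for k, w in WEIGHTS.items())

perfect = terminal_reward({k: 1.0 for k in WEIGHTS})  # 2.2
```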


## Local Development

```bash
git clone https://github.com/razancodes/Meta-Pytorch-Hackathon.git
cd Meta-Pytorch-Hackathon
pip install -r requirements.txt

# Start server
uvicorn openenv_server:app --host 0.0.0.0 --port 8000

# Smoke tests (8/8)
python tests/test_smoke.py

# 1MDB demo
python demo_eval.py --dry-run

# Inference (any OpenAI-compatible LLM)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
python inference.py
```

## Deployment

```bash
# Docker
docker build -t memex . && docker run -p 7860:7860 memex

# HF Spaces
openenv push --ignore-file .hfignore
# → https://huggingface.co/spaces/MuazTPM/aml_investigation_env
```

## Project Structure

```text
.
├── openenv_server.py            # ★ OpenEnv FastAPI entrypoint
├── openenv.yaml                 # OpenEnv contract
├── models.py                    # Pydantic types (Action, Observation, State)
├── state_manager.py             # OS mechanics engine (RAM, Disk, Async, Kernel)
├── client.py                    # HTTP client (18 tool wrappers)
├── inference.py                 # ReAct inference agent
│
├── train_grpo.py                # ★ GRPO training (TRL + Unsloth)
├── self_play.py                 # Two-agent PPO self-play orchestrator
├── eval_harness.py              # Multi-typology evaluation harness
├── demo_eval.py                 # 1MDB demo + AGUI replay
│
├── server/
│   ├── aml_environment.py       # Core env (18 tools + OS mechanics)
│   ├── launderer_env.py         # Launderer single-step MDP
│   └── app.py                   # Standalone FastAPI server
├── scenarios/
│   ├── procedural_generator.py  # POMDP scenario builder
│   ├── adversary_agent.py       # Evasive scenario generator
│   ├── compliance_manual.py     # Searchable AML rule corpus
│   └── base.py                  # Scenario ABC
├── graders/
│   └── grader.py                # Dense reward engine
├── curriculum/
│   ├── plr_engine.py            # Prioritized Level Replay
│   └── oracle.py                # Proxy regret oracle
│
├── agent_os_core/               # ★ AgentOS-Kernel (production inference)
│   ├── src/lib.rs               #   Rust Tokio runtime (PyO3 bindings)
│   ├── Cargo.toml               #   Rust deps (pyo3, tokio, reqwest)
│   ├── agent_os.py              #   Orchestrator (vLLM + Qwen2.5-72B-AWQ)
│   ├── memory_manager.py        #   L1/L2 cognitive cache (Qwen2.5-1.5B)
│   ├── l3_index.py              #   L3 LanceDB index (BGE embeddings)
│   ├── test_integration.py      #   6-test end-to-end suite
│   ├── test_memory.py           #   5-test L1/L2 suite
│   ├── test_l3.py               #   5-test L3 suite
│   └── test_runtime.py          #   Rust runtime unit tests
│
├── frontend/                    # Next.js Glass Box Visualizer
├── assets/                      # WandB training curve screenshots
├── archive/                     # Legacy scripts (PPO, DPO, hotswap, validators)
├── tests/
│   ├── test_smoke.py            # 8 end-to-end tests
│   └── test_plr.py              # PLR engine unit tests
├── Dockerfile                   # HF Spaces deployment
├── requirements.txt             # Runtime dependencies
└── .hfignore                    # HF push exclusions
```

## AgentOS-Kernel (`agent_os_core/`)

Production inference middleware for long-context agentic reasoning on a bare-metal GPU (A100 80GB). It tackles context starvation: the "Lost in the Middle" failure mode where evidence buried mid-prompt falls into the attention dead zone.

### Architecture

| Component | Model | VRAM | Purpose |
|---|---|---|---|
| Reasoning Engine | Qwen2.5-72B-Instruct-AWQ (vLLM) | ~38 GB | JSON-constrained tool-call generation |
| Compaction Engine | Qwen2.5-1.5B-Instruct | ~3 GB | L1→L2 structured fact extraction |
| Embedder | BAAI/bge-base-en-v1.5 | ~0.4 GB | L3 LanceDB vector indexing |
| Reranker | BAAI/bge-reranker-v2-m3 | ~1.1 GB | Cross-encoder relevance gating |
| Tool Runtime | Rust/Tokio via PyO3 | 0 | GIL-bypass concurrent tool execution |

### 3-Tier Cognitive Cache

- **L1 (6K tokens)** — Sliding window of raw conversation turns
- **L2 (2K tokens)** — Structured scratchpad, compacted by LLM, injected at prompt start (high attention)
- **L3 (unbounded)** — LanceDB vector archive with gated retrieval, injected at prompt end (high attention)
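The L1 eviction rule can be sketched as follows, with whitespace word counts standing in for the tiktoken counts the real `memory_manager.py` uses; the class and field names here are illustrative. Evicted turns are what the compaction engine distills into L2:

```python
from collections import deque

class L1Window:
    """Sliding window over raw turns under a token budget (6K in production)."""

    def __init__(self, budget=6000):
        self.budget = budget
        self.turns = deque()
        self.evicted = []          # handed off for L2 compaction

    @staticmethod
    def _tokens(text):
        return len(text.split())   # stand-in for a real tokenizer count

    def add(self, turn):
        self.turns.append(turn)
        # Evict oldest turns until the window fits the budget again.
        while sum(self._tokens(t) for t in self.turns) > self.budget:
            self.evicted.append(self.turns.popleft())
```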

### Quick Start

```bash
cd agent_os_core
python -m venv .venv && source .venv/bin/activate
pip install tiktoken lancedb numpy pyarrow maturin
maturin develop --release          # Build Rust runtime
python test_integration.py         # 16/16 tests pass (mock mode, no GPU)
```

## Further Reading

- **BLOG.md** — Deep-dive: how we built the OS-agent concept, debugging zero-gradient GRPO, anti-gaming reward design, and the 1MDB demo walkthrough
- **TRAINING.md** — Copy-paste Colab cells, full CLI reference, hyperparameter tables, WandB monitoring guide

## License

MIT
