Train an RL agent to race an "unsolvable" SNES game, finishing within about 6% of the human world record.
Trained PPO agent racing Mute City I — Fire Stingray, 5 laps
F-Zero (SNES, 1991) was labeled an "unsolved challenge" for reinforcement learning in a 2016 academic paper. The reason: the game provides zero intermediate score during racing. The only feedback is a final time after a 5-lap race — roughly 6,000 decision steps of pure exploration before any reward signal. Standard DQN and A3C agents could not even finish a single race.
This project solves F-Zero by reading 58 RAM addresses directly from the SNES emulator to construct dense, per-step rewards from track progress. Combined with a dual-input neural architecture adapted from Linesight (the Trackmania AI that beat 10/12 world records), the agent learns to race competitively in ~10 hours of training.
| Metric | This Project | Human World Record | 2016 Paper |
|---|---|---|---|
| Race Time (5 laps) | 125.22s | 117.96s | Did not finish |
| Gap to WR | ~6% | — | — |
| Laps Completed | 5/5 | 5/5 | 0–2/5 |
| Training Time | ~10 hours | Years of practice | — |
Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016) reported F-Zero as unsolved — standard DQN/A3C could not complete races due to the extreme sparse reward problem.
Agent behavior at convergence:
- Consistently completes 5-lap races on Mute City I
- Uses boost (Super Jet) on 43.6% of steps
- Employs blast turning (rapid accel toggle) on 3% of steps
- Applies shoulder lean on 20% of corners
F-Zero gives the agent nothing to learn from: no score, no checkpoints, no progress bar. Just 5 laps of silence, then a final time. At 20 decisions/second, the agent must explore ~6,000 steps blindly before receiving a single reward. With a reward this sparse, standard algorithms such as DQN and A3C have essentially no signal to learn from.
Our solution: Read 58 RAM addresses from the SNES emulator to extract the car's real-time position, then project it onto a spline-interpolated track centerline to compute dense, per-step progress rewards.
58 checkpoint coordinates from RAM (blue dots) spline-interpolated into a smooth 660-point centerline. Reward = progress along this curve each step.
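The projection step can be sketched in plain Python. This is a minimal illustration, not the code in `rewards.py`: the source only says the centerline is "spline-interpolated", so the closed Catmull-Rom spline used here is a stand-in, the circular checkpoints are synthetic, and the names `catmull_rom_loop`, `nearest_index`, and `progress_delta` are hypothetical. Progress is measured in centerline indices for simplicity.

```python
import math

def catmull_rom_loop(points, n_samples):
    """Resample a closed loop of 2D control points into n_samples points."""
    n = len(points)
    samples = []
    for i in range(n_samples):
        u = i * n / n_samples      # position measured in control-point units
        k = int(u)                 # control point that starts this segment
        t = u - k                  # local parameter in [0, 1)
        # Four neighbours, wrapping around because the track is a closed loop.
        p0, p1, p2, p3 = (points[(k + j - 1) % n] for j in range(4))
        samples.append(tuple(
            0.5 * (2 * b + (c - a) * t
                   + (2 * a - 5 * b + 4 * c - d) * t ** 2
                   + (3 * b - a - 3 * c + d) * t ** 3)
            for a, b, c, d in zip(p0, p1, p2, p3)
        ))
    return samples

def nearest_index(centerline, pos):
    """Index of the centerline point closest to the car position."""
    return min(range(len(centerline)),
               key=lambda i: math.dist(centerline[i], pos))

def progress_delta(prev_idx, curr_idx, n_points):
    """Signed index change, wrapped so crossing the start line is not a jump."""
    d = (curr_idx - prev_idx) % n_points
    return d - n_points if d > n_points // 2 else d

# Synthetic stand-in for the 58 checkpoints: a circle of radius 100.
checkpoints = [(100 * math.cos(2 * math.pi * i / 58),
                100 * math.sin(2 * math.pi * i / 58)) for i in range(58)]
centerline = catmull_rom_loop(checkpoints, 660)

prev_idx = nearest_index(centerline, (100.0, 0.0))
curr_idx = nearest_index(centerline, (99.0, 6.0))
delta = progress_delta(prev_idx, curr_idx, len(centerline))
```

The wrap-around in `progress_delta` matters at the start/finish line: moving from index 658 to index 2 on a 660-point loop counts as +4, not -656.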
Reward per step:
$$r = \frac{\Delta}{\mathrm{ref}} + 0.5\left(\frac{\Delta}{\mathrm{ref}}\right)^2$$

where Δ (`delta`) is the track progress per step and `ref` is the average `delta` at a 135 s race pace.
The linear term provides learning gradient even at low speeds (critical for cold start). The quadratic term creates accelerating gradient at higher speeds, breaking the plateau where simpler rewards stall.
Linear+quadratic shaping provides increasing gradient at higher speeds, breaking the speed plateau that pure linear rewards create.
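As a quick illustration of the formula, here is a minimal sketch of the shaped reward and its slope. The function names are hypothetical and the numbers are made up for the example; only the formula itself comes from the text above.

```python
def shaped_reward(delta, ref):
    """Linear + quadratic reward on normalized progress x = delta / ref."""
    x = delta / ref
    return x + 0.5 * x * x

def reward_slope(delta, ref):
    """Derivative of the reward with respect to delta: (1 + x) / ref."""
    return (1.0 + delta / ref) / ref

ref = 2.0  # illustrative reference progress per step (assumed value)
half_pace = shaped_reward(1.0, ref)    # x = 0.5
ref_pace = shaped_reward(2.0, ref)     # x = 1.0
double_pace = shaped_reward(4.0, ref)  # x = 2.0
```

The slope is never below `1 / ref` for forward motion, which is the cold-start gradient, and it grows linearly with speed, so each extra unit of progress is worth more the faster the car already goes. For negative `delta` the squared term stays positive; the source does not say how reverse motion is handled, so the sketch leaves that case as is.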
The policy network uses a dual-input architecture adapted from Linesight: a CNN processes game frames while an MLP processes RAM-extracted features, fused into a shared representation for PPO's policy and value heads.
graph LR
subgraph Observation
A["Game Screen<br/>4 × 84 × 96 grayscale"]
B["RAM Features<br/>speed, energy,<br/>track preview (59-dim)"]
end
subgraph "Dual-Input Feature Extractor (~2-3M params)"
A --> CNN["CNN<br/>4→16→32→64→32<br/>LeakyReLU"]
B --> MLP["MLP<br/>59→128→128<br/>LeakyReLU"]
CNN -->|"1792"| Concat["Concat"]
MLP -->|"128"| Concat
Concat --> Fusion["Linear 1920→1024<br/>+ LeakyReLU"]
end
subgraph "PPO Output Heads"
Fusion --> Policy["Policy Head<br/>MultiDiscrete(3,3,2,2,2)"]
Fusion --> Value["Value Head<br/>scalar V(s)"]
end
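The dimensions in the diagram can be sanity-checked with a little arithmetic. The sketch below assumes unpadded convolutions with kernel/stride pairs (4, 2), (4, 2), (3, 2), (3, 1); the source does not state these, they are simply one choice that reproduces the 1792-dimensional CNN output from an 84 × 96 input. Channel counts and layer widths are taken from the diagram; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

CHANNELS = [4, 16, 32, 64, 32]                      # from the diagram
KERNELS_STRIDES = [(4, 2), (4, 2), (3, 2), (3, 1)]  # ASSUMED, not from the source

def cnn_flat_dim(height=84, width=96):
    """Flattened CNN output size for a height x width input."""
    h, w = height, width
    for kernel, stride in KERNELS_STRIDES:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return CHANNELS[-1] * h * w

def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel conv layer."""
    return c_in * c_out * kernel * kernel + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

flat = cnn_flat_dim()
fused_in = flat + 128  # CNN features concatenated with the 128-dim MLP output

total_params = (
    sum(conv_params(c_in, c_out, k)
        for c_in, c_out, (k, _) in zip(CHANNELS, CHANNELS[1:], KERNELS_STRIDES))
    + linear_params(59, 128) + linear_params(128, 128)  # MLP branch
    + linear_params(fused_in, 1024)                     # fusion layer
)
```

Under these assumptions the feature extractor comes to roughly 2.0M parameters, almost all of them in the 1920→1024 fusion layer, which is consistent with the "~2-3M params" label once the policy and value heads are added.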
Action space — MultiDiscrete([3, 3, 2, 2, 2]), 5 independent dimensions learned simultaneously:
| Dimension | Values | SNES Button |
|---|---|---|
| Steer | straight / left / right | D-pad |
| Shoulder | none / L-lean / R-lean | L / R |
| Accelerate | hold / release | B (enables blast turning) |
| Brake | no / yes | Y |
| Boost | no / yes | A (Super Jet) |
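A mapping from a MultiDiscrete action to SNES buttons might look like the sketch below. It assumes each index follows the order of the "Values" column (so `0` for Accelerate means "hold B"); the actual ordering in `actions.py` may differ, and `decode_action` is a hypothetical name.

```python
from itertools import product

NVEC = (3, 3, 2, 2, 2)             # sizes of the five action dimensions
STEER = (None, "LEFT", "RIGHT")    # straight / left / right
SHOULDER = (None, "L", "R")        # none / L-lean / R-lean

def decode_action(action):
    """Translate a 5-element MultiDiscrete action into a set of pressed buttons."""
    steer, shoulder, accel, brake, boost = action
    pressed = set()
    if STEER[steer]:
        pressed.add(STEER[steer])
    if SHOULDER[shoulder]:
        pressed.add(SHOULDER[shoulder])
    if accel == 0:                 # assumed: index 0 = hold the accelerator
        pressed.add("B")
    if brake == 1:
        pressed.add("Y")
    if boost == 1:
        pressed.add("A")
    return pressed

# Every joint action the policy can emit.
all_actions = list(product(*(range(n) for n in NVEC)))

cruise = decode_action((0, 0, 0, 0, 0))
hard_left = decode_action((1, 2, 1, 1, 1))
```

Factoring the controls into five independent dimensions means the policy outputs 3 + 3 + 2 + 2 + 2 = 12 logits instead of one logit for each of the 72 joint button combinations.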
graph TB
subgraph "Data Collection"
E["80× SNES Emulators<br/>(SubprocVecEnv)"] -->|"~900 fps"| R["Rollout Buffer<br/>512 steps × 80 envs"]
end
subgraph "Policy Update"
R --> PPO["PPO<br/>4 epochs, batch 2048<br/>clip=0.2, γ=0.99"]
PPO -->|"Updated weights"| E
end
subgraph "Per-Step Reward Shaping"
RAM["58 RAM Addresses"] --> Proj["Project onto<br/>660-point spline<br/>centerline"]
Proj --> Delta["Δ = track progress"]
Delta --> Reward["r = Δ/ref + 0.5·(Δ/ref)²"]
end
E -.->|"reads RAM each step"| RAM
git clone https://github.com/<your-username>/RL-Gaming.git && cd RL-Gaming
pip install -r requirements.txt
# Place your F-Zero ROM
mkdir -p roms && cp /path/to/F-Zero\ \(USA\).sfc roms/
# Train with PPO (80 parallel SNES emulators, ~900 fps)
python -m training.train --algo ppo --timesteps 50000000
# Evaluate a trained model
python -m evaluation.evaluate --model models/fzero_ppo_final.zip --episodes 10
# Other algorithms
python -m training.train --algo dqn --n-envs 4
python -m training.train --algo iqn --n-envs 80

Requirements: Python 3.10–3.12, F-Zero (USA) ROM (.sfc), GPU optional (~200 MB VRAM)
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Learning rate | 3e-4 (adaptive) | Parallel envs | 80 |
| Batch size | 2048 | Frameskip | 3 (20 Hz) |
| PPO epochs | 4 | Frame stack | 4 |
| γ (gamma) | 0.99 | Entropy coeff | 0.01 |
| GAE λ | 0.95 | Total timesteps | 50M (~10h) |
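For reference, the table can be written as a plain dictionary using the keyword names of the Stable-Baselines3 `PPO` constructor. Whether `config.py` uses these exact names is an assumption, the rollout length of 512 and clip range of 0.2 come from the pipeline diagram, and the "adaptive" learning-rate schedule is not modelled here. The derived quantities below are simple arithmetic on those values.

```python
# Hyperparameters from the table, keyed by Stable-Baselines3 PPO argument names.
PPO_KWARGS = dict(
    learning_rate=3e-4,   # the adaptive schedule is not reproduced in this sketch
    n_steps=512,          # rollout length per environment
    batch_size=2048,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
N_ENVS = 80
TOTAL_TIMESTEPS = 50_000_000

# Samples collected before each policy update.
rollout_size = PPO_KWARGS["n_steps"] * N_ENVS
# Minibatches per pass over the rollout buffer.
minibatches_per_epoch = rollout_size // PPO_KWARGS["batch_size"]
# Gradient steps taken per policy update.
grad_steps_per_update = minibatches_per_epoch * PPO_KWARGS["n_epochs"]
# Approximate number of policy updates over the whole run.
n_updates = TOTAL_TIMESTEPS / rollout_size
```

Each update therefore consumes 40,960 transitions in 20 minibatches per epoch, and a 50M-step run performs on the order of 1,200 updates.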
γ = 0.99 gives an effective planning horizon of about 5 s (roughly 1/(1 − γ) = 100 decision steps at 20 Hz), enough for corner anticipation while keeping the value-function scale manageable.
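The horizon figure can be checked with a few lines of arithmetic. The sketch assumes the emulator runs at about 60 frames per second, which with a frameskip of 3 yields the 20 Hz decision rate listed in the table; the variable names are illustrative.

```python
import math

EMULATOR_FPS = 60   # assumed frame rate of the emulated console
FRAMESKIP = 3
GAMMA = 0.99

decision_hz = EMULATOR_FPS / FRAMESKIP        # decisions per second
horizon_steps = 1.0 / (1.0 - GAMMA)           # effective horizon in steps
horizon_seconds = horizon_steps / decision_hz

# Number of steps after which a future reward is discounted to half its value.
half_life_steps = math.log(0.5) / math.log(GAMMA)
half_life_seconds = half_life_steps / decision_hz
```

A reward about 3.4 s ahead still counts for half its value, while anything much beyond 5 s contributes little, so the agent is pushed to plan for the next corner or two and not for the whole lap.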
The reward design balances cold-start gradient (linear term), high-speed acceleration (quadratic term), and training stability.
| Aspect | Linesight (Trackmania) | This Project (F-Zero SNES) |
|---|---|---|
| Algorithm | IQN (off-policy) | PPO (on-policy) |
| Action space | Discrete (keyboard inputs) | MultiDiscrete (digital SNES) |
| Reward | progress − fixed penalty | progress (linear + quadratic) |
| Throughput | ~20 fps (3D game) | ~900 fps (SNES emulation) |
| Result | Beat 10/12 world records | Within ~6% of WR (vs. "unsolvable") |
RL-Gaming/
├── env/ # Gymnasium environment (Stable-Retro wrapper)
│ ├── fzero_env.py # Main wrapper: dual obs, shaped reward, termination
│ ├── rewards.py # Dense reward via spline projection (660-pt centerline)
│ ├── observations.py # CNN frame processing + float feature builder
│ ├── actions.py # MultiDiscrete ↔ SNES button mapping
│ └── FZero-Snes/ # Stable-Retro integration (58 RAM addresses)
├── network/
│ ├── dual_input.py # CNN+MLP feature extractor (SB3 compatible)
│ └── iqn.py # IQN with branching dueling heads
├── training/
│ ├── train.py # CLI entrypoint (PPO / DQN / QR-DQN / IQN)
│ ├── config.py # All hyperparameters — single source of truth
│ ├── iqn_trainer.py # Custom IQN training loop
│ └── callbacks.py # Adaptive LR, best-model saving, W&B logging
├── evaluation/
│ ├── evaluate.py # Run model, collect race times & metrics
│ └── overlay.py # Real-time debug overlay on game frames
├── tests/ # Unit tests (rewards, actions, network, observations)
├── docs/ # Design docs, experiment log, analysis plots
├── scripts/ # Utilities (centerline gen, reward analysis, GIF recording)
└── models/ # Trained checkpoints
- Linesight — Architecture inspiration: dual-input CNN+MLP, IQN with branching dueling heads
- Stable-Retro — SNES emulation with Gymnasium API
- Stable-Baselines3 — PPO, DQN, and vectorized environment infrastructure
- SnesLab F-Zero RAM Map — RAM address documentation
- Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016) — labeled F-Zero unsolved for RL
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
- Dabney et al., "Implicit Quantile Networks for Distributional RL" (2018)
@misc{fzero-rl-2026,
author = {Zhou, Yincheng},
title = {F-Zero RL: Solving an "Unsolved" Racing Challenge with Reward Shaping},
year = {2026},
url = {https://github.com/<your-username>/RL-Gaming}
}

Apache 2.0 License © 2026 Zhou Yincheng
Built with reward engineering, 80 parallel SNES emulators, and a lot of wall collisions.