F-Zero RL

Train an RL agent to race an "unsolvable" SNES game, finishing within 6% of the human world record.

Trained PPO agent racing Mute City I — Fire Stingray, 5 laps

Results • Key Insight • Architecture • Quick Start • Training Details

The Problem

F-Zero (SNES, 1991) was labeled an "unsolved challenge" for reinforcement learning in a 2016 academic paper. The reason: the game provides zero intermediate score during racing. The only feedback is a final time after a 5-lap race — roughly 6,000 decision steps of pure exploration before any reward signal. Standard DQN and A3C agents could not even finish a single race.

This project solves F-Zero by reading 58 RAM addresses directly from the SNES emulator to construct dense, per-step rewards from track progress. Combined with a dual-input neural architecture adapted from Linesight (the Trackmania AI that beat 10/12 world records), the agent learns to race competitively in ~10 hours of training.

Results

Metric	This Project	Human World Record	2016 Paper
Race Time (5 laps)	125.22s	117.96s	Did not finish
Gap to WR	~6%	—	—
Laps Completed	5/5	5/5	0–2/5
Training Time	~10 hours	Years of practice	—

Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016) reported F-Zero as unsolved — standard DQN/A3C could not complete races due to the extreme sparse reward problem.

Agent behavior at convergence:

Consistently completes 5-lap races on Mute City I
Uses boost (Super Jet) on 43.6% of steps
Employs blast turning (rapid accel toggle) on 3% of steps
Applies shoulder lean on 20% of corners

Key Insight: Reward Shaping

F-Zero gives the agent nothing to learn from: no score, no checkpoints, no progress bar. There are just 5 laps of silence, then a final time. At 20 decisions/second, the agent must explore ~6,000 steps blindly before receiving a single reward. Standard RL algorithms cannot learn from a signal this sparse without extra help.

Our solution: Read 58 RAM addresses from the SNES emulator to extract the car's real-time position, then project it onto a spline-interpolated track centerline to compute dense, per-step progress rewards.

58 checkpoint coordinates from RAM (blue dots) spline-interpolated into a smooth 660-point centerline. Reward = progress along this curve each step.
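The projection step can be sketched roughly as follows, assuming the centerline is available as a closed loop of (x, y) points. The function names and the brute-force nearest-point lookup are illustrative assumptions, not the contents of the repository's `rewards.py`, and a circle stands in for the real 660-point Mute City I centerline.

```python
import math

def nearest_index(centerline, x, y):
    """Index of the centerline point closest to the car position (x, y)."""
    return min(
        range(len(centerline)),
        key=lambda i: (centerline[i][0] - x) ** 2 + (centerline[i][1] - y) ** 2,
    )

def progress_delta(prev_idx, curr_idx, n_points):
    """Signed progress in centerline points, wrapped so that crossing the
    start/finish line does not register as a huge jump backwards."""
    d = (curr_idx - prev_idx) % n_points
    return d - n_points if d >= n_points / 2 else d

# Stand-in track: 660 points on a unit circle (hypothetical, for illustration only).
N = 660
track = [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N)) for i in range(N)]

# A car slightly off the centerline still projects onto the matching point.
angle = 2 * math.pi * 10 / N
idx = nearest_index(track, 1.1 * math.cos(angle), 1.1 * math.sin(angle))
```

The wrap-around in `progress_delta` matters on a closed circuit: moving from point 655 to point 5 is 10 points of forward progress, not 650 points backwards.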

Reward per step:
r = (delta / ref) + 0.5 * (delta / ref)^2
where delta = track progress per step, ref = average delta at 135s race pace

The linear term provides learning gradient even at low speeds (critical for cold start). The quadratic term creates accelerating gradient at higher speeds, breaking the plateau where simpler rewards stall.

Linear+quadratic shaping provides increasing gradient at higher speeds, breaking the speed plateau that pure linear rewards create.
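As a minimal sketch, the formula above translates directly into code. The function names are illustrative, and `ref` is set to 1.0 only to keep the numbers readable; the real value is the average per-step progress at a 135 s race pace.

```python
def shaped_reward(delta, ref):
    """Linear + quadratic progress reward: r = x + 0.5 * x**2, with x = delta / ref."""
    x = delta / ref
    return x + 0.5 * x * x

def reward_slope(delta, ref):
    """dr/dx = 1 + x: the marginal reward for extra progress grows with speed."""
    return 1.0 + delta / ref

# Print reward and slope at a few normalized speeds (ref = 1.0 for illustration).
for x in (0.0, 0.5, 1.0, 1.5):
    print(f"x={x:.1f}  r={shaped_reward(x, 1.0):.3f}  slope={reward_slope(x, 1.0):.1f}")
```

At the reference pace (x = 1) the reward is 1.5 per step and the slope is 2, twice the slope at a standstill. Note that the formula as written stops decreasing for x < -1; how backward progress is handled is not stated in this README.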

Architecture

The policy network uses a dual-input architecture adapted from Linesight: a CNN processes game frames while an MLP processes RAM-extracted features, fused into a shared representation for PPO's policy and value heads.

graph LR
    subgraph Observation
        A["Game Screen<br/>4 × 84 × 96 grayscale"]
        B["RAM Features<br/>speed, energy,<br/>track preview (59-dim)"]
    end

    subgraph "Dual-Input Feature Extractor (~2-3M params)"
        A --> CNN["CNN<br/>4→16→32→64→32<br/>LeakyReLU"]
        B --> MLP["MLP<br/>59→128→128<br/>LeakyReLU"]
        CNN -->|"1792"| Concat["Concat"]
        MLP -->|"128"| Concat
        Concat --> Fusion["Linear 1920→1024<br/>+ LeakyReLU"]
    end

    subgraph "PPO Output Heads"
        Fusion --> Policy["Policy Head<br/>MultiDiscrete(3,3,2,2,2)"]
        Fusion --> Value["Value Head<br/>scalar V(s)"]
    end
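The 1792 and 1920 figures in the diagram can be sanity-checked with a short shape calculation. The channel counts come from the diagram, but the kernel sizes and strides below are assumptions: they are one unpadded configuration that happens to reproduce 1792 from an 84 × 96 input, not values confirmed by the repository.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

# (out_channels, kernel, stride) per layer. Channels come from the diagram;
# kernels and strides are ASSUMED, picked only because they reproduce 1792.
LAYERS = [(16, 4, 2), (32, 4, 2), (64, 3, 2), (32, 3, 1)]

def flattened_size(height, width, layers):
    """Number of features left after running every layer and flattening."""
    for _, kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return layers[-1][0] * height * width

cnn_features = flattened_size(84, 96, LAYERS)   # 32 channels * 7 * 8
fused_input = cnn_features + 128                # CNN features + MLP features
```

Under these assumptions the frame shrinks to 7 × 8 with 32 channels, and concatenating the 128 MLP features gives the 1920 inputs of the fusion layer.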

Action space — MultiDiscrete([3, 3, 2, 2, 2]), 5 independent dimensions learned simultaneously:

Dimension	Values	SNES Button
Steer	straight / left / right	D-pad
Shoulder	none / L-lean / R-lean	L / R
Accelerate	hold / release	B (enables blast turning)
Brake	no / yes	Y
Boost	no / yes	A (Super Jet)
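The table above can be read as a small decoding function. This is a sketch only: the index order within each dimension is assumed to follow the order of the "Values" column, the button names are plain strings, and the function is hypothetical rather than the repository's `actions.py`.

```python
STEER = (None, "LEFT", "RIGHT")      # straight / left / right
SHOULDER = (None, "L", "R")          # none / L-lean / R-lean

def action_to_buttons(action):
    """Map a MultiDiscrete([3, 3, 2, 2, 2]) action to the set of held SNES buttons.

    Index order within each dimension is an assumption based on the table.
    """
    steer, shoulder, accel, brake, boost = action
    buttons = set()
    if STEER[steer]:
        buttons.add(STEER[steer])
    if SHOULDER[shoulder]:
        buttons.add(SHOULDER[shoulder])
    if accel == 0:       # 0 = hold B, 1 = release (releasing enables blast turning)
        buttons.add("B")
    if brake == 1:
        buttons.add("Y")
    if boost == 1:
        buttons.add("A")
    return buttons

N_JOINT_ACTIONS = 3 * 3 * 2 * 2 * 2   # distinct button combinations
```

Factoring the space this way means the policy head needs only 3 + 3 + 2 + 2 + 2 = 12 logits, rather than one output per joint combination (72).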

Training Pipeline

graph TB
    subgraph "Data Collection"
        E["80× SNES Emulators<br/>(SubprocVecEnv)"] -->|"~900 fps"| R["Rollout Buffer<br/>512 steps × 80 envs"]
    end

    subgraph "Policy Update"
        R --> PPO["PPO<br/>4 epochs, batch 2048<br/>clip=0.2, γ=0.99"]
        PPO -->|"Updated weights"| E
    end

    subgraph "Per-Step Reward Shaping"
        RAM["58 RAM Addresses"] --> Proj["Project onto<br/>660-point spline<br/>centerline"]
        Proj --> Delta["Δ = track progress"]
        Delta --> Reward["r = Δ/ref + 0.5·(Δ/ref)²"]
    end

    E -.->|"reads RAM each step"| RAM
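The numbers in the pipeline diagram imply the following update schedule. This is plain arithmetic on the stated hyperparameters (with the 50M total timesteps taken from the training command), assuming each rollout is split into equal minibatches and fully consumed in every epoch; the variable names are illustrative.

```python
N_ENVS = 80
N_STEPS = 512            # rollout length per environment
BATCH_SIZE = 2048
N_EPOCHS = 4
TOTAL_TIMESTEPS = 50_000_000

rollout_size = N_ENVS * N_STEPS                           # transitions collected per update
minibatches_per_epoch = rollout_size // BATCH_SIZE
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS
policy_updates = TOTAL_TIMESTEPS // rollout_size          # full rollouts in the training budget
```

Each rollout holds 40,960 transitions, which divide evenly into 20 minibatches, so every policy update performs 80 gradient steps, and the 50M-step budget covers roughly 1,220 such updates.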

Quick Start

git clone https://github.com/<your-username>/RL-Gaming.git && cd RL-Gaming
pip install -r requirements.txt

# Place your F-Zero ROM
mkdir -p roms && cp /path/to/F-Zero\ \(USA\).sfc roms/

# Train with PPO (80 parallel SNES emulators, ~900 fps)
python -m training.train --algo ppo --timesteps 50000000

# Evaluate a trained model
python -m evaluation.evaluate --model models/fzero_ppo_final.zip --episodes 10

# Other algorithms
python -m training.train --algo dqn --n-envs 4
python -m training.train --algo iqn --n-envs 80

Requirements: Python 3.10–3.12, F-Zero (USA) ROM (.sfc), GPU optional (~200 MB VRAM)

Training Details

Parameter	Value	Parameter	Value
Learning rate	3e-4 (adaptive)	Parallel envs	80
Batch size	2048	Frameskip	3 (20 Hz)
PPO epochs	4	Frame stack	4
γ (gamma)	0.99	Entropy coeff	0.01
GAE λ	0.95	Total timesteps	50M (~10h)

γ=0.99 provides ~5s planning horizon — enough for corner anticipation while keeping value function scale manageable.
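The ~5 s figure follows from the usual rule of thumb that a discount factor γ gives an effective horizon of about 1 / (1 − γ) steps. A short sketch of that arithmetic, with illustrative function names:

```python
def effective_horizon_seconds(gamma, decisions_per_second):
    """Rule-of-thumb horizon: 1 / (1 - gamma) steps, converted to seconds."""
    return 1.0 / (1.0 - gamma) / decisions_per_second

def discount_after(seconds, gamma, decisions_per_second):
    """Weight applied to a reward that arrives `seconds` in the future."""
    return gamma ** (seconds * decisions_per_second)

horizon = effective_horizon_seconds(0.99, 20)   # about 5 s (100 steps at 20 Hz)
weight_at_5s = discount_after(5, 0.99, 20)      # 0.99 ** 100, roughly 1/e
```

A reward 5 s ahead is still weighted at about 37% of an immediate one, so upcoming corners influence the value estimate while events several corners away are largely discounted.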

The reward shape balances three goals: cold-start gradient (linear term), high-speed acceleration (quadratic term), and training stability.

Comparison to Related Work

Aspect	Linesight (Trackmania)	This Project (F-Zero SNES)
Algorithm	IQN (off-policy)	PPO (on-policy)
Action space	Continuous (analog)	MultiDiscrete (digital SNES)
Reward	progress − fixed penalty	progress (linear + quadratic)
Throughput	~20 fps (3D game)	~900 fps (SNES emulation)
Result	Beat 10/12 world records	Within 6% of WR (vs. "unsolvable")

Project Structure

RL-Gaming/
├── env/                          # Gymnasium environment (Stable-Retro wrapper)
│   ├── fzero_env.py             # Main wrapper: dual obs, shaped reward, termination
│   ├── rewards.py               # Dense reward via spline projection (660-pt centerline)
│   ├── observations.py          # CNN frame processing + float feature builder
│   ├── actions.py               # MultiDiscrete ↔ SNES button mapping
│   └── FZero-Snes/              # Stable-Retro integration (58 RAM addresses)
├── network/
│   ├── dual_input.py            # CNN+MLP feature extractor (SB3 compatible)
│   └── iqn.py                   # IQN with branching dueling heads
├── training/
│   ├── train.py                 # CLI entrypoint (PPO / DQN / QR-DQN / IQN)
│   ├── config.py                # All hyperparameters — single source of truth
│   ├── iqn_trainer.py           # Custom IQN training loop
│   └── callbacks.py             # Adaptive LR, best-model saving, W&B logging
├── evaluation/
│   ├── evaluate.py              # Run model, collect race times & metrics
│   └── overlay.py               # Real-time debug overlay on game frames
├── tests/                       # Unit tests (rewards, actions, network, observations)
├── docs/                        # Design docs, experiment log, analysis plots
├── scripts/                     # Utilities (centerline gen, reward analysis, GIF recording)
└── models/                      # Trained checkpoints

Acknowledgments

Linesight — Architecture inspiration: dual-input CNN+MLP, IQN with branching dueling heads
Stable-Retro — SNES emulation with Gymnasium API
Stable-Baselines3 — PPO, DQN, and vectorized environment infrastructure
SnesLab F-Zero RAM Map — RAM address documentation

References

Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016) — labeled F-Zero unsolved for RL
Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
Dabney et al., "Implicit Quantile Networks for Distributional RL" (2018)

@misc{fzero-rl-2026,
  author = {Zhou, Yincheng},
  title  = {F-Zero RL: Solving an "Unsolved" Racing Challenge with Reward Shaping},
  year   = {2026},
  url    = {https://github.com/<your-username>/RL-Gaming}
}

Built with reward engineering, 80 parallel SNES emulators, and a lot of wall collisions.
