ArtysicistZ/FZero


F-Zero RL

Train an RL agent to race an "unsolvable" SNES game, finishing within 6% of the human world record.

World Top 300 — F-Zero Mute City I

Python 3.10–3.12 · PyTorch 2.1+ · Stable-Baselines3 · Stable-Retro · W&B · License: Apache 2.0


Trained PPO agent racing Mute City I — Fire Stingray, 5 laps


Results · Key Insight · Architecture · Quick Start · Training Details


The Problem

F-Zero (SNES, 1991) was labeled an "unsolved challenge" for reinforcement learning in a 2016 academic paper. The reason: the game provides zero intermediate score during racing. The only feedback is a final time after a 5-lap race — roughly 6,000 decision steps of pure exploration before any reward signal. Standard DQN and A3C agents could not even finish a single race.

This project solves F-Zero by reading 58 RAM addresses directly from the SNES emulator to construct dense, per-step rewards from track progress. Combined with a dual-input neural architecture adapted from Linesight (the Trackmania AI that beat 10/12 world records), the agent learns to race competitively in ~10 hours of training.


Results

| Metric | This Project | Human World Record | 2016 Paper |
|---|---|---|---|
| Race Time (5 laps) | 125.22s | 117.96s | Did not finish |
| Gap to WR | ~6% | | |
| Laps Completed | 5/5 | 5/5 | 0–2/5 |
| Training Time | ~10 hours | Years of practice | |

Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016) reported F-Zero as unsolved — standard DQN/A3C could not complete races due to the extreme sparse reward problem.

Agent behavior at convergence:

  • Consistently completes 5-lap races on Mute City I
  • Uses boost (Super Jet) on 43.6% of steps
  • Employs blast turning (rapid accel toggle) on 3% of steps
  • Applies shoulder lean on 20% of corners

Key Insight: Reward Shaping

F-Zero gives the agent nothing to learn from: no score, no checkpoints, no progress bar. Just 5 laps of silence, then a final time. At 20 decisions/second, the agent must explore ~6,000 steps blindly before receiving a single reward. Without denser feedback, standard algorithms such as DQN and A3C fail to finish even one race.

Our solution: Read 58 RAM addresses from the SNES emulator to extract the car's real-time position, then project it onto a spline-interpolated track centerline to compute dense, per-step progress rewards.

Mute City I track centerline from 58 RAM checkpoints

58 checkpoint coordinates from RAM (blue dots) spline-interpolated into a smooth 660-point centerline. Reward = progress along this curve each step.
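The projection step can be sketched in a few lines. This is a minimal illustration, not the code in `env/rewards.py`: the circular stand-in centerline, the nearest-vertex projection, and every function name below are assumptions made for the example. The real centerline would come from spline-interpolating the 58 RAM checkpoints.

```python
import math

def build_centerline(n_points=660, radius=100.0):
    """Stand-in closed centerline: a circle sampled at n_points (assumption)."""
    return [(radius * math.cos(2 * math.pi * i / n_points),
             radius * math.sin(2 * math.pi * i / n_points))
            for i in range(n_points)]

def cumulative_arc_length(points):
    """Arc length from point 0 to every point, plus the total loop length."""
    arc = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        arc.append(arc[-1] + math.hypot(x1 - x0, y1 - y0))
    closing = math.hypot(points[0][0] - points[-1][0],
                         points[0][1] - points[-1][1])
    return arc, arc[-1] + closing

def project(pos, points, arc):
    """Arc length of the centerline vertex nearest to pos."""
    nearest = min(range(len(points)),
                  key=lambda i: (points[i][0] - pos[0]) ** 2
                              + (points[i][1] - pos[1]) ** 2)
    return arc[nearest]

def progress_delta(s_prev, s_now, total):
    """Signed progress between two steps, unwrapped across the start line."""
    d = s_now - s_prev
    if d > total / 2:
        d -= total
    elif d < -total / 2:
        d += total
    return d

centerline = build_centerline()
arc, total = cumulative_arc_length(centerline)

# The car crosses the start line: just before it, then just after it.
s_before = project((100.0, -3.0), centerline, arc)
s_after = project((100.0, 3.0), centerline, arc)
delta = progress_delta(s_before, s_after, total)  # small and positive
```

The unwrapping in `progress_delta` matters on a closed track: without it, crossing the start line would register as a large negative jump of almost one full lap.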

Reward per step:

r = (delta / ref) + 0.5 * (delta / ref)^2

where delta = track progress per step, ref = average delta at 135s race pace

The linear term provides learning gradient even at low speeds (critical for cold start). The quadratic term creates accelerating gradient at higher speeds, breaking the plateau where simpler rewards stall.
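The formula is small enough to write out directly. The sketch below assumes a hypothetical lap length of 1000 track units (the README does not state the real value) and derives `ref` from the 135 s reference pace at 20 decisions per second; the function names are illustrative.

```python
def shaped_reward(delta: float, ref: float) -> float:
    """r = x + 0.5 * x^2 with x = delta / ref."""
    x = delta / ref
    return x + 0.5 * x * x

def reward_slope(delta: float, ref: float) -> float:
    """Derivative dr/d(delta) = (1 + delta / ref) / ref."""
    return (1.0 + delta / ref) / ref

LAP_LENGTH = 1000.0                # hypothetical track units per lap (assumption)
STEPS_AT_REF_PACE = 135 * 20       # 135 s at 20 decisions per second
ref = 5 * LAP_LENGTH / STEPS_AT_REF_PACE   # average progress per step at that pace

r_standing = shaped_reward(0.0, ref)       # no progress, no reward
r_ref_pace = shaped_reward(ref, ref)       # 1 + 0.5 = 1.5 at reference pace
slope_ratio = reward_slope(ref, ref) / reward_slope(0.0, ref)  # 2x steeper at ref pace
```

The slope is nonzero at a standstill (the cold-start gradient) and doubles by the time the agent reaches reference pace. One property worth noting: as written, the formula is not monotone for `delta < -ref`, so fast reverse progress would need separate handling; the README does not say whether the repository clamps or penalizes it.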

Reward function analysis

Linear+quadratic shaping provides increasing gradient at higher speeds, breaking the speed plateau that pure linear rewards create.


Architecture

The policy network uses a dual-input architecture adapted from Linesight: a CNN processes game frames while an MLP processes RAM-extracted features, fused into a shared representation for PPO's policy and value heads.

```mermaid
graph LR
    subgraph Observation
        A["Game Screen<br/>4 × 84 × 96 grayscale"]
        B["RAM Features<br/>speed, energy,<br/>track preview (59-dim)"]
    end

    subgraph "Dual-Input Feature Extractor (~2-3M params)"
        A --> CNN["CNN<br/>4→16→32→64→32<br/>LeakyReLU"]
        B --> MLP["MLP<br/>59→128→128<br/>LeakyReLU"]
        CNN -->|"1792"| Concat["Concat"]
        MLP -->|"128"| Concat
        Concat --> Fusion["Linear 1920→1024<br/>+ LeakyReLU"]
    end

    subgraph "PPO Output Heads"
        Fusion --> Policy["Policy Head<br/>MultiDiscrete(3,3,2,2,2)"]
        Fusion --> Value["Value Head<br/>scalar V(s)"]
    end
```
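The layer widths in the diagram can be sanity-checked with plain arithmetic. The channel counts, the 59-dim feature vector, and the 1920→1024 fusion layer come from the diagram. The kernel sizes and strides are assumptions, chosen here only because they reproduce the 1792-wide CNN output from an 84 × 96 input with no padding; the repository's actual values may differ.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

# (in_channels, out_channels, kernel, stride); kernels and strides are assumed.
CONV_LAYERS = [(4, 16, 4, 2), (16, 32, 4, 2), (32, 64, 3, 2), (64, 32, 3, 1)]

def cnn_flat_size(height: int = 84, width: int = 96) -> int:
    """Flattened CNN output size for a stacked grayscale observation."""
    h, w = height, width
    for _, _, kernel, stride in CONV_LAYERS:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return CONV_LAYERS[-1][1] * h * w

def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    return c_in * c_out * kernel * kernel + c_out   # weights + biases

def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out                     # weights + biases

def extractor_params() -> int:
    """Parameter count of the feature extractor only (no policy/value heads)."""
    cnn = sum(conv_params(ci, co, k) for ci, co, k, _ in CONV_LAYERS)
    mlp = linear_params(59, 128) + linear_params(128, 128)
    fusion = linear_params(cnn_flat_size() + 128, 1024)
    return cnn + mlp + fusion

flat = cnn_flat_size()        # 32 channels x 7 x 8
total = extractor_params()    # dominated by the 1920 -> 1024 fusion layer
```

Under these assumptions the extractor comes to roughly 2.04M parameters, consistent with the "~2-3M params" label, and almost all of them sit in the fusion layer rather than the convolutions.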

Action space: MultiDiscrete([3, 3, 2, 2, 2]), five independent dimensions learned simultaneously:

| Dimension | Values | SNES Button |
|---|---|---|
| Steer | straight / left / right | D-pad |
| Shoulder | none / L-lean / R-lean | L / R |
| Accelerate | hold / release | B (enables blast turning) |
| Brake | no / yes | Y |
| Boost | no / yes | A (Super Jet) |
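A mapping from the MultiDiscrete vector to a button array might look like the sketch below. It is not the contents of `env/actions.py`: the index conventions (for example, accelerate index 0 meaning "hold") follow the order of the table above, and the 12-button ordering is assumed to match Stable-Retro's SNES core.

```python
# Assumed Stable-Retro SNES button order.
SNES_BUTTONS = ["B", "Y", "SELECT", "START", "UP", "DOWN",
                "LEFT", "RIGHT", "A", "X", "L", "R"]

STEER = {0: None, 1: "LEFT", 2: "RIGHT"}   # straight / left / right
SHOULDER = {0: None, 1: "L", 2: "R"}       # none / L-lean / R-lean

def decode_action(action):
    """Convert [steer, shoulder, accel, brake, boost] to a 12-entry 0/1 list."""
    steer, shoulder, accel, brake, boost = action
    pressed = set()
    if STEER[steer] is not None:
        pressed.add(STEER[steer])
    if SHOULDER[shoulder] is not None:
        pressed.add(SHOULDER[shoulder])
    if accel == 0:          # index 0 = hold the accelerator (assumption)
        pressed.add("B")
    if brake == 1:
        pressed.add("Y")
    if boost == 1:
        pressed.add("A")
    return [1 if button in pressed else 0 for button in SNES_BUTTONS]

cruise = decode_action([0, 0, 0, 0, 0])      # straight, accelerator held
blast_turn = decode_action([1, 2, 1, 0, 1])  # left + R-lean, accel released, boost
n_combinations = 3 * 3 * 2 * 2 * 2           # joint actions the policy can express
```

Factoring the controller into five small heads keeps each output at two or three logits while still covering all 72 joint button combinations.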

Training Pipeline

```mermaid
graph TB
    subgraph "Data Collection"
        E["80× SNES Emulators<br/>(SubprocVecEnv)"] -->|"~900 fps"| R["Rollout Buffer<br/>512 steps × 80 envs"]
    end

    subgraph "Policy Update"
        R --> PPO["PPO<br/>4 epochs, batch 2048<br/>clip=0.2, γ=0.99"]
        PPO -->|"Updated weights"| E
    end

    subgraph "Per-Step Reward Shaping"
        RAM["58 RAM Addresses"] --> Proj["Project onto<br/>660-point spline<br/>centerline"]
        Proj --> Delta["Δ = track progress"]
        Delta --> Reward["r = Δ/ref + 0.5·(Δ/ref)²"]
    end

    E -.->|"reads RAM each step"| RAM
```
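The numbers in the pipeline imply a fixed update schedule, which the following arithmetic makes explicit. The dictionary keys are real Stable-Baselines3 `PPO` constructor arguments and the values are taken from this README, but whether `training/config.py` is organized this way is an assumption.

```python
# Values from this README; key names follow Stable-Baselines3's PPO arguments.
ppo_kwargs = dict(
    n_steps=512,          # rollout length per environment
    batch_size=2048,      # minibatch size
    n_epochs=4,           # passes over each rollout
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    learning_rate=3e-4,   # initial value; the README describes it as adaptive
)
N_ENVS = 80
TOTAL_TIMESTEPS = 50_000_000

rollout_size = ppo_kwargs["n_steps"] * N_ENVS                    # samples per update
minibatches_per_epoch = rollout_size // ppo_kwargs["batch_size"]
gradient_steps_per_update = minibatches_per_epoch * ppo_kwargs["n_epochs"]
policy_updates = TOTAL_TIMESTEPS // rollout_size
```

Each rollout of 40,960 samples is split into 20 minibatches and reused for 4 epochs, so a 50M-step run performs about 1,220 policy updates of 80 gradient steps each. With a dictionary observation, a dictionary like this could plausibly be passed as `PPO("MultiInputPolicy", env, **ppo_kwargs)`.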

Quick Start

```bash
git clone https://github.com/<your-username>/RL-Gaming.git && cd RL-Gaming
pip install -r requirements.txt

# Place your F-Zero ROM
mkdir -p roms && cp /path/to/F-Zero\ \(USA\).sfc roms/

# Train with PPO (80 parallel SNES emulators, ~900 fps)
python -m training.train --algo ppo --timesteps 50000000

# Evaluate a trained model
python -m evaluation.evaluate --model models/fzero_ppo_final.zip --episodes 10

# Other algorithms
python -m training.train --algo dqn --n-envs 4
python -m training.train --algo iqn --n-envs 80
```

Requirements: Python 3.10–3.12, F-Zero (USA) ROM (.sfc), GPU optional (~200 MB VRAM)


Training Details

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Learning rate | 3e-4 (adaptive) | Parallel envs | 80 |
| Batch size | 2048 | Frameskip | 3 (20 Hz) |
| PPO epochs | 4 | Frame stack | 4 |
| γ (gamma) | 0.99 | Entropy coeff | 0.01 |
| GAE λ | 0.95 | Total timesteps | 50M (~10h) |

Discount factor analysis

γ=0.99 provides ~5s planning horizon — enough for corner anticipation while keeping value function scale manageable.
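The ~5 s figure follows from the usual effective-horizon approximation 1 / (1 − γ), converted to seconds at the 20 Hz decision rate. A short worked check (function and variable names are illustrative):

```python
DECISION_HZ = 20  # 60 fps emulator with frameskip 3

def effective_horizon_steps(gamma: float) -> float:
    """Rule-of-thumb planning horizon: the sum of the geometric discount series."""
    return 1.0 / (1.0 - gamma)

horizon_steps = effective_horizon_steps(0.99)        # about 100 decisions
horizon_seconds = horizon_steps / DECISION_HZ        # about 5 seconds
weight_at_horizon = 0.99 ** round(horizon_steps)     # about 1/e of full weight
```

A reward 100 steps ahead still carries roughly 37% of its value, so the horizon is a soft cutoff rather than a hard one. Raising γ to 0.999 would stretch the horizon to about 50 s, at the cost of a tenfold larger value scale.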

Reward weight optimization

Balancing cold-start gradient (linear), high-speed acceleration (quadratic), and training stability.


Comparison to Related Work

| Aspect | Linesight (Trackmania) | This Project (F-Zero SNES) |
|---|---|---|
| Algorithm | IQN (off-policy) | PPO (on-policy) |
| Action space | Continuous (analog) | MultiDiscrete (digital SNES) |
| Reward | progress − fixed penalty | progress (linear + quadratic) |
| Throughput | ~20 fps (3D game) | ~900 fps (SNES emulation) |
| Result | Beat 10/12 world records | Within 6% of WR (vs. "unsolvable") |

Project Structure

```text
RL-Gaming/
├── env/                          # Gymnasium environment (Stable-Retro wrapper)
│   ├── fzero_env.py             # Main wrapper: dual obs, shaped reward, termination
│   ├── rewards.py               # Dense reward via spline projection (660-pt centerline)
│   ├── observations.py          # CNN frame processing + float feature builder
│   ├── actions.py               # MultiDiscrete ↔ SNES button mapping
│   └── FZero-Snes/              # Stable-Retro integration (58 RAM addresses)
├── network/
│   ├── dual_input.py            # CNN+MLP feature extractor (SB3 compatible)
│   └── iqn.py                   # IQN with branching dueling heads
├── training/
│   ├── train.py                 # CLI entrypoint (PPO / DQN / QR-DQN / IQN)
│   ├── config.py                # All hyperparameters, single source of truth
│   ├── iqn_trainer.py           # Custom IQN training loop
│   └── callbacks.py             # Adaptive LR, best-model saving, W&B logging
├── evaluation/
│   ├── evaluate.py              # Run model, collect race times & metrics
│   └── overlay.py               # Real-time debug overlay on game frames
├── tests/                       # Unit tests (rewards, actions, network, observations)
├── docs/                        # Design docs, experiment log, analysis plots
├── scripts/                     # Utilities (centerline gen, reward analysis, GIF recording)
└── models/                      # Trained checkpoints
```

Acknowledgments

References

  1. Bhatt et al., "Playing SNES in the Retro Learning Environment" (2016). Labeled F-Zero unsolved for RL.
  2. Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
  3. Dabney et al., "Implicit Quantile Networks for Distributional Reinforcement Learning" (2018)

```bibtex
@misc{fzero-rl-2026,
  author = {Zhou, Yincheng},
  title  = {F-Zero RL: Solving an "Unsolved" Racing Challenge with Reward Shaping},
  year   = {2026},
  url    = {https://github.com/<your-username>/RL-Gaming}
}
```

Apache 2.0 License © 2026 Zhou Yincheng

Built with reward engineering, 80 parallel SNES emulators, and a lot of wall collisions.
