Here I implement the agents as I read about them in Sutton and Barto's book, Reinforcement Learning: An Introduction (2nd edition). Unless stated otherwise, each agent is trained to estimate q* using an ε-greedy policy, then evaluated using a greedy policy on the action values obtained through training.
- On-policy first-visit Monte Carlo
- Off-policy Monte Carlo
- SARSA(0)
- Q-Learning
- Expected SARSA
- Double Q-Learning
- n-step SARSA
- Off-Policy n-step SARSA
- n-step Tree Backup
- Tabular Dyna-Q
- Episodic Semi-Gradient SARSA
- SARSA(λ)
I test each of these agents on several Gymnasium environments.
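
To illustrate the train-with-ε-greedy, evaluate-with-greedy workflow described above, here is a minimal sketch using tabular Q-Learning. It is not the code from this repository: to stay self-contained it uses a hypothetical five-state corridor instead of a Gymnasium environment, and the names `step`, `train`, `evaluate`, and all hyperparameter values are assumptions chosen for illustration.

```python
import random

N_STATES = 5       # states 0..4; state 4 is terminal (hypothetical toy corridor)
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic dynamics: reward -1 per step until the right end is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, -1.0, done


def greedy(q, state):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[state])
    return random.choice([a for a in ACTIONS if q[state][a] == best])


def epsilon_greedy(q, state, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return greedy(q, state)


def train(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    random.seed(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q, state, epsilon)
            next_state, reward, done = step(state, action)
            # Q-Learning target: bootstrap from the best next action unless terminal.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def evaluate(q, max_steps=100):
    """Run one episode with the greedy policy and return the total reward."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, greedy(q, state))
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    q_table = train()
    print("greedy return:", evaluate(q_table))
```

With a Gymnasium environment, the `step` function would be replaced by the environment's own `reset` and `step` calls; the split between an exploring training policy and a greedy evaluation policy stays the same.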