This repo is inspired by, and follows along with, Andrej Karpathy's Neural Networks: Zero to Hero YouTube playlist.
In this repo, I built a GPT-2 clone based on the GPT-2 paper and the official GPT-2 repository. The model is built using only decoder blocks with causal self-attention; it does not implement the encoder or the cross-attention layers shown in the original Transformer architecture (Attention Is All You Need).
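To make the decoder self-attention idea concrete, here is a minimal sketch of the causal masking step in plain Python. It is not the code from this repo: the function name `causal_attention_weights` and the toy scores are made up for illustration, and a real implementation would run on batched tensors with multiple heads.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the raw attention score of query position i for key
    position j. Position i may only attend to positions j <= i.
    (Hypothetical helper for illustration, not taken from this repo.)
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # keys at or before position i
        peak = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)    # future positions get zero weight
        weights.append([e / total for e in exps] + masked)
    return weights

# Toy 3-token example with arbitrary scores.
scores = [
    [0.5, 2.0, -1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Each row sums to 1 and has zeros above the diagonal, so a token never sees the tokens that come after it. That is what lets the model be trained to predict the next token.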
The model in this repo has 124M parameters and is trained on the 10B-token sample of the FineWeb-Edu dataset.
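As a rough check on the 124M figure, the sketch below counts parameters for the standard GPT-2 small configuration. The hyperparameters (vocabulary of 50257, context length of 1024, 12 layers, embedding width of 768, output head tied to the token embedding) are assumptions taken from the published GPT-2 small setup; this repo may differ, for example by padding the vocabulary.

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
    """Count parameters of a GPT-2 style decoder (assumed configuration).

    Assumes the output head shares weights with the token embedding,
    so it adds no parameters of its own.
    """
    wte = vocab_size * n_embd                  # token embeddings
    wpe = block_size * n_embd                  # learned position embeddings
    layer_norm = 2 * n_embd                    # scale and bias

    # Attention: fused q/k/v projection plus output projection, with biases.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, with biases.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    block = 2 * layer_norm + attn + mlp        # two layer norms per block
    return wte + wpe + n_layer * block + layer_norm  # plus the final layer norm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 124,439,808, which is where the "124M" label comes from.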
Evaluation was done using HellaSwag, and the results were compared to the 124M-parameter GPT-2 model and the 124M-parameter GPT-3 model.
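HellaSwag gives the model a context and four candidate endings, only one of which is correct. One common way to score a language model on it, sketched below, is to compute the per-token loss of each ending and pick the ending with the lowest average. This is an assumption about the scoring rule rather than a copy of this repo's evaluation code; `pick_ending` and all the numbers are hypothetical.

```python
def pick_ending(token_losses, completion_masks):
    """Return the index of the candidate ending with the lowest mean loss.

    token_losses[k][t] is the model's cross-entropy loss at token t of
    candidate k; completion_masks[k][t] is 1 for ending tokens and 0 for
    the shared context tokens, which are excluded from the average.
    (Hypothetical helper; the numbers below are made up.)
    """
    averages = []
    for losses, mask in zip(token_losses, completion_masks):
        kept = [loss for loss, m in zip(losses, mask) if m == 1]
        averages.append(sum(kept) / len(kept))
    return min(range(len(averages)), key=averages.__getitem__)

# One made-up example: 2 context tokens followed by 2 ending tokens.
losses = [
    [2.0, 3.0, 4.0, 4.0],
    [2.0, 3.0, 1.0, 2.0],
    [2.0, 3.0, 5.0, 1.0],
    [2.0, 3.0, 3.0, 3.5],
]
masks = [[0, 0, 1, 1]] * 4
label = 1  # index of the correct ending in this toy example

prediction = pick_ending(losses, masks)
print("predicted:", prediction, "correct:", prediction == label)
```

Accuracy is then the fraction of examples where the predicted index matches the label. Averaging over ending tokens, rather than summing, keeps longer endings from being penalized just for their length.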
After evaluating with both cross-entropy loss and HellaSwag, I obtained these results:
In the image above, you can find the minimum training and validation loss and the maximum HellaSwag accuracy.
The chart on the left compares the loss of myGPT-2 with that of the OpenAI 124M GPT-2 model. It's very cool that myGPT-2 reached a lower loss than the OpenAI model!
The chart on the right compares the HellaSwag accuracy of myGPT-2 with that of the OpenAI 124M GPT-2 and GPT-3 models. Again, myGPT-2 beat the OpenAI 124M GPT-2 model, but it did not get close to the OpenAI 124M GPT-3 model.