GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
Updated May 5, 2026 - Python
Java library implementing Byte-Pair Encoding Tokenization
A lightweight, from-scratch implementation of Byte Pair Encoding (BPE) tokenization in Python.
This project implements a tokenizer based on the Byte Pair Encoding (BPE) algorithm, with additional custom tokenizers, including one similar to the GPT-4 tokenizer.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models"
From-scratch Marathi BPE tokenizer with a Flask API and web interface for encoding and decoding Marathi text using the Byte Pair Encoding algorithm.
A Python package that builds a corpus vocabulary using byte pair encoding, plus a tokenizer that tokenizes input text against the built vocabulary.
MayaTok is a Byte Pair Encoding based Tokenizer.
A web app to compare pre-built or self-built tokenizers
String tokenization with Byte Pair Encoding (BPE).
Natural Language Processing course assignments @ University of Tehran
Self-made byte pair encoding tokenizer
Implementation of the BPE algorithm and training of the generated tokens
Byte Pair Encoding (BPE) in the Zig programming language (0.13.0)
This repository is a reimplementation of the Transformer model introduced in the 2017 NeurIPS paper "Attention Is All You Need"
TF-IDF Calculation
A pure Python implementation of a Byte Pair Encoding (BPE) tokenizer. Train on any text, encode/decode with saved models, and explore BPE tokenization fundamentals.
Byte-pair algorithm used in GPT tokenization, based on the paper "Language Models are Unsupervised Multitask Learners"