piecer

An unicode codepoint based BPE tokenizer with byte fallback. The core vocabulary always includes full byte range (0x00-0xff) + any codepoints found in the training string.

At the encoding stage, any input that cannot be mapped to multi-byte vocabulary elements will be represented as a sequence of byte tokens. During the decoding stage, the tokenizer attempts to reconstruct valid, printable Unicode characters or prints hex code sequences like <e9><e9><ff> if it fails.

Relies on priority-queue and daachorse for pair merging and pattern matching.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
shakespeare.txt		shakespeare.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

piecer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

piecer

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages