Skip to content

wildstyl3r/piecer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

piecer

An unicode codepoint based BPE tokenizer with byte fallback. The core vocabulary always includes full byte range (0x00-0xff) + any codepoints found in the training string.

At the encoding stage, any input that cannot be mapped to multi-byte vocabulary elements will be represented as a sequence of byte tokens. During the decoding stage, the tokenizer attempts to reconstruct valid, printable Unicode characters or prints hex code sequences like <e9><e9><ff> if it fails.

Relies on priority-queue and daachorse for pair merging and pattern matching.

About

Unicode codepoint-based BPE tokenizer with byte fallback

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages