Static Tokenizer by sfluegel05 · Pull Request #169 · ChEB-AI/python-chebai

sfluegel05 · 2026-05-06T18:21:59Z

This implements a new static tokenizer for SMILES (see #166).

The new tokenizer needs only 572 tokens, which it achieves by splitting each atom into 5 tokens (element, charge, isotope, stereochemistry, and hydrogen count). As demonstrated in #166, all SMILES strings in ChEBI and PubChem can be parsed with this tokenizer.
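To make the five-way split concrete, here is a minimal sketch that breaks a bracket atom into the five fields named above. It is not the code from this PR: the regex, the function name `split_atom`, and the field names are assumptions for illustration, and it only handles bracket atoms written in the usual order (isotope, symbol, chirality, hydrogen count, charge).

```python
import re

# Hypothetical pattern for a bracket atom such as [13C], [C@H] or [NH4+].
# Order: isotope, element symbol, chirality, hydrogen count, charge, optional atom class.
BRACKET_ATOM = re.compile(
    r"^\[(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?|[a-z][a-z]?|\*)"
    r"(?P<stereo>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]\d+|\++|-+)?"
    r"(?::\d+)?\]$"
)


def split_atom(atom: str) -> dict:
    """Split one bracket atom into element, charge, isotope, stereo and H count."""
    m = BRACKET_ATOM.match(atom)
    if m is None:
        raise ValueError(f"not a bracket atom this sketch understands: {atom!r}")

    h = m["hcount"]
    if h is None:
        hcount = 0
    elif h == "H":
        hcount = 1  # a bare H means exactly one hydrogen
    else:
        hcount = int(h[1:])

    c = m["charge"]
    if c is None:
        charge = 0
    elif c[1:].isdigit():
        charge = int(c)  # "+2" -> 2, "-1" -> -1
    else:
        charge = len(c) if c[0] == "+" else -len(c)  # "+" -> 1, "--" -> -2

    return {
        "element": m["element"],
        "charge": charge,
        "isotope": int(m["isotope"]) if m["isotope"] else None,
        "stereo": m["stereo"] or "",
        "hcount": hcount,
    }


ammonium = split_atom("[NH4+]")
```

Each of the five fields would then map to its own token, so the vocabulary only needs one entry per possible field value rather than one entry per distinct bracket atom.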

The implementation also includes a decoder that reconstructs SMILES strings as far as possible. Some SMILES cannot be reconstructed exactly because the encoding is not injective: for example, [1*] and [2*] both get resolved to *.
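A quick way to see which inputs a lossy encoding merges is to group inputs by their encoded form. The sketch below uses a toy encoder (`toy_encode`, hypothetical, not the tokenizer from this PR) that simply forgets isotope labels on wildcard atoms, mirroring the [1*] / [2*] example; `find_collisions` is likewise an illustrative helper.

```python
import re
from collections import defaultdict


def toy_encode(smiles: str) -> tuple:
    """Toy lossy encoder: drops isotope labels on wildcard atoms, then splits into characters."""
    return tuple(re.sub(r"\[\d+\*\]", "*", smiles))


def find_collisions(encode, inputs):
    """Return groups of inputs that share the same encoding."""
    groups = defaultdict(list)
    for s in inputs:
        groups[encode(s)].append(s)
    return [group for group in groups.values() if len(group) > 1]


collisions = find_collisions(toy_encode, ["[1*]", "[2*]", "CC", "[13C]"])
```

Any group returned here is a set of inputs that no decoder can tell apart, so at most one member of each group can survive a roundtrip unchanged.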

Todo

check the implications of longer inputs for the ELECTRA model
make this reader the default for PubChem and ChEBI classes

sfluegel05 · 2026-05-11T07:39:04Z

I changed the tokenizer to skip some default tokens. This brings down the input length to a maximum of 1,196 tokens (for a 1,674 character SMILES, CHEBI:156639).
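The idea of skipping default tokens can be sketched as follows. Which values the PR actually treats as defaults is not stated here, so the `DEFAULTS` table, the token spelling, and the function name `atom_tokens` are assumptions for illustration only.

```python
# Assumed defaults: neutral charge, no isotope label, no stereo mark, no explicit hydrogens.
DEFAULTS = {"charge": 0, "isotope": None, "stereo": "", "hcount": 0}


def atom_tokens(fields: dict, skip_defaults: bool = True) -> list:
    """Turn one atom's fields into tokens, optionally omitting fields at their default value."""
    tokens = [f"element={fields['element']}"]  # the element token is always emitted
    for name in ("charge", "isotope", "stereo", "hcount"):
        value = fields[name]
        if skip_defaults and value == DEFAULTS[name]:
            continue
        tokens.append(f"{name}={value}")
    return tokens


carbon = {"element": "C", "charge": 0, "isotope": None, "stereo": "", "hcount": 0}
ammonium = {"element": "N", "charge": 1, "isotope": None, "stereo": "", "hcount": 4}

short_carbon = atom_tokens(carbon)
full_carbon = atom_tokens(carbon, skip_defaults=False)
short_ammonium = atom_tokens(ammonium)
```

In this sketch a plain carbon shrinks from five tokens to one, which is the kind of saving that matters most for long, mostly unadorned SMILES strings.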

Training with this tokenizer is successful on ChEBI:

Run	Tokenisation	Time on A100	Micro-F1	Macro-F1
1	From list	15h 36m	*	*
2	From list	**	0.89131	0.6193
3	Static	23h 0m	0.89557	0.63011

* Not comparable: different loss function
** Not comparable: different GPU

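Overall, the static tokenization run took longer (23h 0m vs. 15h 36m for run 1) but got slightly better results than run 2 (micro-F1 0.89557 vs. 0.89131, macro-F1 0.63011 vs. 0.6193). The difference may not be significant.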

sfluegel05 · 2026-05-27T16:51:05Z

@copilot We have unit tests for the ChemDataReader in tests/unit/readers/testChemDataReader.py, but similar tests for the StaticSMILESReader are missing. Please add tests ensuring that the StaticSMILESReader always produces the same tokens for the same input, and that it decodes tokens to the original SMILES correctly.

sfluegel05 · 2026-05-27T17:09:03Z

Ok, since copilot doesn't want to, I asked claude:

Determinism tests (3):

test_same_input_produces_same_tokens — identical SMILES called twice with different reader instances gives identical output
test_non_canonical_input_matches_canonical — different representations of aspirin produce the same tokens (since RDKit canonicalizes before encoding)
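The commit list for this PR mentions a fix for a non-deterministic set operation; the exact change is not shown in this conversation. One common source of that kind of problem in Python is assigning indices while iterating over a set of strings, whose order can differ between interpreter runs because string hashing is randomized. A minimal sketch of the usual remedy, with a hypothetical `build_vocab` helper:

```python
def build_vocab(tokens):
    """Assign token indices in a reproducible order.

    sorted() fixes the order; enumerating the raw set directly could yield
    different indices in different interpreter runs.
    """
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


vocab_a = build_vocab(["b", "a", "b", "c"])
vocab_b = build_vocab(["c", "b", "a"])
```

Both calls produce the same mapping regardless of input order or duplicates, which is the property the determinism tests check at the reader level.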

Static vocabulary tests (2):

test_vocabulary_does_not_grow — encoding multiple SMILES leaves vocab size unchanged (contrast with ChemDataReader which appends new tokens)
test_unknown_tokens_use_unknown_idx — tokens outside the vocabulary map to UNKNOWN_TOKEN_IDX=3 rather than being added
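The two properties tested here can be sketched with a small fixed lookup table. The class `StaticVocab` and the assumption that indices 0 to 3 are reserved (so real tokens start at 4) are hypothetical; only the value UNKNOWN_TOKEN_IDX=3 comes from the test description above.

```python
UNKNOWN_TOKEN_IDX = 3  # value quoted in the test description
FIRST_TOKEN_IDX = 4    # assumption: indices 0..3 are reserved for special tokens


class StaticVocab:
    """Fixed token-to-index table that never grows during encoding."""

    def __init__(self, tokens):
        self._index = {tok: i + FIRST_TOKEN_IDX for i, tok in enumerate(tokens)}

    def encode(self, tokens):
        # Unseen tokens fall back to the unknown index instead of being appended.
        return [self._index.get(tok, UNKNOWN_TOKEN_IDX) for tok in tokens]

    def __len__(self):
        return len(self._index)


vocab = StaticVocab(["C", "N", "O"])
size_before = len(vocab)
encoded = vocab.encode(["C", "Xx", "O"])
size_after = len(vocab)
```

A growing vocabulary would instead assign "Xx" a fresh index on first sight, which makes token indices depend on the order in which data is read.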

Decode roundtrip tests (5):

Simple organic molecule (aspirin)
Stereochemistry ([C@H])
Isotope label ([13C])
Charged atom ([NH4+])
Bracketed aromatic atom ([nH] in benzimidazole)

Invalid input (1):

test_invalid_smiles_returns_none — non-parseable strings return None

sfluegel05 added 3 commits May 6, 2026 15:24

add static smiles tokenizer

e33ac08

update for PubChem tokens

17021bf

correctly reassamble SMILES (as far as possible)

a219d45

sfluegel05 linked an issue May 6, 2026 that may be closed by this pull request

New tokenisation #166

Closed

sfluegel05 added 3 commits May 7, 2026 10:44

make static reader default for chebi

cfbf418

change electra vocab size

aaf5c6e

omit default tokens -> shortens overall representation massively

8ed618c

sfluegel05 added 3 commits May 27, 2026 18:19

Merge branch 'dev' into feature/static-tokenisation

93454f0

make static tokenization default for pubchem

03baa9c

fix non-determinstic set operation

2dc40b3

Copilot started work on behalf of sfluegel05 May 27, 2026 16:51

Copilot stopped work on behalf of sfluegel05 due to an error May 27, 2026 16:52
Request session.create failed with message: Model "gpt-5.3-codex" is not available.

add tests for static SMILES reader

b5a5039

sfluegel05 marked this pull request as ready for review May 27, 2026 17:09

sfluegel05 merged commit 122e45c into dev May 27, 2026
5 checks passed

sfluegel05 deleted the feature/static-tokenisation branch May 27, 2026 17:09

sfluegel05 merged 10 commits into dev from feature/static-tokenisation
