Static Tokenizer #169

Merged: sfluegel05 merged 10 commits into dev from feature/static-tokenisation on May 27, 2026
@sfluegel05 (Collaborator) commented May 6, 2026

This implements a new static tokenizer for SMILES (see #166).

The new tokenizer needs only 572 tokens, which it achieves by splitting each atom into 5 tokens (element, charge, isotope, stereochemistry, and hydrogen count). As demonstrated in #166, all SMILES strings in ChEBI and PubChem can be parsed with this tokenizer.
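To illustrate the five-way split, here is a minimal sketch that breaks a single bracket atom into element, charge, isotope, stereochemistry, and hydrogen-count tokens. It is not the code from this PR: the regular expression follows the usual OpenSMILES bracket-atom layout, and the function name `split_bracket_atom` and the `field:value` token spelling are assumptions made for this example.

```python
import re

# Bracket atom layout: [isotope? symbol chirality? hcount? charge? class?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?|[a-z][a-z]?|\*)"
    r"(?P<stereo>@@?)?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>\+\d*|-\d*)?"
    r"(?::\d+)?\]"
)


def split_bracket_atom(atom: str) -> list[str]:
    """Split one bracket atom into five tokens (hypothetical token format)."""
    m = BRACKET_ATOM.fullmatch(atom)
    if m is None:
        raise ValueError(f"not a bracket atom: {atom!r}")
    isotope = int(m["isotope"]) if m["isotope"] else 0
    h = m["hcount"]
    hcount = 0 if h is None else int(h[1:] or 1)  # "H" alone means one hydrogen
    c = m["charge"]
    charge = 0 if c is None else int(c[0] + (c[1:] or "1"))  # "+" alone means +1
    # Token order follows the order given in the PR description.
    return [
        f"element:{m['element']}",
        f"charge:{charge:+d}",
        f"isotope:{isotope}",
        f"stereo:{m['stereo'] or 'none'}",
        f"hcount:{hcount}",
    ]


print(split_bracket_atom("[13C@H]"))
print(split_bracket_atom("[NH4+]"))
```

Because every field draws from a small closed set of values, the vocabulary stays fixed no matter which combinations occur in the data.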

Also, the implementation includes a decoder that reconstructs SMILES strings as far as possible. Some SMILES cannot be reconstructed perfectly because the encoding is not injective: for example, [1*] and [2*] are both decoded to *.
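One way to find inputs that cannot round-trip is to group them by their encoded token sequence. The sketch below does that with a toy encoder that mimics only the wildcard case mentioned above. `toy_encode` and `find_collisions` are hypothetical helpers for illustration, not part of the PR.

```python
from collections import defaultdict


def find_collisions(encode, inputs):
    """Return groups of inputs that share the same token sequence."""
    groups = defaultdict(list)
    for text in inputs:
        groups[tuple(encode(text))].append(text)
    return [group for group in groups.values() if len(group) > 1]


def toy_encode(atom):
    # Stand-in encoder (assumption): drops the isotope label on wildcard
    # atoms, so [1*] and [2*] both become the single token "*".
    if atom.startswith("[") and atom.endswith("*]") and atom[1:-2].isdigit():
        return ["*"]
    return [atom]


print(find_collisions(toy_encode, ["[1*]", "[2*]", "*", "[13C]"]))
```

Any group reported this way can be decoded to at most one of its members, so a round-trip test should expect the decoded form rather than the original string for those inputs.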

Todo

  • check the implications of longer inputs for the ELECTRA model
  • make this reader the default for PubChem and ChEBI classes

@sfluegel05 linked an issue on May 6, 2026 that may be closed by this pull request

@sfluegel05 (Collaborator, Author) commented

I changed the tokenizer to skip some default tokens. This brings the maximum input length down to 1,196 tokens (for a 1,674-character SMILES, CHEBI:156639).
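The idea of skipping default tokens can be sketched as follows: tokens that carry a default value are dropped when encoding, and the decoder fills them back in. Which values count as defaults in the PR is not stated here, so the `DEFAULTS` table and the `field:value` token spelling below are assumptions for illustration only.

```python
# Assumed defaults (hypothetical): neutral charge, no isotope label,
# no stereo mark, no explicit hydrogens.
DEFAULTS = {"charge": "+0", "isotope": "0", "stereo": "none", "hcount": "0"}
FIELD_ORDER = ("element", "charge", "isotope", "stereo", "hcount")


def strip_defaults(tokens):
    """Drop every token whose value equals the default for its field."""
    kept = []
    for tok in tokens:
        field, _, value = tok.partition(":")
        if DEFAULTS.get(field) != value:
            kept.append(tok)
    return kept


def restore_defaults(tokens):
    """Rebuild the full five-token form; an element token starts a new atom."""
    atoms = []
    for tok in tokens:
        field, _, value = tok.partition(":")
        if field == "element":
            atoms.append({"element": value, **DEFAULTS})
        else:
            atoms[-1][field] = value
    return [f"{field}:{atom[field]}" for atom in atoms for field in FIELD_ORDER]


full = [
    "element:C", "charge:+0", "isotope:13", "stereo:none", "hcount:0",
    "element:N", "charge:+1", "isotope:0", "stereo:none", "hcount:4",
]
short = strip_defaults(full)
print(len(full), "->", len(short))
```

Since most atoms in typical molecules are uncharged and unlabeled, dropping defaults shortens the sequence without losing information, as long as the element token unambiguously marks the start of each atom.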

Training with this tokenizer is successful on ChEBI:

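| Run | Tokenisation | Time on A100 | Micro-F1 | Macro-F1 |
|-----|--------------|--------------|----------|----------|
| 1   | From list    | 15h 36m      | *        | *        |
| 2   | From list    | **           | 0.89131  | 0.6193   |
| 3   | Static       | 23h 0m       | 0.89557  | 0.63011  |

* Not comparable, different loss function
** Not comparable, different GPU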

Overall, the static tokenization run took longer (23h 0m vs. 15h 36m for run 1) but got slightly better results than run 2 (Micro-F1 +0.004, Macro-F1 +0.011), a difference that may not be significant.

@sfluegel05 (Collaborator, Author) commented

@copilot we have unit tests for the ChemDataReader in tests/unit/readers/testChemDataReader.py. We are missing similar tests for the StaticSMILESReader. Please add similar tests ensuring that the StaticSMILESReader always produces the same tokens for the same input, and that it decodes tokens to the original SMILES correctly.

@sfluegel05 (Collaborator, Author) commented

Ok, since Copilot doesn't want to, I asked Claude:

Determinism tests (3):

  • test_same_input_produces_same_tokens: encoding the same SMILES with two different reader instances gives identical output
  • test_non_canonical_input_matches_canonical — different representations of aspirin produce the same tokens (since RDKit canonicalizes before encoding)

Static vocabulary tests (2):

  • test_vocabulary_does_not_grow — encoding multiple SMILES leaves vocab size unchanged (contrast with ChemDataReader which appends new tokens)
  • test_unknown_tokens_use_unknown_idx — tokens outside the vocabulary map to UNKNOWN_TOKEN_IDX=3 rather than being added
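The contrast between a growing and a static vocabulary can be sketched like this. Only the value `UNKNOWN_TOKEN_IDX = 3` comes from the description above. The class names, the `lookup` method, and the assumption that indices below 4 are reserved for special tokens are hypothetical.

```python
UNKNOWN_TOKEN_IDX = 3  # value quoted in the test description


class GrowingVocab:
    """Appends unseen tokens, so indices depend on the data seen so far."""

    def __init__(self, tokens):
        self.index = {tok: i for i, tok in enumerate(tokens)}

    def lookup(self, token):
        if token not in self.index:
            self.index[token] = len(self.index)
        return self.index[token]


class StaticVocab:
    """Fixed token table; unseen tokens map to UNKNOWN_TOKEN_IDX."""

    def __init__(self, tokens, offset=4):  # offset is an assumption
        self.index = {tok: i + offset for i, tok in enumerate(tokens)}

    def lookup(self, token):
        return self.index.get(token, UNKNOWN_TOKEN_IDX)


growing = GrowingVocab(["C", "N"])
static = StaticVocab(["C", "N"])
growing.lookup("Xx")
static.lookup("Xx")
print(len(growing.index), len(static.index))
```

With the static table, token indices are the same across datasets and runs, which is what makes a "vocabulary does not grow" test meaningful.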

Decode roundtrip tests (5):

  • Simple organic molecule (aspirin)
  • Stereochemistry ([C@H])
  • Isotope label ([13C])
  • Charged atom ([NH4+])
  • Bracketed aromatic atom ([nH] in benzimidazole)

Invalid input (1):

  • test_invalid_smiles_returns_none — non-parseable strings return None
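The shape of such tests can be sketched with `unittest` against a stand-in reader. `ToyReader` below is not the StaticSMILESReader: its `encode`/`decode` method names, its character-level vocabulary, and its rule that any string with an unknown character is invalid are all simplifying assumptions, chosen so the example runs without RDKit.

```python
import unittest

# Toy fixed vocabulary (assumption): one token per character.
VOCAB = {"C": 4, "O": 5, "(": 6, ")": 7, "=": 8}
INVERSE = {idx: tok for tok, idx in VOCAB.items()}


class ToyReader:
    """Stand-in reader with a hypothetical encode/decode interface."""

    def encode(self, smiles):
        # Toy notion of "not parseable": empty or contains an unknown character.
        if not smiles or any(ch not in VOCAB for ch in smiles):
            return None
        return [VOCAB[ch] for ch in smiles]

    def decode(self, token_ids):
        return "".join(INVERSE[i] for i in token_ids)


class ReaderContractTests(unittest.TestCase):
    reader_cls = ToyReader  # swap in the real reader class here

    def test_same_input_produces_same_tokens(self):
        first = self.reader_cls().encode("CC(=O)O")
        second = self.reader_cls().encode("CC(=O)O")
        self.assertEqual(first, second)

    def test_decode_roundtrip(self):
        reader = self.reader_cls()
        for smiles in ["CC(=O)O", "C=C"]:
            with self.subTest(smiles=smiles):
                self.assertEqual(reader.decode(reader.encode(smiles)), smiles)

    def test_invalid_smiles_returns_none(self):
        self.assertIsNone(self.reader_cls().encode("not a smiles"))
```

Keeping the reader class in a single attribute makes it straightforward to run the same contract against another reader, provided it exposes a compatible interface.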

@sfluegel05 marked this pull request as ready for review on May 27, 2026, 17:09
@sfluegel05 merged commit 122e45c into dev on May 27, 2026
5 checks passed
@sfluegel05 deleted the feature/static-tokenisation branch on May 27, 2026, 17:09
Linked issue: New tokenisation