Static Tokenizer #169
|
I changed the tokenizer to skip some default tokens. This brings down the input length to a maximum of 1,196 tokens (for a 1,674 character SMILES, CHEBI:156639). Training with this tokenizer is successful on ChEBI:
* Not comparable, different loss function
Overall, the static tokenization run took a bit longer, but got slightly better results (which may not be significant). |
|
@copilot we have unit tests for the ChemDataReader in tests/unit/readers/testChemDataReader.py. We are missing similar tests for the StaticSMILESReader. Please add similar tests ensuring that the StaticSMILESReader always produces the same tokens for the same input, and that it decodes tokens to the original SMILES correctly. |
|
Ok, since Copilot doesn't want to, I asked Claude:
Determinism tests (3):
Static vocabulary tests (2):
Decode roundtrip tests (5):
Invalid input (1):
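For readers who want a feel for what these four test categories check, here is a minimal sketch. It does not use the real StaticSMILESReader, whose API is not shown in this thread. ToyStaticReader, its encode/decode methods, and its character-level vocabulary are all hypothetical stand-ins.

```python
import unittest


class ToyStaticReader:
    """Hypothetical stand-in for a static tokenizer: fixed character-level vocabulary."""

    VOCAB = ("C", "N", "O", "c", "n", "o", "(", ")", "=", "#", "1", "2")

    def __init__(self):
        # The index is derived from the fixed tuple and never extended.
        self._index = {tok: i for i, tok in enumerate(self.VOCAB)}

    def encode(self, smiles):
        try:
            return [self._index[ch] for ch in smiles]
        except KeyError as err:
            raise ValueError(f"unknown symbol {err.args[0]!r}") from err

    def decode(self, ids):
        return "".join(self.VOCAB[i] for i in ids)


class ToyReaderTests(unittest.TestCase):
    def setUp(self):
        self.reader = ToyStaticReader()

    def test_same_input_gives_same_tokens(self):
        # Determinism: repeated calls and fresh instances must agree.
        first = self.reader.encode("c1ccccc1")
        self.assertEqual(first, self.reader.encode("c1ccccc1"))
        self.assertEqual(first, ToyStaticReader().encode("c1ccccc1"))

    def test_vocabulary_does_not_grow(self):
        # Static vocabulary: encoding must not add tokens.
        before = tuple(self.reader.VOCAB)
        self.reader.encode("CCO")
        self.assertEqual(tuple(self.reader.VOCAB), before)

    def test_decode_roundtrip(self):
        for smiles in ("CCO", "c1ccccc1", "CC(=O)O", "N#N"):
            with self.subTest(smiles=smiles):
                ids = self.reader.encode(smiles)
                self.assertEqual(self.reader.decode(ids), smiles)

    def test_invalid_input_raises(self):
        with self.assertRaises(ValueError):
            self.reader.encode("C?C")
```

The toy roundtrips exactly because its encoding is injective. For the actual reader, inputs such as [1*] would have to be left out of exact-roundtrip assertions, since the PR description notes that they decode to *.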
|
This implements a new static tokenizer for SMILES (see #166).
The new tokenizer only needs 572 tokens, which it achieves by splitting each atom into 5 tokens (element, charge, isotope, stereochemistry, and hydrogen count). As demonstrated in #166, all SMILES strings in ChEBI and PubChem can be parsed with this tokenizer.
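As a rough illustration of the five-way split, the sketch below breaks one bracket atom into element, charge, isotope, stereochemistry, and hydrogen-count tokens. It is not the PR's implementation. The regex, the function name split_bracket_atom, and the token labels are assumptions made for this example, and only the simple @ / @@ chirality marks are handled.

```python
import re

# Bracket-atom layout per OpenSMILES: [isotope symbol chirality hcount charge]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?|[a-z]{1,2}|\*)"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]\d+|\++|-+)?\]"
)


def split_bracket_atom(atom: str) -> list:
    """Return five tokens (hypothetical labels) for one bracket atom."""
    m = BRACKET_ATOM.fullmatch(atom)
    if m is None:
        raise ValueError(f"not a supported bracket atom: {atom!r}")

    # Hydrogen count: absent -> 0, bare "H" -> 1, "H4" -> 4.
    h = m["hcount"]
    hcount = 0 if h is None else (1 if h == "H" else int(h[1:]))

    # Charge: absent -> 0, "+2" -> 2, "--" -> -2, "-" -> -1.
    c = m["charge"]
    if c is None:
        charge = 0
    else:
        sign = 1 if c[0] == "+" else -1
        magnitude = int(c[1:]) if c[1:].isdigit() else len(c)
        charge = sign * magnitude

    return [
        f"element:{m['element']}",
        f"charge:{charge}",
        f"isotope:{m['isotope'] or 0}",
        f"chiral:{m['chiral'] or 'none'}",
        f"hcount:{hcount}",
    ]


print(split_bracket_atom("[13C@@H]"))
print(split_bracket_atom("[NH4+]"))
```

The point of such a split is that the vocabulary is the sum of five small, fixed sets, so it does not need one entry per distinct bracket atom seen in the data.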
Also, the implementation includes a decoder that reconstructs SMILES strings as far as possible. Some SMILES cannot be reconstructed perfectly since the encoding is not injective; e.g. [1*] and [2*] both get resolved to *.
Todo