SMILES enumeration with exact next-token decoding.
grimace is a Rust-first RDKit add-on for exact rooted SMILES support and
online next-token decoding. It provides:
- exact support enumeration for a molecule under RDKit-style writer flags
- exact token inventories implied by that support
- legal next-token choices from a current SMILES prefix
By "support" we mean the full set of reachable rooted SMILES strings for the chosen writer flags. A "rooted SMILES" here is a SMILES string generated with a fixed starting atom for a connected molecule, or with one rooted fragment/local root inside the preserved fragment order for a disconnected molecule.
Today, that public runtime is intentionally narrow: exact support and decoding
for RDKit's canonical=False, doRandom=True writer regime under the current
stable writer convention.
There are two separate correctness ideas in this project:
- principled SMILES/chemistry semantics: emitted strings should be valid and parse back to the intended graph and stereo assignment
- RDKit writer parity: emitted strings should match RDKit's actual writer support for the supported regime
grimace targets the current stable RDKit writer convention, currently
RDKit 2026.03.1. Older slash/backslash serialization conventions are out of
scope. The dependency floor is rdkit>=2026.3, but exact output parity is
only validated against that current stable writer convention; newer RDKit
releases may still require fixture or expectation updates.
The package metadata declares Python >=3.11 and rdkit>=2026.3. The
currently exercised CI and release matrix is narrower and documented below.
GRIMACE stands for "graph representation integrating multiple alternate chemical equivalents", motivated by research on NMR spectroscopy with language transformers (link).
Warning
grimace is still evolving. The supported public API is usable for the
documented runtime subset, but feature coverage is still limited and some
public details may continue to change between releases.
Important
grimace is distributed under PolyForm-Noncommercial-1.0.0. Commercial
use is not permitted under the current license.
The only supported public Python import name is grimace.
Caution
Install the PyPI distribution named grimace-py; import the package as
grimace. Plain pip install grimace installs an unrelated older package,
not this library.
python -m pip install grimace-pyMain entrypoints:
MolToSmilesEnum(...)Returns the exact SMILES support as an iterator of finished strings.MolToSmilesDecoder(...)Returns an online branch-preserving decoder state.MolToSmilesDeterminizedDecoder(...)Returns an online decoder that merges same-text next choices.MolToSmilesDeviation(...)Reports the first place where a candidate string or token sequence leaves the molecule's supported SMILES language.MolToSmilesTokenInventory(...)Returns the exact set of tokens that can appear in one decoder step.MolToSmilesTokenInventorySuperset(...)Returns a static conservative token inventory for vocabulary coverage.
Supporting public type:
MolToSmilesChoiceEach choice has.textfor the emitted token and.next_statefor the decoder state after taking that token.SmilesDeviationDiagnostic result returned byMolToSmilesDeviation(...).
The public API uses the compiled Rust extension end to end.
The public signatures mirror RDKit flag names and defaults, but the current runtime intentionally supports only a strict subset.
Caution
The signatures preserve RDKit-like defaults for surface compatibility, but
those defaults are not currently supported. A naive
grimace.MolToSmilesEnum(mol) call raises NotImplementedError; pass
canonical=False and doRandom=True explicitly.
Today, pass:
canonical=FalsedoRandom=True- omit
rootedAtAtomor passrootedAtAtom=-1for all-roots behavior - pass
rootedAtAtom >= 0for one explicit root - other negative integer
rootedAtAtomvalues are also accepted for RDKit compatibility and behave like-1, but-1is the preferred public spelling rootedAtAtom=Noneis not supported; omit the argument or use-1instead
Unsupported flag combinations fail fast with NotImplementedError. Other
invalid public inputs can still raise more specific exceptions such as
IndexError or ValueError.
The most important rootedAtAtom semantics are:
rootedAtAtom=<idx>uses one explicit starting atom for connected molecules.rootedAtAtom=-1forMolToSmilesEnum(...)returns the exact support unioned across all root atoms.rootedAtAtom=-1for the decoder classes starts from one merged all-roots decoder state.rootedAtAtom=-1forMolToSmilesTokenInventory(...)andMolToSmilesTokenInventorySuperset(...)returns the token inventory unioned across all root atoms.- omitting
rootedAtAtommeans the same thing as passing-1. - other negative integer
rootedAtAtomvalues also behave like-1, to stay close to RDKit's public binding behavior. rootedAtAtom=Noneis rejected across the public API, matching RDKit's binding behavior.- for disconnected molecules, fragment order is preserved; a nonnegative
rootedAtAtomselects the rooted fragment and its local root atom within that fixed fragment order, but non-rooted fragments can still vary internally.
All examples below use the current supported runtime subset:
FLAGS = dict(
canonical=False,
doRandom=True,
)If you want the exact support across all possible roots, rootedAtAtom=-1 is
the simplest public entrypoint:
from rdkit import Chem
import grimace
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
all_smiles = tuple(
grimace.MolToSmilesEnum(
mol,
rootedAtAtom=-1,
isomericSmiles=False,
**FLAGS,
)
)
assert len(all_smiles) == 304If instead you want the exact support from one specific root atom, pass that root explicitly:
root_0_smiles = tuple(
grimace.MolToSmilesEnum(
mol,
rootedAtAtom=0,
isomericSmiles=False,
**FLAGS,
)
)MolToSmilesDecoder(...) is branch-preserving. It exposes the exact legal next
choices for the current prefix, and each choice points to a successor state.
decoder = grimace.MolToSmilesDecoder(
mol,
rootedAtAtom=0,
isomericSmiles=False,
**FLAGS,
)
for _ in range(7):
prefix = decoder.prefix if decoder.prefix else '""'
print(f"{prefix} -> {[choice.text for choice in decoder.next_choices]}")
decoder = decoder.next_choices[0].next_stateEarly output on aspirin looks like:
"" -> ['C']
C -> ['C']
CC -> ['(', '(']
CC( -> ['=']
CC(= -> ['O']
CC(=O -> [')']
CC(=O) -> ['O']
Notice the duplicate "(" at CC. Those are different branches with the same
emitted token text. That is deliberate: MolToSmilesDecoder(...) preserves
branch identity instead of merging it away.
MolToSmilesDeterminizedDecoder(...) exposes at most one choice per token text
by merging same-text continuations into one combined state.
For example, the merged all-roots decoder can trace one exact route to
c1(ccccc1OC(=O)C)C(O)=O for aspirin:
route = [
"c", "1", "(", "c", "c", "c", "c", "c", "1",
"O", "C", "(", "=", "O", ")", "C", ")",
"C", "(", "O", ")", "=", "O",
]
decoder = grimace.MolToSmilesDeterminizedDecoder(
mol,
rootedAtAtom=-1,
isomericSmiles=False,
**FLAGS,
)
for token in route:
choices = {choice.text: choice.next_state for choice in decoder.next_choices}
decoder = choices[token]
assert decoder.is_terminal
assert decoder.prefix == "".join(route)The first few merged decisions on that route are:
"": choose"c"from["C", "O", "c"]"c1": choose"("from["(", "c"]"c1(": choose"c"from["O", "c", "C"]"c1(ccccc1": choose"O"from["C", "O"]
MolToSmilesDeviation(...) returns None for an accepted candidate, otherwise
it reports the first mismatch and the legal next Grimace token texts.
small = Chem.MolFromSmiles("CCO")
kwargs = dict(rootedAtAtom=-1, isomericSmiles=False, **FLAGS)
assert grimace.MolToSmilesDeviation(small, "CCO", **kwargs) is None
deviation = grimace.MolToSmilesDeviation(small, "CCN", **kwargs)
assert deviation.accepted_text == "CC"
assert deviation.rejected_text == "N"
assert deviation.legal_next_tokens == ("O",)String candidates are matched as text. Sequence candidates are atomic external tokens, so boundaries matter:
grimace.MolToSmilesDeviation(small, "CCl", **kwargs).accepted_text
# 'CC'
grimace.MolToSmilesDeviation(small, ("C", "Cl"), **kwargs).accepted_text
# 'C'MolToSmilesTokenInventory(...) answers a different question: not "what full
strings are possible?" but "what one-step tokens can ever appear?"
inventory = grimace.MolToSmilesTokenInventory(
mol,
rootedAtAtom=-1,
isomericSmiles=False,
**FLAGS,
)
assert "C" in inventory
assert "(" in inventory
assert "c" in inventoryThe result is a sorted tuple of distinct tokens.
For fast dataset vocabulary coverage, use the static inventory:
vocab_tokens = set()
for mol in mols:
vocab_tokens.update(
grimace.MolToSmilesTokenInventorySuperset(
mol,
rootedAtAtom=-1,
isomericSmiles=True,
**FLAGS,
)
)For the same molecule and flags, the exact inventory is contained in the superset inventory.
A token is one string emitted by one decoder transition. Tokens are defined by the walker itself, not by splitting a finished SMILES into characters and not by integer token IDs. They come from two places:
- the prepared graph's RDKit-style atom and bond tokens, such as
C,c,Cl,[C@H],=,/, or\\ - SMILES syntax literals inserted by the walker, such as
(,),1, or%10
So a token is exactly one appendable SMILES fragment for the current state. It may be one character or several.
Install the PyPI distribution named grimace-py:
python -m pip install grimace-pyThen import it as grimace:
import grimacePlain pip install grimace installs an unrelated older project with the same
name, not this library.
PyPI and GitHub release assets currently publish Linux x86_64 wheels for
CPython 3.12 and 3.13. Other environments may require a source build and
are not covered by the release wheels.
Current continuously exercised matrix:
- Linux source-tree tests on CPython
3.12 - Linux wheel build and smoke tests on CPython
3.12and3.13 - source distribution build, metadata validation, and installed-artifact smoke tests
Other Python versions and non-Linux platforms are expected source-build paths,
not part of the current release asset or CI matrix.
Python 3.11 is in that source-build category today: declared, but not part of
the current CI matrix.
GitHub release wheels are also available:
| System | 3.12 | 3.13 |
|---|---|---|
| Linux x86_64 | wheel | wheel |
The built package depends on rdkit>=2026.3.
For local development or a source build, you need:
- a Rust toolchain with
rustc >= 1.83 maturin
Then:
python -m venv .venv
. .venv/bin/activate
python -m pip install maturin
maturin develop --releaseThe opt-in timing benchmark generates two artifacts:
- docs/timings.tsv: raw measured summary data
- docs/timings.md: rendered table and column descriptions
The table reports both decoder variants:
- branch-preserving exhaustive traversal via
MolToSmilesDecoder(...) - determinized exhaustive traversal via
MolToSmilesDeterminizedDecoder(...)
Current takeaway from the generated table:
- the table does not time the direct public
MolToSmilesEnum(..., rootedAtAtom=-1)path - the published
Grimace enumcolumn is the more conservative exact baseline: explicit union over per-rootMolToSmilesEnum(..., rootedAtAtom=root_idx)calls - some merged decoder rows are numerically lower than that per-root union column, so this table does not prove a universal exact-method ranking
MolToSmilesDeterminizedDecoder(...)can reduce exhaustive decoder cost on some molecules- the table is still a small curated benchmark: 9 molecules, 2 writer modes, 7 timing repeats, and one development machine
- this is not a workload study and not an exact-versus-exact comparison
- the
Grimace enumrow times explicit union over per-rootMolToSmilesEnum(..., rootedAtAtom=root_idx)calls, not the direct publicMolToSmilesEnum(..., rootedAtAtom=-1)path - the RDKit columns are not exact enumeration; they are random sampling until
RDKit happens to reach
1/2or full support - because of that, RDKit can be cheaper when you only want a few random strings, especially on small cases
- but on the larger molecules in this table, Grimace exact methods were usually much faster than this RDKit sampling-to-coverage baseline when guaranteed full support was the goal
Regenerate it with:
RUN_PERF_TESTS=1 PYTHONPATH=python:. .venv/bin/python -m unittest tests.perf.test_readme_timings -qThe public API keeps RDKit MolToSmiles flag names, but it does not aim for
full RDKit writer-surface parity yet.
For terminology and test policy, see Correctness contracts. In short: RDKit writer-matching behavior is intentionally separate from the principled SMILES/chemistry semantics layer. RDKit-specific traversal and directional-bond placement rules belong to the writer-parity layer, not the generic semantic layer.
Known stereo writer-parity work in progress:
minimal_nonstereo_double_hazard(C/N=C1C=C/C(=N/C)[N-]/1) currently has the same direction-erased skeleton support as RDKit, but some slash/backslash markers are placed in different writer positions.reduced_porphyrin_traversal_couplingcurrently exercises a larger traversal-coupling gap where RDKit's writer policy narrows the traversal assignment support beyond the semantic carrier choices.- A separate RDKit-known quirk,
dative_carbonyl_stereo_annotation_drops_on_smiles_roundtrip, is tracked as observed RDKit behavior rather than a generic SMILES semantics rule.
This work is active on the stereo-constraint-model branch and is not yet in
the released mainline runtime. The goal is exact RDKit writer support for the
documented runtime subset, without mixing RDKit-specific spelling quirks into
the generic semantic layer.
Current public runtime contract:
canonical=FalsedoRandom=True- omit
rootedAtAtomor passrootedAtAtom=-1for all-roots behavior - pass
rootedAtAtom >= 0for one explicit root - other negative integer
rootedAtAtomvalues are also accepted for RDKit compatibility, but-1is the preferred public spelling rootedAtAtom=Noneis not supported; omit the argument or use-1
Supported writer flags today:
isomericSmileskekuleSmilesallBondsExplicitallHsExplicitignoreAtomMapNumbers
Anything outside that runtime subset fails fast. Unsupported flag combinations
raise NotImplementedError. Other invalid public inputs can still raise more
specific exceptions such as IndexError or ValueError.
Disconnected molecules are supported by the public APIs.
MolToSmilesEnum(...),MolToSmilesDecoder(...), andMolToSmilesDeterminizedDecoder(...)compose fragment-wise behavior directly.MolToSmilesTokenInventory(...)returns the union of fragment inventories plus the"."separator token.MolToSmilesTokenInventorySuperset(...)returns the corresponding static fragment inventory union plus the"."separator token.
grimace is source-available under PolyForm Noncommercial 1.0.0.
Third-party components remain under their own licenses; see
THIRD_PARTY_NOTICES.md.
Commercial use requires a separate commercial license from the author.
The software is provided as is, without warranty or liability, to the extent
allowed by law.