GRIMACE

SMILES enumeration with exact next-token decoding.

grimace is a Rust-first RDKit add-on for exact rooted SMILES support and online next-token decoding. It provides:

exact support enumeration for a molecule under RDKit-style writer flags
exact token inventories implied by that support
legal next-token choices from a current SMILES prefix

By "support" we mean the full set of reachable rooted SMILES strings for the chosen writer flags. A "rooted SMILES" here is a SMILES string generated with a fixed starting atom for a connected molecule, or with one rooted fragment/local root inside the preserved fragment order for a disconnected molecule.

Today, that public runtime is intentionally narrow: exact support and decoding for RDKit's canonical=False, doRandom=True writer regime under the current stable writer convention.

There are two separate correctness ideas in this project:

principled SMILES/chemistry semantics: emitted strings should be valid and parse back to the intended graph and stereo assignment
RDKit writer parity: emitted strings should match RDKit's actual writer support for the supported regime

grimace targets the current stable RDKit writer convention, currently RDKit 2026.03.1. Older slash/backslash serialization conventions are out of scope. The dependency floor is rdkit>=2026.3, but exact output parity is only validated against that current stable writer convention; newer RDKit releases may still require fixture or expectation updates.

The package metadata declares Python >=3.11 and rdkit>=2026.3. The currently exercised CI and release matrix is narrower and documented below.

GRIMACE stands for "graph representation integrating multiple alternate chemical equivalents", motivated by research on NMR spectroscopy with language transformers (link).

Warning

grimace is still evolving. The supported public API is usable for the documented runtime subset, but feature coverage is still limited and some public details may continue to change between releases.

Important

grimace is distributed under PolyForm-Noncommercial-1.0.0. Commercial use is not permitted under the current license.

Choose the API

The only supported public Python import name is grimace.

Caution

Install the PyPI distribution named grimace-py; import the package as grimace. Plain pip install grimace installs an unrelated older package, not this library.

python -m pip install grimace-py

Main entrypoints:

MolToSmilesEnum(...) Returns the exact SMILES support as an iterator of finished strings.
MolToSmilesDecoder(...) Returns an online branch-preserving decoder state.
MolToSmilesDeterminizedDecoder(...) Returns an online decoder that merges same-text next choices.
MolToSmilesDeviation(...) Reports the first place where a candidate string or token sequence leaves the molecule's supported SMILES language.
MolToSmilesTokenInventory(...) Returns the exact set of tokens that can appear in one decoder step.
MolToSmilesTokenInventorySuperset(...) Returns a static conservative token inventory for vocabulary coverage.

Supporting public types:

MolToSmilesChoice Each choice has .text for the emitted token and .next_state for the decoder state after taking that token.
SmilesDeviation Diagnostic result returned by MolToSmilesDeviation(...).

The public API uses the compiled Rust extension end to end.

Important runtime requirements today

The public signatures mirror RDKit flag names and defaults, but the current runtime intentionally supports only a strict subset.

Caution

The signatures preserve RDKit-like defaults for surface compatibility, but those defaults are not currently supported. A naive grimace.MolToSmilesEnum(mol) call raises NotImplementedError; pass canonical=False and doRandom=True explicitly.

Today, pass:

canonical=False
doRandom=True
omit rootedAtAtom or pass rootedAtAtom=-1 for all-roots behavior
pass rootedAtAtom >= 0 for one explicit root
other negative integer rootedAtAtom values are also accepted for RDKit compatibility and behave like -1, but -1 is the preferred public spelling
rootedAtAtom=None is not supported; omit the argument or use -1 instead

Unsupported flag combinations fail fast with NotImplementedError. Other invalid public inputs can still raise more specific exceptions such as IndexError or ValueError.

The most important rootedAtAtom semantics are:

rootedAtAtom=<idx> uses one explicit starting atom for connected molecules.
rootedAtAtom=-1 for MolToSmilesEnum(...) returns the exact support unioned across all root atoms.
rootedAtAtom=-1 for the decoder classes starts from one merged all-roots decoder state.
rootedAtAtom=-1 for MolToSmilesTokenInventory(...) and MolToSmilesTokenInventorySuperset(...) returns the token inventory unioned across all root atoms.
omitting rootedAtAtom means the same thing as passing -1.
other negative integer rootedAtAtom values also behave like -1, to stay close to RDKit's public binding behavior.
rootedAtAtom=None is rejected across the public API, matching RDKit's binding behavior.
for disconnected molecules, fragment order is preserved; a nonnegative rootedAtAtom selects the rooted fragment and its local root atom within that fixed fragment order, but non-rooted fragments can still vary internally.
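The rootedAtAtom rules above can be summarized as a small normalization step. The sketch below is a hypothetical helper, not part of grimace: the name normalize_root, the num_atoms parameter, and the specific exception types (TypeError for None, IndexError for an out-of-range index) are assumptions chosen for illustration. The README only states that None is rejected and that invalid inputs can raise exceptions such as IndexError or ValueError.

```python
def normalize_root(rooted_at_atom, num_atoms):
    """Hypothetical helper mirroring the documented rootedAtAtom rules.

    Returns -1 for all-roots behavior, or the explicit root index.
    The exception types used here are assumptions, not grimace's contract.
    """
    if rooted_at_atom is None:
        # Documented: None is not supported; omit the argument or pass -1.
        raise TypeError("rootedAtAtom=None is not supported; omit it or pass -1")
    if isinstance(rooted_at_atom, bool) or not isinstance(rooted_at_atom, int):
        raise TypeError("rootedAtAtom must be an integer")
    if rooted_at_atom < 0:
        # Documented: every negative integer behaves like -1 (all roots).
        return -1
    if rooted_at_atom >= num_atoms:
        raise IndexError("rootedAtAtom is out of range for this molecule")
    return rooted_at_atom


print(normalize_root(-1, 3), normalize_root(-7, 3), normalize_root(2, 3))
```

Collapsing every negative value to -1 keeps a single spelling for all-roots behavior in calling code, which matches the README's stated preference for -1.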

Quickstart

All examples below use the current supported runtime subset:

FLAGS = dict(
    canonical=False,
    doRandom=True,
)

1. Enumerate the exact support

If you want the exact support across all possible roots, rootedAtAtom=-1 is the simplest public entrypoint:

from rdkit import Chem
import grimace

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

all_smiles = tuple(
    grimace.MolToSmilesEnum(
        mol,
        rootedAtAtom=-1,
        isomericSmiles=False,
        **FLAGS,
    )
)

assert len(all_smiles) == 304

If instead you want the exact support from one specific root atom, pass that root explicitly:

root_0_smiles = tuple(
    grimace.MolToSmilesEnum(
        mol,
        rootedAtAtom=0,
        isomericSmiles=False,
        **FLAGS,
    )
)
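Since the all-roots support is described as the union across root atoms, one way to picture the relationship between the two calls is an explicit per-root union. The sketch below takes the enumerator as a callable so it runs without grimace installed. The helper name union_over_roots and the stand-in data are hypothetical; the per-root strings for a three-atom chain were written by hand for illustration and were not produced by grimace.

```python
from typing import Callable, Iterable


def union_over_roots(
    enum_for_root: Callable[[int], Iterable[str]],
    num_atoms: int,
) -> frozenset[str]:
    """Union the rooted supports for root indices 0..num_atoms-1."""
    support = set()
    for root_idx in range(num_atoms):
        support.update(enum_for_root(root_idx))
    return frozenset(support)


# Hand-written stand-in for per-root supports of an ethanol-like chain.
# With grimace, the callable would instead wrap something like:
#   lambda i: grimace.MolToSmilesEnum(mol, rootedAtAtom=i,
#                                     isomericSmiles=False, **FLAGS)
TOY_PER_ROOT = {
    0: ["CCO"],
    1: ["C(C)O", "C(O)C"],
    2: ["OCC"],
}

toy_union = union_over_roots(lambda i: TOY_PER_ROOT[i], num_atoms=3)
print(sorted(toy_union))
```

Comparing such a union against the rootedAtAtom=-1 result is a cheap consistency check when experimenting with a new molecule.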

2. Decode online, one token at a time

MolToSmilesDecoder(...) is branch-preserving. It exposes the exact legal next choices for the current prefix, and each choice points to a successor state.

decoder = grimace.MolToSmilesDecoder(
    mol,
    rootedAtAtom=0,
    isomericSmiles=False,
    **FLAGS,
)

for _ in range(7):
    prefix = decoder.prefix if decoder.prefix else '""'
    print(f"{prefix} -> {[choice.text for choice in decoder.next_choices]}")
    decoder = decoder.next_choices[0].next_state

Early output on aspirin looks like:

"" -> ['C']
C -> ['C']
CC -> ['(', '(']
CC( -> ['=']
CC(= -> ['O']
CC(=O -> [')']
CC(=O) -> ['O']

Notice the duplicate "(" at CC. Those are different branches with the same emitted token text. That is deliberate: MolToSmilesDecoder(...) preserves branch identity instead of merging it away.

3. Merge same-text choices when you only care about token text

MolToSmilesDeterminizedDecoder(...) exposes at most one choice per token text by merging same-text continuations into one combined state.
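To make the difference between branch-preserving and merged choices concrete, here is a toy model in pure Python. It is not how grimace is implemented: a state is simply a tuple of remaining token suffixes, one per live branch, and all names are hypothetical. The toy also splits branches per finished string, which is coarser than the real walker.

```python
from collections import defaultdict


def branch_choices(state):
    """One (text, next_state) pair per live branch; texts may repeat."""
    return [(suffix[0], (suffix[1:],)) for suffix in state if suffix]


def merged_choices(state):
    """At most one entry per token text; same-text branches are combined."""
    grouped = defaultdict(list)
    for suffix in state:
        if suffix:
            grouped[suffix[0]].append(suffix[1:])
    return {text: tuple(rest) for text, rest in sorted(grouped.items())}


# Two hand-written branches that share the prefix "C(" and then diverge.
start = (
    ("C", "(", "C", ")", "O"),
    ("C", "(", "O", ")", "C"),
)

print([text for text, _ in branch_choices(start)])  # duplicate texts survive
print(list(merged_choices(start)))                  # one entry per text

# Follow the merged route "C" then "(" and look at the remaining options.
state = merged_choices(start)["C"]
state = merged_choices(state)["("]
print(sorted(merged_choices(state)))
```

In this model the merged state keeps every underlying branch alive, so no reachable string is lost; only the duplicate labels disappear.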

For example, the merged all-roots decoder can trace one exact route to c1(ccccc1OC(=O)C)C(O)=O for aspirin:

route = [
    "c", "1", "(", "c", "c", "c", "c", "c", "1",
    "O", "C", "(", "=", "O", ")", "C", ")",
    "C", "(", "O", ")", "=", "O",
]

decoder = grimace.MolToSmilesDeterminizedDecoder(
    mol,
    rootedAtAtom=-1,
    isomericSmiles=False,
    **FLAGS,
)

for token in route:
    choices = {choice.text: choice.next_state for choice in decoder.next_choices}
    decoder = choices[token]

assert decoder.is_terminal
assert decoder.prefix == "".join(route)

The first few merged decisions on that route are:

"": choose "c" from ["C", "O", "c"]
"c1": choose "(" from ["(", "c"]
"c1(": choose "c" from ["O", "c", "C"]
"c1(ccccc1": choose "O" from ["C", "O"]

4. Diagnose a candidate serialization

MolToSmilesDeviation(...) returns None for an accepted candidate; otherwise it reports the first mismatch and the legal next Grimace token texts.

small = Chem.MolFromSmiles("CCO")
kwargs = dict(rootedAtAtom=-1, isomericSmiles=False, **FLAGS)

assert grimace.MolToSmilesDeviation(small, "CCO", **kwargs) is None

deviation = grimace.MolToSmilesDeviation(small, "CCN", **kwargs)
assert deviation.accepted_text == "CC"
assert deviation.rejected_text == "N"
assert deviation.legal_next_tokens == ("O",)

String candidates are matched as text. Sequence candidates are atomic external tokens, so boundaries matter:

grimace.MolToSmilesDeviation(small, "CCl", **kwargs).accepted_text
# 'CC'

grimace.MolToSmilesDeviation(small, ("C", "Cl"), **kwargs).accepted_text
# 'C'
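The first-mismatch idea can be sketched over a small hand-written support, without grimace. Everything below is hypothetical: ToyDeviation and toy_deviation are illustration-only names, the support was written by hand, a string candidate is treated as a sequence of single characters (which only works here because every toy token is one character), and reporting an incomplete candidate with an empty rejected_text is an assumption rather than documented behavior.

```python
from typing import NamedTuple, Optional, Tuple


class ToyDeviation(NamedTuple):
    accepted_text: str
    rejected_text: str
    legal_next_tokens: Tuple[str, ...]


def toy_deviation(support, candidate) -> Optional[ToyDeviation]:
    """Return None if candidate is in support, else the first mismatch."""
    live = [tuple(seq) for seq in support]
    accepted = []
    for pos, token in enumerate(candidate):
        matching = [seq for seq in live if len(seq) > pos and seq[pos] == token]
        if not matching:
            legal = tuple(sorted({seq[pos] for seq in live if len(seq) > pos}))
            return ToyDeviation("".join(accepted), token, legal)
        live = matching
        accepted.append(token)
    if any(len(seq) == len(accepted) for seq in live):
        return None  # some supported sequence ends exactly here
    # Assumption: an incomplete candidate is reported with empty rejected_text.
    legal = tuple(sorted({seq[len(accepted)] for seq in live}))
    return ToyDeviation("".join(accepted), "", legal)


# Hand-written token sequences for a three-atom C-C-O chain.
TOY_SUPPORT = [
    ("C", "C", "O"),
    ("O", "C", "C"),
    ("C", "(", "C", ")", "O"),
    ("C", "(", "O", ")", "C"),
]

print(toy_deviation(TOY_SUPPORT, "CCO"))
print(toy_deviation(TOY_SUPPORT, "CCN"))
print(toy_deviation(TOY_SUPPORT, ("C", "Cl")))
```

The last call shows why boundaries matter: the external token "Cl" is compared as a whole, so matching stops after the first "C" instead of after "CC".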

5. Ask for the exact token inventory

MolToSmilesTokenInventory(...) answers a different question: not "what full strings are possible?" but "what one-step tokens can ever appear?"

inventory = grimace.MolToSmilesTokenInventory(
    mol,
    rootedAtAtom=-1,
    isomericSmiles=False,
    **FLAGS,
)

assert "C" in inventory
assert "(" in inventory
assert "c" in inventory

The result is a sorted tuple of distinct tokens.

For fast dataset vocabulary coverage, use the static inventory:

vocab_tokens = set()
for mol in mols:  # mols: any iterable of RDKit Mol objects from your dataset
    vocab_tokens.update(
        grimace.MolToSmilesTokenInventorySuperset(
            mol,
            rootedAtAtom=-1,
            isomericSmiles=True,
            **FLAGS,
        )
    )

For the same molecule and flags, the exact inventory is contained in the superset inventory.
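Once per-molecule inventories are collected, turning them into a model vocabulary is a small step. The sketch below is one possible approach, not something grimace provides: build_vocab, the special-token names, and the sample inventories are all hypothetical, and the inventories are hand-written tuples standing in for MolToSmilesTokenInventorySuperset(...) results.

```python
def build_vocab(inventories, specials=("<pad>", "<bos>", "<eos>")):
    """Map each distinct token to an integer id, specials first.

    `inventories` is any iterable of token collections, for example the
    tuples returned per molecule by a token-inventory call.
    """
    tokens = sorted(set().union(*inventories))
    return {tok: idx for idx, tok in enumerate((*specials, *tokens))}


# Hand-written stand-in inventories for two molecules.
toy_inventories = [
    ("(", ")", "=", "C", "O"),
    ("(", ")", "1", "C", "c"),
]

vocab = build_vocab(toy_inventories)
print(len(vocab), vocab["C"])

# Coverage check: every token of every inventory must have an id.
assert all(tok in vocab for inv in toy_inventories for tok in inv)
```

Sorting before assigning ids makes the mapping reproducible across runs, which matters when the vocabulary is saved alongside a trained model.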

What counts as a token?

A token is one string emitted by one decoder transition. Tokens are defined by the walker itself, not by splitting a finished SMILES into characters and not by integer token IDs. They come from two places:

the prepared graph's RDKit-style atom and bond tokens, such as C, c, Cl, [C@H], =, /, or \\
SMILES syntax literals inserted by the walker, such as (, ), 1, or %10

So a token is exactly one appendable SMILES fragment for the current state. It may be one character or several.
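If you need to compare decoder tokens against an external tokenizer, a regex splitter can approximate the same boundaries on finished strings. This is only a rough sketch: the walker's own tokens remain authoritative, the pattern below is a common SMILES tokenization heuristic rather than anything taken from grimace, and it only treats Cl and Br as two-letter unbracketed atoms.

```python
import re

# Order matters: bracket atoms, two-letter halogens and %NN ring closures
# must be tried before the single-character alternatives.
TOKEN_RE = re.compile(
    r"\[[^\]]+\]"            # bracket atom such as [C@H] or [N-]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # two-digit ring closure such as %10
    r"|[A-Za-z]"             # one-letter atoms, aliphatic or aromatic
    r"|\d"                   # one-digit ring closure
    r"|[=#$:/\\\-+.()~*]"    # bonds, branches, dot separator, wildcard
)


def split_tokens(smiles: str) -> list[str]:
    """Split a finished SMILES string into approximate tokens."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(split_tokens("ClC[C@H](Br)C%10"))
print(split_tokens("CC(=O)Oc1ccccc1C(=O)O"))
```

The round-trip check inside split_tokens guards against silently dropping characters the pattern does not cover.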

Installation

Install the PyPI distribution named grimace-py:

python -m pip install grimace-py

Then import it as grimace:

import grimace

Plain pip install grimace installs an unrelated older project with the same name, not this library.

PyPI and GitHub release assets currently publish Linux x86_64 wheels for CPython 3.12 and 3.13. Other environments may require a source build and are not covered by the release wheels.

Current continuously exercised matrix:

Linux source-tree tests on CPython 3.12
Linux wheel build and smoke tests on CPython 3.12 and 3.13
source distribution build, metadata validation, and installed-artifact smoke tests

Other Python versions and non-Linux platforms are expected source-build paths, not part of the current release asset or CI matrix. Python 3.11 is in that source-build category today: declared, but not part of the current CI matrix.

GitHub release wheels are also available:

System	3.12	3.13
Linux x86_64	wheel	wheel

The built package depends on rdkit>=2026.3.

For local development or a source build, you need:

a Rust toolchain with rustc >= 1.83
maturin

Then:

python -m venv .venv
. .venv/bin/activate
python -m pip install maturin
maturin develop --release

Timings

The opt-in timing benchmark generates two artifacts:

docs/timings.tsv: raw measured summary data
docs/timings.md: rendered table and column descriptions

The table reports both decoder variants:

branch-preserving exhaustive traversal via MolToSmilesDecoder(...)
determinized exhaustive traversal via MolToSmilesDeterminizedDecoder(...)

Current takeaway from the generated table:

the table is a small curated benchmark: 9 molecules, 2 writer modes, 7 timing repeats, and one development machine
it is not a workload study and not an exact-versus-exact comparison
the published Grimace enum column times an explicit union over per-root MolToSmilesEnum(..., rootedAtAtom=root_idx) calls, which is the more conservative exact baseline; the direct public MolToSmilesEnum(..., rootedAtAtom=-1) path is not timed
some merged decoder rows are numerically lower than that per-root union column, so this table does not prove a universal exact-method ranking
MolToSmilesDeterminizedDecoder(...) can reduce exhaustive decoder cost on some molecules
the RDKit columns are not exact enumeration; they are random sampling until RDKit happens to reach 1/2 or full support
because of that, RDKit can be cheaper when you only want a few random strings, especially on small cases
but on the larger molecules in this table, Grimace exact methods were usually much faster than this RDKit sampling-to-coverage baseline when guaranteed full support was the goal

Regenerate it with:

RUN_PERF_TESTS=1 PYTHONPATH=python:. .venv/bin/python -m unittest tests.perf.test_readme_timings -q

Current limits

The public API keeps RDKit MolToSmiles flag names, but it does not aim for full RDKit writer-surface parity yet.

For terminology and test policy, see Correctness contracts. In short: RDKit writer-matching behavior is intentionally separate from the principled SMILES/chemistry semantics layer. RDKit-specific traversal and directional-bond placement rules belong to the writer-parity layer, not the generic semantic layer.

Known stereo writer-parity work in progress:

minimal_nonstereo_double_hazard (C/N=C1C=C/C(=N/C)[N-]/1) currently has the same direction-erased skeleton support as RDKit, but some slash/backslash markers are placed in different writer positions.
reduced_porphyrin_traversal_coupling currently exercises a larger traversal-coupling gap where RDKit's writer policy narrows the traversal assignment support beyond the semantic carrier choices.
A separate RDKit-known quirk, dative_carbonyl_stereo_annotation_drops_on_smiles_roundtrip, is tracked as observed RDKit behavior rather than a generic SMILES semantics rule.

This work is active on the stereo-constraint-model branch and is not yet in the released mainline runtime. The goal is exact RDKit writer support for the documented runtime subset, without mixing RDKit-specific spelling quirks into the generic semantic layer.

Current public runtime contract:

canonical=False
doRandom=True
omit rootedAtAtom or pass rootedAtAtom=-1 for all-roots behavior
pass rootedAtAtom >= 0 for one explicit root
other negative integer rootedAtAtom values are also accepted for RDKit compatibility, but -1 is the preferred public spelling
rootedAtAtom=None is not supported; omit the argument or use -1

Supported writer flags today:

isomericSmiles
kekuleSmiles
allBondsExplicit
allHsExplicit
ignoreAtomMapNumbers

Anything outside that runtime subset fails fast. Unsupported flag combinations raise NotImplementedError. Other invalid public inputs can still raise more specific exceptions such as IndexError or ValueError.

Disconnected molecules are supported by the public APIs.

MolToSmilesEnum(...), MolToSmilesDecoder(...), and MolToSmilesDeterminizedDecoder(...) compose fragment-wise behavior directly.
MolToSmilesTokenInventory(...) returns the union of fragment inventories plus the "." separator token.
MolToSmilesTokenInventorySuperset(...) returns the corresponding static fragment inventory union plus the "." separator token.
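One way to picture fragment-wise composition is a Cartesian product of per-fragment supports, joined with "." in the preserved fragment order. The sketch below is a toy model under that assumption, not a description of grimace's internals: the function names and the per-fragment strings are hypothetical and hand-written.

```python
from itertools import product


def compose_fragment_supports(fragment_supports):
    """Join one string per fragment with '.', keeping fragment order fixed."""
    return {".".join(parts) for parts in product(*fragment_supports)}


def compose_fragment_inventories(fragment_inventories):
    """Union the fragment token inventories and add the '.' separator."""
    tokens = set().union(*fragment_inventories)
    tokens.add(".")
    return tuple(sorted(tokens))


# Hand-written stand-ins: a two-atom fragment followed by a one-atom fragment.
toy_supports = [["CO", "OC"], ["N"]]
toy_composed = compose_fragment_supports(toy_supports)
toy_tokens = compose_fragment_inventories([("C", "O"), ("N",)])

print(sorted(toy_composed))
print(toy_tokens)
```

Note that the second fragment never appears before the first in this model, which reflects the documented rule that fragment order is preserved while each fragment can still vary internally.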

Docs

License

grimace is source-available under PolyForm Noncommercial 1.0.0. Third-party components remain under their own licenses; see THIRD_PARTY_NOTICES.md. Commercial use requires a separate commercial license from the author. The software is provided as is, without warranty or liability, to the extent allowed by law.
