MotifML is a Kedro-based symbolic-music research pipeline for turning Motif exports into deterministic IR corpora and downstream ML-ready datasets.
The repository currently covers raw-corpus ingestion, canonical IR build and validation, fixture-backed regression artifacts, deterministic split planning, and the baseline stages for downstream experiments: normalization, feature extraction, tokenization, and decoder-only Transformer training and evaluation.
- ingest source corpora into canonical Motif JSON with deterministic manifests and summaries
- build canonical IR documents plus corpus manifests, validation reports, and scale summaries
- track regression surfaces through fixtures, golden IR artifacts, inspection bundles, tracked training fixtures, and inspection notebooks
- plan deterministic score-level experiment splits with persisted manifests and split summaries
- project IR documents into sequence, graph, and hierarchical feature views
- package baseline model-input artifacts under Kedro
- train the baseline decoder-only Transformer through Kedro-managed checkpoints and reporting
- evaluate the best baseline checkpoint with quantitative metrics, structural checks, and qualitative decoded samples
Generation pipelines are not implemented yet.
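As an illustration of what deterministic score-level split planning can look like, the sketch below assigns each score to a split by hashing its identifier. This is not MotifML's actual implementation: the function names, the seed string, and the 80/10/10 ratios are all assumptions chosen for the example.

```python
import hashlib

# Hypothetical split ratios; the real project may use different values.
DEFAULT_RATIOS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(score_id: str, seed: str = "demo-seed", ratios=DEFAULT_RATIOS) -> str:
    """Map a score identifier to a split name using a stable hash."""
    digest = hashlib.sha256(f"{seed}:{score_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in ratios:
        cumulative += fraction
        if bucket < cumulative:
            return name
    # Guard against floating-point rounding at the upper edge.
    return ratios[-1][0]


def plan_splits(score_ids) -> dict[str, str]:
    """Build a split manifest keyed by score id, in sorted order."""
    return {score_id: assign_split(score_id) for score_id in sorted(score_ids)}
```

Because the assignment depends only on the seed and the score identifier, the same corpus yields the same manifest regardless of input order or machine, and adding a new score never moves an existing one to a different split.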
src/motifml/ Core library and Kedro pipeline code
conf/ Shared Kedro catalog, parameters, and logging config
docs/source/ Versioned overview, guides, and technical reference docs
notebooks/ Inspection and exploration notebooks
tests/ Unit, integration, and fixture-backed regression tests
tools/ Fixture and inspection-bundle regeneration scripts
The data directory follows Kedro's staged layout:
data/
├── 00_corpus/
├── 01_raw/
├── 02_intermediate/
├── 03_primary/
├── 04_feature/
├── 05_model_input/
├── 06_models/
├── 07_model_output/
└── 08_reporting/
Tracked source fixtures live under tests/fixtures/. Runtime corpora and model artifacts
under data/ are intentionally excluded from version control.
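Since the data/ tree is not tracked, a fresh checkout may lack some stage directories. The following is a minimal sketch, using only the stage names listed above, of how one might create or check the layout; the helper names are hypothetical and not part of MotifML.

```python
from pathlib import Path

# Stage directory names taken from the layout shown above.
STAGES = (
    "00_corpus",
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
)


def ensure_data_layout(root: Path) -> list[Path]:
    """Create any missing stage directories under root and return all of them."""
    paths = []
    for stage in STAGES:
        path = root / stage
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


def missing_stages(root: Path) -> list[str]:
    """Return the stage names that do not yet exist as directories under root."""
    return [stage for stage in STAGES if not (root / stage).is_dir()]
```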
Create the environment:
uv venv --python 3.11
uv sync --extra dev
Add your source corpus anywhere under data/00_corpus/ and place the
Motif CLI binary at tools/motif-cli.
Build and summarize the raw Motif JSON corpus:
uv run kedro run --pipelines=ingestion
Run the full default preprocessing pipeline through tokenization:
uv run kedro run --async
Run the canonical single-command baseline training path:
uv run kedro run --pipelines=baseline_training
The default pipeline intentionally stops at 05_model_input; heavy training work lives
behind the explicit baseline_training pipeline so maintainers can opt into it
deliberately.
Run the canonical single-command end-to-end baseline review path:
uv run kedro run --pipelines=baseline_training_evaluation
This path runs the baseline from raw corpus inputs through evaluation outputs and is the
fastest way to refresh the full 05_model_input + 06_models + 07_model_output +
08_reporting review surface in one go.
Run baseline evaluation after training artifacts exist:
uv run kedro run --pipelines=evaluation
The evaluation pipeline reuses the persisted 05_model_input, frozen vocabulary, and
best checkpoint from baseline_training, then writes decoded sample tables under
07_model_output plus metrics and Markdown review artifacts under 08_reporting.
Launch inspection tools when needed:
uv run kedro viz
uv run jupyter lab
The training notebooks under notebooks/ can inspect either runtime outputs under
data/ or a temporary artifact root exposed through MOTIFML_TRAINING_ARTIFACT_ROOT.
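A notebook could pick between the two locations with a small helper like the one below. Only the environment variable name comes from this README; the function name, the fallback to a relative data/ path, and the treatment of blank values are assumptions, and the real notebooks may resolve the root differently.

```python
import os
from pathlib import Path

ENV_VAR = "MOTIFML_TRAINING_ARTIFACT_ROOT"


def resolve_artifact_root(default: Path = Path("data")) -> Path:
    """Return the override from the environment if set, else the default root."""
    override = os.environ.get(ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    # Assumption: an unset or blank variable means "use the runtime data/ tree".
    return default
```

For example, exporting the variable to a temporary directory before launching Jupyter would point such a helper at that directory, while leaving it unset falls back to data/.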
Regenerate the tracked raw fixtures and golden IR subset:
uv run python tools/regenerate_ir_fixture_corpus.py
Regenerate the tracked inspection bundles:
uv run python tools/generate_ir_inspection_bundles.py
Regenerate the tracked training fixtures and smoke bundle:
uv run python tools/regenerate_training_fixtures.py
The core project documentation is organized as:
- docs/source/overview/
- docs/source/guides/
- docs/source/reference/
Useful entry points include:
- docs/source/index.rst
- docs/source/guides/contributing.rst
- docs/source/guides/ir_engineering.rst
- docs/source/guides/training_workflow.rst
- docs/source/guides/inspection_artifacts.rst
- docs/source/reference/ir_contract.rst
- docs/source/reference/training_contract.rst
Run the main verification commands with:
uv run ruff check . --fix
uv run ruff format .
uv run mypy src
uv run pytest
Run pre-commit hooks across the repository with:
uv run pre-commit run --all-files
License: MIT; see LICENSE.md.