MotifML

MotifML is a Kedro-based symbolic-music research pipeline for turning Motif exports into deterministic IR corpora and downstream ML-ready datasets.

The repository currently covers raw-corpus ingestion, canonical IR build and validation, fixture-backed regression artifacts, and deterministic split planning. For downstream experiments it also provides baseline normalization, feature-extraction, and tokenization stages, plus training and evaluation of a baseline decoder-only Transformer.

Current Scope

ingest source corpora into canonical Motif JSON with deterministic manifests and summaries
build canonical IR documents plus corpus manifests, validation reports, and scale summaries
track regression surfaces through fixtures, golden IR artifacts, inspection bundles, tracked training fixtures, and inspection notebooks
plan deterministic score-level experiment splits with persisted manifests and split summaries
project IR documents into sequence, graph, and hierarchical feature views
package baseline model-input artifacts under Kedro
train the baseline decoder-only Transformer through Kedro-managed checkpoints and reporting
evaluate the best baseline checkpoint with quantitative metrics, structural checks, and qualitative decoded samples

Generation pipelines are not implemented yet.
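
As an illustration of what "deterministic score-level splits" can mean in practice, the sketch below assigns each score to a split by hashing its identifier. It is a minimal example under stated assumptions, not MotifML's actual split planner: the names `assign_split` and `plan_splits`, the seed string, and the 80/10/10 ratios are all hypothetical.

```python
import hashlib

# Hypothetical split ratios; the project's real configuration is not shown here.
SPLIT_RATIOS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(score_id: str, seed: str = "motifml-demo") -> str:
    """Map a score identifier to a split name using a stable hash."""
    digest = hashlib.sha256(f"{seed}:{score_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, ratio in SPLIT_RATIOS:
        cumulative += ratio
        if bucket < cumulative:
            return name
    # Guard against floating-point rounding in the cumulative sum.
    return SPLIT_RATIOS[-1][0]


def plan_splits(score_ids):
    """Return a manifest-style mapping, keyed in sorted order for stable output."""
    return {score_id: assign_split(score_id) for score_id in sorted(score_ids)}
```

Because the assignment depends only on the seed and the score identifier, the same score lands in the same split regardless of corpus ordering or of which other scores are present, which is the property a persisted split manifest needs.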

Repository Layout

src/motifml/           Core library and Kedro pipeline code
conf/                  Shared Kedro catalog, parameters, and logging config
docs/source/           Versioned overview, guides, and technical reference docs
notebooks/             Inspection and exploration notebooks
tests/                 Unit, integration, and fixture-backed regression tests
tools/                 Fixture and inspection-bundle regeneration scripts

The data directory follows Kedro's staged layout:

data/
├── 00_corpus/
├── 01_raw/
├── 02_intermediate/
├── 03_primary/
├── 04_feature/
├── 05_model_input/
├── 06_models/
├── 07_model_output/
└── 08_reporting/

Tracked source fixtures live under tests/fixtures/. Runtime corpora and model artifacts under data/ are intentionally excluded from version control.
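
Since the contents of data/ are untracked, it can be useful to check which stage directories exist in a local checkout. The following is a small sketch for that purpose; the helper name `describe_data_layout` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Stage directory names taken from the layout listed above.
STAGES = (
    "00_corpus",
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
)


def describe_data_layout(data_root: Path) -> dict[str, bool]:
    """Report, for each stage, whether its directory exists under data_root."""
    return {stage: (data_root / stage).is_dir() for stage in STAGES}


if __name__ == "__main__":
    for stage, present in describe_data_layout(Path("data")).items():
        print(f"{stage}: {'present' if present else 'missing'}")
```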

Getting Started

Create the environment:

uv venv --python 3.11
uv sync --extra dev

Add your source corpus anywhere under data/00_corpus/ and place the Motif CLI binary at tools/motif-cli.
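
Before running ingestion, a quick preflight check can confirm that both prerequisites are in place. The sketch below is an assumption-laden example rather than something shipped with the repository: the function name `preflight` is hypothetical, and the check assumes a POSIX-style executable bit on the binary.

```python
import os
from pathlib import Path


def preflight(repo_root: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    corpus_dir = repo_root / "data" / "00_corpus"
    cli_path = repo_root / "tools" / "motif-cli"

    # The corpus may live anywhere under data/00_corpus/, so search recursively.
    has_corpus_files = corpus_dir.is_dir() and any(
        path.is_file() for path in corpus_dir.rglob("*")
    )
    if not has_corpus_files:
        problems.append(f"no corpus files found under {corpus_dir}")

    if not cli_path.is_file():
        problems.append(f"Motif CLI binary not found at {cli_path}")
    elif not os.access(cli_path, os.X_OK):
        problems.append(f"Motif CLI binary at {cli_path} is not executable")

    return problems


if __name__ == "__main__":
    for problem in preflight(Path.cwd()):
        print(f"preflight: {problem}")
```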

Build and summarize the raw Motif JSON corpus:

uv run kedro run --pipelines=ingestion

Run the full default preprocessing pipeline through tokenization:

uv run kedro run --async

Run the canonical single-command baseline training path:

uv run kedro run --pipelines=baseline_training

The default pipeline intentionally stops at 05_model_input; heavy training work lives behind the explicit baseline_training pipeline so maintainers can opt into it deliberately.

Run the canonical single-command end-to-end baseline review path:

uv run kedro run --pipelines=baseline_training_evaluation

This path runs the baseline from raw corpus inputs through evaluation outputs. It is the fastest way to refresh the full review surface (05_model_input, 06_models, 07_model_output, and 08_reporting) in a single run.

Run baseline evaluation after training artifacts exist:

uv run kedro run --pipelines=evaluation

The evaluation pipeline reuses the persisted 05_model_input artifacts, the frozen vocabulary, and the best checkpoint from baseline_training. It then writes decoded sample tables under 07_model_output, plus metrics and Markdown review artifacts under 08_reporting.

Launch inspection tools when needed:

uv run kedro viz
uv run jupyter lab

The training notebooks under notebooks/ can inspect either runtime outputs under data/ or a temporary artifact root exposed through MOTIFML_TRAINING_ARTIFACT_ROOT.
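
One plausible way for a notebook to choose between the two locations is to read the environment variable and fall back to data/ when it is unset. This is only a sketch of that pattern; the function name `resolve_artifact_root` is hypothetical, and the notebooks' actual resolution logic is not shown in this README.

```python
import os
from pathlib import Path

ENV_VAR = "MOTIFML_TRAINING_ARTIFACT_ROOT"


def resolve_artifact_root(default: Path = Path("data"), env=None) -> Path:
    """Pick the artifact root: the environment override if set, else the default."""
    # Accept an explicit mapping so the function is easy to exercise in isolation.
    env = os.environ if env is None else env
    override = env.get(ENV_VAR, "").strip()
    return Path(override).expanduser() if override else default


if __name__ == "__main__":
    print(f"Inspecting artifacts under: {resolve_artifact_root()}")
```

Under this assumption, exporting the variable with a temporary directory path before launching Jupyter would point the notebooks at that directory, while leaving it unset keeps them on the runtime outputs under data/.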

Fixtures and Inspection Artifacts

Regenerate the tracked raw fixtures and golden IR subset:

uv run python tools/regenerate_ir_fixture_corpus.py

Regenerate the tracked inspection bundles:

uv run python tools/generate_ir_inspection_bundles.py

Regenerate the tracked training fixtures and smoke bundle:

uv run python tools/regenerate_training_fixtures.py

The core project documentation is organized as:

docs/source/overview/
docs/source/guides/
docs/source/reference/

Useful entry points include:

docs/source/index.rst
docs/source/guides/contributing.rst
docs/source/guides/ir_engineering.rst
docs/source/guides/training_workflow.rst
docs/source/guides/inspection_artifacts.rst
docs/source/reference/ir_contract.rst
docs/source/reference/training_contract.rst

Development

Run the main verification commands with:

uv run ruff check . --fix
uv run ruff format .
uv run mypy src
uv run pytest

Run pre-commit hooks across the repository with:

uv run pre-commit run --all-files

License

MIT; see LICENSE.md.
