This repository implements a text-to-MIDI generation system that creates symbolic music from natural language descriptions.
MSc Applied Artificial Intelligence - ICESI University
This project is part of the MSc in Applied Artificial Intelligence program at Universidad ICESI, Colombia. It focuses on symbolic music generation using deep learning techniques to bridge natural language and musical representation.
Specific Objectives
- Review the state-of-the-art related to symbolic music generation models from textual descriptions.
- Deepen the understanding and explanation of the training and generation process of the Transformer architecture model, detailing how it transforms text into symbolic music.
- Design a prototype for the conversion of natural language descriptions into symbolic music, evaluating its impact on music generation.
This project requires Python 3.12+ and the uv package manager.
# Clone the repository
git clone https://github.com/NickEsColR/symbolic-music-generation.git
cd symbolic-music-generation
# Install dependencies with uv
uv sync
If you plan to use natural language translation via an LLM (e.g., Gemini) by passing --translator-model in the CLI, you must set up your environment variables:
- Copy the example .env file or create a new one:
cp .env.example .env
- Add your Google API Key to the .env file:
GOOGLE_API_KEY=your_api_key_here
To run the generation models effectively, please note the following hardware constraints:
- CPU Evaluation: The sequence evaluation phase (scoring generated MIDI with musical heuristics) is highly CPU-intensive.
- VRAM Requirements (MidiLLM): The MidiLLM model is the heaviest in the system and requires a dedicated GPU with at least 4GB of VRAM for stable execution. However, despite being heavier, MidiLLM generates music significantly faster due to its batch processing capabilities.
- Execution Time (Text2Midi): Generation time for the Text2Midi model grows exponentially as the number of branches in beam search increases, because these searches are evaluated sequentially rather than in parallel.
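As a rough pre-flight check for the VRAM constraint above, the following sketch queries `nvidia-smi` for total memory per GPU and compares it with the 4 GB figure. It is not part of the project: the function names are hypothetical, it assumes an NVIDIA GPU with `nvidia-smi` on the PATH, and it only reports total memory, not what is currently free.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 4 * 1024  # the 4 GB figure quoted above, expressed in MiB


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from csv,noheader,nounits output."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():  # skip blank or non-numeric lines such as "[N/A]"
            values.append(int(line))
    return values


def query_vram_mib() -> list[int]:
    """Return total memory per NVIDIA GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_vram_mib(result.stdout)


def can_run_midillm(vram_per_gpu: list[int], minimum: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU meets the minimum total memory."""
    return any(total >= minimum for total in vram_per_gpu)


if __name__ == "__main__":
    gpus = query_vram_mib()
    print("GPU memory (MiB):", gpus)
    print("Enough VRAM for MidiLLM:", can_run_midillm(gpus))
```

On a machine without an NVIDIA driver the check simply reports no GPUs, which is a hint to run the notebooks in Colab instead.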
Note: All notebooks can be run directly in Google Colab using their respective badges below.
This notebook addresses the first part of Objective 2 by providing a deep dive into the encoder-decoder architecture and the training process of the model.
The analysis covers:
- Dataset Preparation: Preprocess SymphonyNet dataset using music21 to extract tempo, key, BPM, and instruments
- Pseudo-Caption Generation: Create template-based captions for pre-training
- Model Architecture:
  - Encoder: FlanT5-base for text processing
  - Tokenizer: REMI+ for MIDI tokenization
  - Decoder: Processes encoded text and tokenized MIDI
- Training: Complete training loop with the prepared datasets
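The pseudo-caption step listed above can be illustrated with a minimal sketch. The template wording, the `MidiMetadata` fields, and the tempo thresholds are assumptions made for illustration; the notebook's actual templates may differ.

```python
from dataclasses import dataclass


@dataclass
class MidiMetadata:
    """Hypothetical container for features extracted from a MIDI file."""
    tempo_bpm: int
    key: str
    instruments: list[str]


def tempo_word(bpm: int) -> str:
    """Map a BPM value to a coarse tempo description (thresholds are assumed)."""
    if bpm < 80:
        return "slow"
    if bpm < 120:
        return "moderate"
    return "fast"


def pseudo_caption(meta: MidiMetadata) -> str:
    """Fill a fixed sentence template with the extracted metadata."""
    instruments = ", ".join(meta.instruments) if meta.instruments else "unspecified instruments"
    return (
        f"A {tempo_word(meta.tempo_bpm)} piece in {meta.key} "
        f"at {meta.tempo_bpm} BPM, played by {instruments}."
    )


if __name__ == "__main__":
    print(pseudo_caption(MidiMetadata(72, "D minor", ["violin", "cello"])))
```

Because the captions come from a fixed template, they are cheap to produce for every file in the dataset, at the cost of much less linguistic variety than human-written captions.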
This notebook addresses the second part of Objective 2 by focusing on the generation process. It explores how to use the pre-trained text2midi model to generate symbolic music from natural language prompts.
The analysis covers:
- Environment Setup & Model Loading: Preparing the test lab and configuring the Hugging Face model for internal data extraction.
- Initial Processing: Tokenization and Embeddings, observing how text translates to a latent space.
- Encoder Journey: Analyzing self-attention and contextual understanding of the textual prompt.
- Decoder & MIDI Generation: Deconstructing the autoregressive mechanism and cross-attention step-by-step.
- Raw Output & Final Reconstruction: Decoding model logits back into playable MIDI files using REMI+.
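The autoregressive mechanism mentioned above can be sketched with a toy decoder. Everything here is a stand-in: `toy_logits` replaces the real Transformer decoder, and the six-token vocabulary only imitates the flavor of REMI-style tokens rather than the actual REMI+ vocabulary.

```python
import math

# Illustrative vocabulary only; not the real REMI+ token set.
VOCAB = ["<bos>", "Bar", "Position_0", "Pitch_60", "Duration_4", "<eos>"]


def toy_logits(prefix: list[int]) -> list[float]:
    """Toy stand-in for a decoder: strongly prefers the token after the last one."""
    preferred = min(prefix[-1] + 1, len(VOCAB) - 1)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(max_steps: int = 10) -> list[str]:
    """Generate one token at a time, feeding each choice back as context."""
    tokens = [0]  # start from <bos>
    for _ in range(max_steps):
        probs = softmax(toy_logits(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return [VOCAB[i] for i in tokens]


if __name__ == "__main__":
    print(greedy_decode())
```

In the real model the logits at each step also depend on the encoded text through cross-attention, and sampling strategies other than greedy argmax can be used; the loop structure stays the same.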
This notebook addresses Objective 3 by evaluating the complete MIDI generation pipeline from natural language descriptions. It compares two distinct generation strategies and assesses their impact on the final musical output.
The analysis covers:
- Pipeline & Environment Setup: Configuration of translators and necessary dependencies to run the end-to-end flow.
- Base Prompt Definition: Establishment of the natural language request ("Una melodía de piano melancólica...", in English "A melancholic piano melody...") to be processed.
- Flow 1: Text2Midi (Progressive Search): Generation using the one-shot profile to evaluate progressive search capabilities.
- Flow 2: MidiLLM (Best-of-N): Generation using the midillm-fast profile to evaluate batch generation and selection.
- Conclusions & Review: Comparative analysis of the resulting technical prompts, instrumentation adherence, tempo biases, and harmonic progression accuracy.
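The Best-of-N idea behind Flow 2 can be sketched as "generate N candidates, score each, keep the best". The generator and the scoring heuristic below are toy assumptions: the project's real evaluator and MidiLLM's batch generation are not reproduced here.

```python
import random


def toy_generate(rng: random.Random, length: int = 8) -> list[int]:
    """Toy generator: random MIDI pitches between C4 (60) and C5 (72)."""
    return [rng.randint(60, 72) for _ in range(length)]


def smoothness_score(pitches: list[int]) -> float:
    """Toy heuristic: penalize large melodic leaps (higher is better)."""
    leaps = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return -sum(leaps) / len(leaps)


def best_of_n(n: int, seed: int = 0) -> list[int]:
    """Generate n candidates and return the one with the highest score."""
    rng = random.Random(seed)
    candidates = [toy_generate(rng) for _ in range(n)]
    return max(candidates, key=smoothness_score)


if __name__ == "__main__":
    winner = best_of_n(n=4)
    print("Selected pitches:", winner)
    print("Score:", smoothness_score(winner))
```

Since the candidates are independent of each other, they can be produced in a single batch, which is consistent with the batch-processing advantage noted for MidiLLM in the hardware section.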
The project uses jupytext to create Jupyter cells from a .py file.
To synchronize the .ipynb with the .py, run:
uv run jupytext --set-formats ipynb,py:percent notebooks/notebook.ipynb
To synchronize the .py with the .ipynb, run:
uv run jupytext --sync notebooks/notebook.py
This document addresses the structural design of the project, focusing on how the MIDI generation pipeline is built for maintainability and extensibility.
The overview covers:
- Domain Layer: Core business entities and interfaces (src/domain/).
- Use Cases Layer: Orchestration logic for ProgressiveSearch and BestOfNSearch (src/use_cases/).
- Adapters Layer: Integrations with external models, translators, and evaluators (src/adapters/).
- Configuration & Models: Predefined profiles and neural network definitions.
- Orchestration: The Dependency Injection pipeline that wires the architecture together (pipeline.py).
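The layering described above can be sketched with `typing.Protocol` interfaces and constructor injection. All class and function names below are hypothetical and do not mirror the contents of src/domain/, src/adapters/, or pipeline.py; the adapters are trivial fakes so the wiring can be run on its own.

```python
from dataclasses import dataclass
from typing import Protocol


class Translator(Protocol):
    """Domain-level interface: turns a user request into a technical prompt."""
    def translate(self, text: str) -> str: ...


class Generator(Protocol):
    """Domain-level interface: turns a prompt into MIDI bytes."""
    def generate(self, prompt: str) -> bytes: ...


class UppercaseTranslator:
    """Stand-in adapter; a real one might call an LLM."""
    def translate(self, text: str) -> str:
        return text.upper()


class FakeGenerator:
    """Stand-in adapter; a real one would wrap a music generation model."""
    def generate(self, prompt: str) -> bytes:
        return prompt.encode("utf-8")


@dataclass
class GenerateMidi:
    """Use case: depends only on the interfaces, never on concrete adapters."""
    translator: Translator
    generator: Generator

    def run(self, text: str) -> bytes:
        return self.generator.generate(self.translator.translate(text))


def build_pipeline() -> GenerateMidi:
    """Composition root: the single place where concrete adapters are chosen."""
    return GenerateMidi(translator=UppercaseTranslator(), generator=FakeGenerator())


if __name__ == "__main__":
    print(build_pipeline().run("calm piano"))
```

Keeping the choice of adapters in one function means tests can inject fakes like these, while the use case code stays unchanged when a model or translator is swapped.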
The project includes a robust CLI to generate MIDI files directly from the terminal.
Basic Usage:
python -m src.cli --text "A peaceful piano melody" --output peaceful.mid
For a comprehensive guide on all available arguments (like --profile, --translator-model, and --strict-instruments) and advanced usage examples, refer to the CLI Reference.
The project includes a comprehensive test suite covering the domain, use cases, and adapters. To run the tests, use pytest via uv:
uv run pytest tests/
When working locally, you should place downloaded model weights and vocabularies (e.g., the pre-trained text2midi model from Hugging Face) inside a models/ directory at the root of the project.
mkdir -p models/
# Download the Hugging Face model bin and vocab to this folder
- Size: 46,359 MIDI files
- Duration: 3,284 hours of music
- Content: Symphonic music with multiple instrument tracks
- Reference: SymphonyNet Dataset
- Features: MIDI files with rich text captions
- Metadata: Genre, mood, tempo, key, time signature, chord progressions, instrumentation
- Split: 90/10 train/test partition
- Reference: MidiCaps on HuggingFace
- Text2midi Paper: Text2midi: Generating Symbolic Music from Captions
- Pre-trained Model: text2midi on HuggingFace
- SymphonyNet Dataset: https://symphonynet.github.io/
- MidiCaps Dataset: https://huggingface.co/datasets/amaai-lab/MidiCaps
If you use this project or our research in your work, please cite the underlying models and papers:
@inproceedings{bhandari2025text2midi,
title={text2midi: Generating Symbolic Music from Captions},
author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans},
booktitle={Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)},
year={2025}
}
@inproceedings{wu2025midillm,
title={{MIDI-LLM}: Adapting large language models for text-to-{MIDI} music generation},
author={Wu, Shih-Lun and Kim, Yoon and Huang, Cheng-Zhi Anna},
booktitle={Proc. NeurIPS AI4Music Workshop},
year={2025}
}
@misc{llama3researchcloud,
title={The Llama 3 Herd of Models},
author={Meta Llama Team},
year={2024},
eprint={2407.21783},
archivePrefix={arXiv},
primaryClass={cs.LG}
}