Commits
109 commits
f7f9aae
wip
paarthneekhara Apr 15, 2026
6cf3e02
WIP
paarthneekhara Apr 16, 2026
ed01ceb
speaker encoder optional
paarthneekhara Apr 19, 2026
74cd9cd
Apply isort and black reformatting
paarthneekhara Apr 22, 2026
382fb95
add option to remove text embedding and lm head
paarthneekhara May 2, 2026
3899c7e
cas encoder layers
paarthneekhara May 2, 2026
3711b3a
use IPA as text prob added during training
paarthneekhara May 3, 2026
a5e00b3
Add multiturn dataloader
Edresson Apr 13, 2026
ca4ca99
Update multiturn dataloader
Edresson Apr 14, 2026
3b44776
Add multiturn config
Edresson Apr 14, 2026
7048d3c
Add a intermediary fix for prior
Edresson Apr 15, 2026
1a56d9e
Fix audio tokens name
Edresson Apr 15, 2026
7bcaa8e
Add formatter to support json dataset on lhotse inference
Edresson Apr 15, 2026
a741d44
Update inference to support multiturn dataloader
Edresson Apr 16, 2026
4e2e236
Bug fix in dataloder
Edresson Apr 17, 2026
9cf7584
Add parameter to remove user turns
Edresson Apr 17, 2026
7ad497d
Add multiturn inference script
Edresson Apr 21, 2026
61e2a3e
Update inference script
Edresson Apr 21, 2026
56edfbb
Remove unused codes
Edresson Apr 21, 2026
4f1a87e
Update inference recipe
Edresson Apr 22, 2026
03e0e47
Remove librosa resample
Edresson Apr 22, 2026
867e163
Add silence aug
Edresson Apr 22, 2026
7f5d81e
Add silence tts data augmentation
Edresson Apr 23, 2026
6aa93ea
Add parameter to remove subword text conditioning
Edresson Apr 27, 2026
60dbe33
Add support for extra duplex dataloaders
Edresson Apr 27, 2026
481543e
Add partial loading
Edresson Apr 28, 2026
3a38620
Fix interruption handling for validation dataset
Edresson Apr 28, 2026
0587411
Add user silence mask
Edresson Apr 30, 2026
d4d2c69
Add use_user_speaking_token
Edresson May 1, 2026
d4c29ae
IPA handling in multiturn data
shehzeen May 2, 2026
1062f8b
Fix inference script
Edresson May 4, 2026
4ab752b
Add transition tokens on loss
Edresson May 4, 2026
5ab5340
Add extra parameter type
Edresson May 4, 2026
807e3a4
Fix augmentation
Edresson May 5, 2026
0b11e36
Add min_number_of_turns and max_gap_duration_collapse_turns
Edresson May 5, 2026
5bd03d1
Add phoneme multiturn inference support and update silence augmentation
Edresson May 7, 2026
70cf559
Fix sil augmentation on formatter
Edresson May 8, 2026
4bbfd69
Fix inference
Edresson May 8, 2026
cc38c61
Add new inference script and fix data formatter
Edresson May 11, 2026
d2cd65b
Fix merge issue
Edresson May 12, 2026
cd955e1
Add restore custom checkpoint to avoid full model loading on .nemo ch…
Edresson May 13, 2026
bd764f8
Add raw tts data support on TTS dataloader
Edresson May 14, 2026
9a1c0d3
Remove complex prefil code and add slupport to nemotron_h on prefill
Edresson May 14, 2026
0afe005
Add user aduio conditioning
Edresson May 17, 2026
a9e106e
Add multiturn inference support with user audio conditioning
Edresson May 18, 2026
8ed4363
Add silence if user audio is not available
Edresson May 19, 2026
c9ade51
Add multiturn augmentations
Edresson May 19, 2026
c59c8f8
Add update inference
Edresson May 20, 2026
6a23934
Add use_explicit_silence_for_streaming_audio_delay
Edresson May 20, 2026
bf402b7
Add new trim augmentation
Edresson May 21, 2026
9f64580
Add new trim aug
Edresson May 21, 2026
183b2b6
remove use_explicit_silence_for_streaming_audio_delay
Edresson May 21, 2026
126970c
Add new inference
Edresson May 21, 2026
460f427
Update inference script
Edresson May 22, 2026
37e8b3b
Fix phoneme loss
Edresson May 25, 2026
0f2975c
Remove debug print
Edresson May 25, 2026
0bd3ff7
Fix phoneme inference
Edresson May 25, 2026
f76314f
Add phoneme_loss_mask_padding
Edresson May 26, 2026
0c254c7
Update inference
Edresson May 26, 2026
d974415
Add full user prefill support on nemotron_h class
Edresson May 26, 2026
bb7022c
Add phoneme_loss_mask_agent_expanded
Edresson May 27, 2026
aea4c3c
Rename phoneme_loss_mask_include_transition
Edresson May 27, 2026
10d8904
Remove unused code
Edresson May 28, 2026
c01730f
Add partial copy support
Edresson May 28, 2026
a2a21d2
Add parameter to drop all turn in sample
Edresson May 29, 2026
d52449e
phoneme turn dropout
shehzeen May 30, 2026
649f758
filewise metrics and aggregated metrics in the inference script
shehzeen Jun 1, 2026
e32d4e1
Add multigpu inference script
Edresson Jun 1, 2026
7cea0ef
phoneme pad for short turns
shehzeen Jun 3, 2026
ed8007f
ignore punctuation in word counting
shehzeen Jun 4, 2026
5d130dc
Add turn based metrics
Edresson Jun 6, 2026
646e576
Remove unused methods
Edresson Jun 8, 2026
dea328c
Add new easymagpie compatible inference script
Edresson Jun 8, 2026
6ab2a2e
Update EasyMagpie inference script to support multiturn
Edresson Jun 8, 2026
585decb
Fix new inference volume norm
Edresson Jun 8, 2026
032e8df
Expose ASR and EOU batch sizes on config
Edresson Jun 9, 2026
79a7670
Get language from dataloader for multiturn eval
Edresson Jun 9, 2026
2aa37fe
Remove old multiturn eval scripts
Edresson Jun 9, 2026
0149169
Undo unecessary changes on cutset
Edresson Jun 9, 2026
54f94f4
Remove unused user_audio_mask
Edresson Jun 9, 2026
0910c68
Aplly Black
Edresson Jun 9, 2026
cf4ae23
Remove unused params
Edresson Jun 9, 2026
eb239d3
short phoneme turn handling
shehzeen Jun 9, 2026
341af89
Bug fix on metrics computation
Edresson Jun 10, 2026
20f777b
Add ECAPA2 SSIM
Edresson Jun 16, 2026
23a44f2
Revert "Add ECAPA2 SSIM"
Edresson Jun 16, 2026
94af99e
Add Speaker filter
Edresson Jun 18, 2026
a94514f
Add support for the text pretraining checkpoint
Edresson Jun 20, 2026
84ba0c9
Revert "Add support for the text pretraining checkpoint"
Edresson Jun 20, 2026
d766b7a
Add emotion cosine similarity and emotion match rate metric
Edresson Jun 20, 2026
106e4fe
Update emotion_embedding_type default to score_vector
Edresson Jun 20, 2026
e4864ed
infererence user turn end token fix
shehzeen Jun 21, 2026
f2c4b23
Update
Edresson Jun 22, 2026
c849cba
Add strip_text_annotations_for_metrics parameter
Edresson Jun 22, 2026
a98bf47
Remove symbolic links on multiturn eval and added ground truth multi…
Edresson Jun 22, 2026
f03723a
Fix empty turn issue for no annotation data
Edresson Jun 22, 2026
1b16091
Fix short words issue
Edresson Jun 22, 2026
0b65f2f
Add phoneme prediction on exportable .json and .csv files
Edresson Jun 23, 2026
6a2190d
Fix try except errors and fix normalization
Edresson Jun 23, 2026
5978897
Clean up unused code
Edresson Jun 23, 2026
609bce6
Apply Black
Edresson Jun 23, 2026
82d1efa
Fix imports order
Edresson Jun 23, 2026
e987ee1
Add unit tests
Edresson Jun 23, 2026
da88283
Fix black check
Edresson Jun 23, 2026
500c845
Replace instantiate with safe_instantiate
Edresson Jun 24, 2026
dc1d48b
Fix unit tests
Edresson Jun 24, 2026
eb6b252
Add turn level volume normalization
Edresson Jun 24, 2026
b4aef5e
remove configurable cas encoder layers since it is not needed
shehzeen Jun 24, 2026
3b5077f
Left silence pad short user audios for safety
Edresson Jun 25, 2026
231 changes: 231 additions & 0 deletions examples/tts/conf/magpietts/easy_magpietts_lhotse_multiturn.yaml
@@ -0,0 +1,231 @@
name: Magpie-TTS-DecoderOnly-EN

quadratic_duration: 20

# Adjust batch size based on GPU memory
# When doing weighted sampling with multiple manifests, this defines how many training steps are in an epoch.
# If null, then weighted sampling is disabled.

model:
use_lhotse: true

# Decoder backend selection
# Options: "huggingface" (default), "nemotron_h"
decoder_type: "huggingface"

# HuggingFace backend config (used when decoder_type: "huggingface")
transformer_hf_backend: "Qwen/Qwen2.5-1.5B"

# NemotronH config (used when decoder_type: "nemotron_h")
# Hybrid Mamba2/MoE/Attention model (~3B total, ~600-800M active). Layer types via hybrid_override_pattern:
# 'M' = Mamba2 layer, '*' = Attention layer, '-' = MLP layer, 'E' = MoE layer
nemotron_h_config:
hidden_size: 1536 # Should match embedding_dim
num_hidden_layers: 48
vocab_size: 131072
# Attention config
num_attention_heads: 12
num_key_value_heads: 4
attention_dropout: 0.0
attention_bias: false
max_position_embeddings: 8192
# Mamba config
mamba_num_heads: 64
mamba_head_dim: 24
ssm_state_size: 128
conv_kernel: 4
n_groups: 8
chunk_size: 256
mamba_hidden_act: "silu"
use_conv_bias: true
use_bias: false
# MLP config
intermediate_size: 4096
mlp_hidden_act: "silu"
mlp_bias: false
# MoE config (scaled from Nemotron-3-Nano-30B-A3B)
n_routed_experts: 48
num_experts_per_tok: 6
moe_intermediate_size: 1024
moe_shared_expert_intermediate_size: 2048
n_group: 1
topk_group: 1
routed_scaling_factor: 2.5
norm_topk_prob: true
# Layer pattern: (M E M E M *) x 8 => 24 Mamba, 16 MoE, 8 Attention (48 layers, matching num_hidden_layers)
hybrid_override_pattern: "MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*"
# Normalization
layer_norm_epsilon: 1e-5
residual_in_fp32: true
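
A minimal sketch for checking the layer mix implied by `hybrid_override_pattern`, assuming each character in the pattern maps to exactly one layer as the config comment describes. The helper name `count_layer_types` and the `LAYER_NAMES` labels are illustrative and not part of NeMo.

```python
from collections import Counter

# Values copied from nemotron_h_config in this config file.
HYBRID_OVERRIDE_PATTERN = "MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*"
NUM_HIDDEN_LAYERS = 48

# Symbol meanings as documented in the config comment; the label strings are illustrative.
LAYER_NAMES = {"M": "mamba2", "*": "attention", "-": "mlp", "E": "moe"}


def count_layer_types(pattern):
    """Return how many layers of each type a pattern string describes."""
    unknown = set(pattern) - set(LAYER_NAMES)
    if unknown:
        raise ValueError(f"Unknown layer symbols: {sorted(unknown)}")
    return dict(Counter(LAYER_NAMES[symbol] for symbol in pattern))


layer_counts = count_layer_types(HYBRID_OVERRIDE_PATTERN)
# One symbol per layer, so the pattern length should equal num_hidden_layers.
assert len(HYBRID_OVERRIDE_PATTERN) == NUM_HIDDEN_LAYERS
print(layer_counts)  # {'mamba2': 24, 'moe': 16, 'attention': 8}
```

Running a check like this whenever the pattern or `num_hidden_layers` is edited catches a mismatch between the two before a model is built.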

use_text_conditioning_encoder: true # If true, distilbert will be used to encode context_text if provided.
context_duration_min: 5.0
context_duration_max: 5.0
load_cached_codes_if_available: true

embedding_dim: 1536
hidden_dim: 1536
audio_embedding_dim: 1536 # Can set a smaller dimension for audio embeddings to reduce parameters. Set equal to hidden_dim for no projection.
codecmodel_path: ???

# Local transformer parameters for autoregressive codebook prediction within a frame
local_transformer_type: "autoregressive" # "none", "autoregressive"
# The args below are only relevant if local_transformer_type is "autoregressive"
local_transformer_loss_scale: 1.0
phoneme_loss_weight: 1.0
local_transformer_n_layers: 3
local_transformer_n_heads: 12
local_transformer_hidden_dim: 1536

cfg_unconditional_prob: 0.05

# Multi-mode training configuration
training_modes:
- text_input_mode: "streaming" # Options: "full", "streaming"
streaming_phonemes_delay: 0
streaming_speech_delay: 1

frame_stacking_factor: 2
phoneme_stacking_factor: 1
phoneme_confidence_unk_threshold: 0.0 # If max phoneme probability is below this threshold at inference-time, replace the predicted timestep with UNK to reduce error propagation.
dropout_text_input_prob: 0.1
phoneme_corruption_batch_prob: 0.1
phoneme_corruption_timestep_ratio: 0.15
phoneme_corruption_unk_mode_prob: 0.5
phoneme_corruption_type: "repeat_skip_unk" # "repeat_skip_unk" or "complete_channel"
phoneme_turn_dropout_batch_prob: 0.0 # prob of applying turn dropout to a sample
phoneme_turn_dropout_turn_prob: 0.0 # prob of dropping each phoneme turn within a sample
phoneme_turn_max_words_to_drop: 0 # turns with <= this many words keep phoneme tokens as pad_id

phoneme_tokenizer:
_target_: nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPABPETokenizer
tokenizer_path: ???

text_tokenizers:
nemotron_nano_30b:
_target_: AutoTokenizer
pretrained_model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

train_ds:
use_lhotse: ${model.use_lhotse}
volume_norm: true
dataset:
multi_config: true
shuffle: true
seed: 42
shard_seed: "trng"

sampler_fusion: randomized_round_robin
sampler_weights:
tts_data: 0.5
duplex_data: 0.5
tts_data:
min_duration: 0.2
min_context_speaker_similarity: 0.6
max_cer: 0.03
batch_duration: ??? # in seconds. Adjust based on your GPU memory.
quadratic_duration: ${quadratic_duration}
use_bucketing: true
num_buckets: 20
bucket_buffer_size: 10_000
shuffle_buffer_size: 10_000
num_cuts_for_bins_estimate: 10_000
shard_seed: "trng"
drop_last: true
shuffle: true
num_workers: 6
pin_memory: true

input_cfg:
- type: lhotse_shar
shar_path: ???
weight: 1.0
tags:
tokenizer_names: ["english_phoneme"]

duplex_data:
input_cfg: /lustre/fsw/convai_convaird_nemo-speech/data/duplex/multispeaker_syn_duplex.yaml
use_bucketing: true
num_buckets: 20
bucket_buffer_size: 1_000
shuffle_buffer_size: 1_000
num_cuts_for_bins_estimate: 1_000
max_duration: 300 # 5 min max duration
bucket_duration_bins: [4.0, 8.9, 10.2, 11.6, 13.2, 15.0, 17.0, 19.3, 25.0, 31.5, 38.5, 46.0, 55.5, 66.5, 79.5, 93.3, 110.0, 130.0, 156.8, 203.3]
bucket_batch_size: [75, 33, 29, 25, 23, 20, 18, 15, 12, 10, 8, 7, 5, 4, 3, 3, 2, 2, 1, 1]
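
A rough sanity check on the `duplex_data` bucketing values, assuming each entry in `bucket_duration_bins` is the upper duration bound (in seconds) of its bucket and is paired with the batch size at the same index. Under that assumption, duration times batch size bounds the audio seconds in one batch. The helper name `max_batch_seconds` is illustrative only.

```python
# Values copied from the duplex_data section of this config.
bucket_duration_bins = [4.0, 8.9, 10.2, 11.6, 13.2, 15.0, 17.0, 19.3, 25.0, 31.5,
                        38.5, 46.0, 55.5, 66.5, 79.5, 93.3, 110.0, 130.0, 156.8, 203.3]
bucket_batch_size = [75, 33, 29, 25, 23, 20, 18, 15, 12, 10,
                     8, 7, 5, 4, 3, 3, 2, 2, 1, 1]
NUM_BUCKETS = 20


def max_batch_seconds(bins, batch_sizes):
    """Upper bound on audio seconds per batch for each bucket (assumed pairing by index)."""
    if len(bins) != len(batch_sizes):
        raise ValueError("bins and batch sizes must have the same length")
    return [round(duration * size, 1) for duration, size in zip(bins, batch_sizes)]


budgets = max_batch_seconds(bucket_duration_bins, bucket_batch_size)
assert len(budgets) == NUM_BUCKETS
# Print the largest per-batch budget across buckets, in seconds.
print(max(budgets))
```

With these values the first bucket works out to 4.0 x 75 = 300 seconds and no bucket exceeds 322 seconds (46.0 x 7), so the batch sizes look tuned to keep the per-batch audio budget roughly constant for shorter cuts while falling off for the longest ones.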


validation_ds:
use_lhotse: ${model.use_lhotse}
volume_norm: true

dataset:
min_duration: 0.2
min_context_speaker_similarity: 0.6
max_cer: 0.03
batch_duration: ??? # A smaller batch_duration than the training one is recommended for validation.
quadratic_duration: ${quadratic_duration}
use_bucketing: false
force_finite: true
force_map_dataset: true
drop_last: false
shuffle: false
num_workers: 2
pin_memory: true
seed: 42
shard_seed: "randomized"

input_cfg:
- type: lhotse_shar
shar_path: ???
weight: 1.0
tags:
tokenizer_names: ["english_phoneme"]

optim:
_target_: torch.optim.AdamW
lr: 1e-4

sched:
name: ExponentialLR
gamma: 0.998

trainer:
num_nodes: 1
devices: -1
accelerator: gpu
strategy: ddp_find_unused_parameters_true
precision: bf16-mixed
max_steps: ???
accumulate_grad_batches: 1
enable_checkpointing: False # Provided by exp_manager
logger: false # Provided by exp_manager
log_every_n_steps: 100
limit_train_batches: 1_000
val_check_interval: 1_000
num_sanity_val_steps: 0
benchmark: false
use_distributed_sampler: false # required because Lhotse has its own handling
gradient_clip_val: 2.5

exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_wandb_logger: false
wandb_logger_kwargs:
entity: null
name: ${name}
project: null
group: null
resume: true
create_checkpoint_callback: true
checkpoint_callback_params:
monitor: val_loss
mode: min
save_top_k: 5
save_best_model: true
always_save_nemo: true
filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.4f}-{step}-{epoch}'
resume_if_exists: true
resume_ignore_no_checkpoint: true
3 changes: 3 additions & 0 deletions examples/tts/easy_magpietts.py
@@ -55,6 +55,9 @@ def main(cfg):
    else:
        raise NotImplementedError(f"Only train, onlinepo_train and test modes are supported. Got {mode}")

    if cfg.get("pretrained_model", None):
        model.restore_from_pretrained_checkpoint(cfg.pretrained_model)

    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)

    if mode in ['train', 'onlinepo_train']:
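
The guard added to `main` can be exercised in isolation. The sketch below uses a plain `dict` in place of the Hydra config and a hypothetical `StubModel` that only records calls. It does not reproduce what the real `restore_from_pretrained_checkpoint` or `maybe_init_from_pretrained_checkpoint` do, only the order in which they are invoked and when the restore is skipped.

```python
class StubModel:
    """Hypothetical stand-in for the real model; it only records method calls."""

    def __init__(self):
        self.calls = []

    def restore_from_pretrained_checkpoint(self, path):
        self.calls.append(("restore_from_pretrained_checkpoint", path))

    def maybe_init_from_pretrained_checkpoint(self, cfg):
        self.calls.append(("maybe_init_from_pretrained_checkpoint", None))


def apply_pretrained(model, cfg):
    """Mirror the call order of the hunk above, using a dict instead of a Hydra config."""
    # A missing key, None, or an empty string are all falsy, so the restore is skipped.
    if cfg.get("pretrained_model", None):
        model.restore_from_pretrained_checkpoint(cfg["pretrained_model"])
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)
    return model.calls


# The checkpoint path below is a made-up placeholder.
print(apply_pretrained(StubModel(), {"pretrained_model": "/path/to/checkpoint.ckpt"}))
print(apply_pretrained(StubModel(), {}))
```

Because the check relies on truthiness rather than key presence, setting `pretrained_model: null` in a config leaves the previous behaviour unchanged.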